Environment
CentOS 7, kernel 5.4.221-1.el7
Docker Engine - Community 20.10.9
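The log analysis in this document relies mainly on the logs recorded by the `journald` service, so the first step is to configure `journald` to store its logs persistently.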
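As a rough way to check whether `journald` is already persisting logs, the sketch below reads the `Storage=` setting from the text of `journald.conf`. It assumes the documented systemd behaviour: `Storage=persistent` writes to `/var/log/journal`, while the default `auto` only persists when that directory already exists. The function names are hypothetical, and drop-in files under `journald.conf.d` are deliberately ignored.

```python
def journald_storage_mode(conf_text: str) -> str:
    """Return the Storage= value found in journald.conf text, or 'auto' if unset."""
    mode = "auto"  # journald's documented default when Storage= is not set
    in_journal_section = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("["):
            in_journal_section = (line == "[Journal]")
            continue
        if in_journal_section and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "Storage":
                mode = value.strip()  # a later assignment overrides an earlier one
    return mode


def logs_persist(mode: str, journal_dir_exists: bool) -> bool:
    """Decide whether logs survive a reboot for the given Storage= mode."""
    if mode == "persistent":
        return True  # journald creates /var/log/journal itself if needed
    if mode == "auto":
        return journal_dir_exists  # persists only if /var/log/journal exists
    return False  # 'volatile' and 'none' never persist
```

On a node, one would typically pass in the contents of `/etc/systemd/journald.conf` together with `os.path.isdir("/var/log/journal")`. If the result is `False`, setting `Storage=persistent` under `[Journal]` and restarting `systemd-journald` is the usual fix.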
Scenario analysis
k8s worker nodes frequently hang (crash)
Environment
CentOS 7, kernel 5.4.221-1.el7
Docker Engine - Community 20.10.9
Kubernetes v1.24.7
The k8s nodes frequently become unresponsive (hang) and only recover after a reboot. After a node came back up, the system messages log was checked.
/var/log/messages

    Feb 10 12:21:40 k8s-work1 kernel: INFO: task dockerd:1443 blocked for more than 368 seconds.
    Feb 10 12:21:40 k8s-work1 kernel: Tainted: G E 5.4.221-1.el7.elrepo.x86_64 #1
    Feb 10 12:21:40 k8s-work1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Feb 10 12:21:40 k8s-work1 kernel: dockerd D 0 1443 1 0x00004080
    Feb 10 12:21:40 k8s-work1 kernel: Call Trace:
    Feb 10 12:21:40 k8s-work1 kernel: __schedule+0x2d2/0x730
    Feb 10 12:21:40 k8s-work1 kernel: schedule+0x42/0xb0
    Feb 10 12:21:40 k8s-work1 kernel: wb_wait_for_completion+0x56/0x90
    Feb 10 12:21:40 k8s-work1 kernel: ? finish_wait+0x80/0x80
    Feb 10 12:21:40 k8s-work1 kernel: sync_inodes_sb+0xd4/0x2c0
    Feb 10 12:21:40 k8s-work1 kernel: ? __filemap_fdatawrite_range+0xf1/0x110
    Feb 10 12:21:40 k8s-work1 kernel: sync_filesystem+0x5f/0xa0
    Feb 10 12:21:40 k8s-work1 kernel: ovl_sync_fs+0x39/0x60 [overlay]
    Feb 10 12:21:40 k8s-work1 kernel: sync_filesystem+0x79/0xa0
    Feb 10 12:21:40 k8s-work1 kernel: generic_shutdown_super+0x27/0x110
    Feb 10 12:21:40 k8s-work1 kernel: kill_anon_super+0x18/0x30
    Feb 10 12:21:40 k8s-work1 kernel: deactivate_locked_super+0x3b/0x80
    Feb 10 12:21:40 k8s-work1 kernel: deactivate_super+0x3e/0x50
    Feb 10 12:21:40 k8s-work1 kernel: cleanup_mnt+0x109/0x160
    Feb 10 12:21:40 k8s-work1 kernel: __cleanup_mnt+0x12/0x20
    Feb 10 12:21:40 k8s-work1 kernel: task_work_run+0x8f/0xb0
    Feb 10 12:21:40 k8s-work1 kernel: exit_to_usermode_loop+0x10c/0x130
    Feb 10 12:21:40 k8s-work1 kernel: do_syscall_64+0x170/0x1b0
    Feb 10 12:21:40 k8s-work1 kernel: entry_SYSCALL_64_after_hwframe+0x5c/0xc1
    Feb 10 12:21:40 k8s-work1 kernel: RIP: 0033:0x55cf9a53e13b
    Feb 10 12:21:40 k8s-work1 kernel: Code: Bad RIP value.
    Feb 10 12:21:40 k8s-work1 kernel: RSP: 002b:000000c252fea778 EFLAGS: 00000212 ORIG_RAX: 00000000000000a6
    Feb 10 12:21:40 k8s-work1 kernel: RAX: 0000000000000000 RBX: 000000c000070800 RCX: 000055cf9a53e13b
    Feb 10 12:21:40 k8s-work1 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c2d000c3f0
    Feb 10 12:21:40 k8s-work1 kernel: RBP: 000000c252fea7d0 R08: 0000000000000000 R09: 0000000000000000
    Feb 10 12:21:40 k8s-work1 kernel: R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000000
    Feb 10 12:21:40 k8s-work1 kernel: R13: 0000000000000001 R14: 000000000000000a R15: ffffffffffffffff
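The log shows that the dockerd process is in the `D` (uninterruptible sleep) state, which usually means it is waiting on an I/O operation. The call trace shows a sync operation on the overlay filesystem (`ovl_sync_fs`, reached from `cleanup_mnt` while a mount is being torn down). A first guess is that some operation on the overlay filesystem did not complete in time and left the dockerd process blocked.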
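To make this kind of reading repeatable across many reports, here is a minimal sketch that pulls hung-task reports out of log lines shaped like the ones above. The function name and the returned dictionary layout are hypothetical. The regular expressions assume the `kernel:` prefix and the `symbol+0xoff/0xsize [module]` frame format seen in this log. Frames marked with `?` are skipped, since the kernel marks them as unreliable.

```python
import re

# Header line emitted by the kernel's hung-task detector.
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
# One stack frame: optional "? " marker, symbol+offset/size, optional [module].
FRAME_RE = re.compile(
    r"kernel:\s+(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def parse_hung_tasks(lines):
    """Return one dict per hung-task report with its task, pid, wait time and frames."""
    reports, current, in_trace = [], None, False
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header.group(1),
                "pid": int(header.group(2)),
                "seconds": int(header.group(3)),
                "frames": [],      # reliable frames, innermost first
                "modules": set(),  # kernel modules named in the trace
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue  # ignore everything before the first report
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                in_trace = False  # e.g. the "RIP:" line ends the trace
                continue
            unreliable, symbol, module = frame.groups()
            if not unreliable:
                current["frames"].append(symbol)
            if module:
                current["modules"].add(module)
    return reports
```

Feeding it the lines of `/var/log/messages` and checking whether `"overlay"` appears in `modules`, or `ovl_sync_fs` in `frames`, gives a quick way to see how many hangs involve the overlay filesystem.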
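Checking the log further shows that the system's connections to the NFS server were timing out, which suggests that the hang may be caused by an NFS fault.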
/var/log/messages

    Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
    Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, still trying
    Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
    Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, still trying
    Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
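To gauge how often the NFS server stops answering, the messages can be tallied per server. The sketch below is one possible approach. The function name is hypothetical, and the pattern only covers the two message variants that appear in this log.

```python
import re
from collections import Counter

# Matches the two kernel NFS client messages seen in this log.
NFS_RE = re.compile(r"nfs: server (\S+) not responding, (timed out|still trying)")


def count_nfs_timeouts(lines):
    """Count 'not responding' messages per (server, state) pair."""
    counts = Counter()
    for line in lines:
        # findall also copes with several entries flattened onto one line
        for server, state in NFS_RE.findall(line):
            counts[(server, state)] += 1
    return counts
```

Comparing the timestamps of these messages with those of the hung-task reports can help confirm or rule out NFS as the trigger.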