Linux System Crash Analysis

Environment

  • CentOS 7 5.4.221-1.el7
  • Docker Engine - Community 20.10.9

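The log analysis in this document relies mainly on logs recorded by the journald service, so the first step is to configure `journald` to store its logs persistently; otherwise the entries leading up to a crash are lost on reboot.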
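Persistence is normally controlled by the `Storage=` option in the `[Journal]` section of `/etc/systemd/journald.conf`: `persistent` always writes to `/var/log/journal`, while the default `auto` only does so if that directory already exists. After changing the option, `systemd-journald` has to be restarted. The following is a minimal sketch for checking the current setting, not part of the original procedure; the function names `journald_storage_mode` and `is_persistent` are hypothetical helpers introduced for illustration.

```python
import configparser


def journald_storage_mode(conf_text: str) -> str:
    """Return the Storage= value from journald.conf text (hypothetical helper).

    Falls back to "auto", the documented default, when the option is
    missing or commented out.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser.get("Journal", "Storage", fallback="auto").strip().lower()


def is_persistent(mode: str, journal_dir_exists: bool) -> bool:
    """Decide whether logs survive a reboot (hypothetical helper).

    "persistent" always stores on disk; "auto" stores on disk only when
    /var/log/journal already exists.
    """
    return mode == "persistent" or (mode == "auto" and journal_dir_exists)


if __name__ == "__main__":
    import os

    # Path is the usual location on CentOS 7; adjust if your system differs.
    with open("/etc/systemd/journald.conf", encoding="utf-8") as fh:
        mode = journald_storage_mode(fh.read())
    print(mode, is_persistent(mode, os.path.isdir("/var/log/journal")))
```

If the check reports that logs are not persistent, setting `Storage=persistent` and restarting `systemd-journald` is the usual fix.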

Scenario Analysis

k8s worker node frequently hangs (crashes)

Environment

  • CentOS 7 5.4.221-1.el7
  • Docker Engine - Community 20.10.9
  • Kubernetes v1.24.7

The k8s node frequently becomes unresponsive (hangs) and only recovers after a reboot. Once the node is back up, check the system messages log.

/var/log/messages
Feb 10 12:21:40 k8s-work1 kernel: INFO: task dockerd:1443 blocked for more than 368 seconds.
Feb 10 12:21:40 k8s-work1 kernel: Tainted: G E 5.4.221-1.el7.elrepo.x86_64 #1
Feb 10 12:21:40 k8s-work1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 10 12:21:40 k8s-work1 kernel: dockerd D 0 1443 1 0x00004080
Feb 10 12:21:40 k8s-work1 kernel: Call Trace:
Feb 10 12:21:40 k8s-work1 kernel: __schedule+0x2d2/0x730
Feb 10 12:21:40 k8s-work1 kernel: schedule+0x42/0xb0
Feb 10 12:21:40 k8s-work1 kernel: wb_wait_for_completion+0x56/0x90
Feb 10 12:21:40 k8s-work1 kernel: ? finish_wait+0x80/0x80
Feb 10 12:21:40 k8s-work1 kernel: sync_inodes_sb+0xd4/0x2c0
Feb 10 12:21:40 k8s-work1 kernel: ? __filemap_fdatawrite_range+0xf1/0x110
Feb 10 12:21:40 k8s-work1 kernel: sync_filesystem+0x5f/0xa0
Feb 10 12:21:40 k8s-work1 kernel: ovl_sync_fs+0x39/0x60 [overlay]
Feb 10 12:21:40 k8s-work1 kernel: sync_filesystem+0x79/0xa0
Feb 10 12:21:40 k8s-work1 kernel: generic_shutdown_super+0x27/0x110
Feb 10 12:21:40 k8s-work1 kernel: kill_anon_super+0x18/0x30
Feb 10 12:21:40 k8s-work1 kernel: deactivate_locked_super+0x3b/0x80
Feb 10 12:21:40 k8s-work1 kernel: deactivate_super+0x3e/0x50
Feb 10 12:21:40 k8s-work1 kernel: cleanup_mnt+0x109/0x160
Feb 10 12:21:40 k8s-work1 kernel: __cleanup_mnt+0x12/0x20
Feb 10 12:21:40 k8s-work1 kernel: task_work_run+0x8f/0xb0
Feb 10 12:21:40 k8s-work1 kernel: exit_to_usermode_loop+0x10c/0x130
Feb 10 12:21:40 k8s-work1 kernel: do_syscall_64+0x170/0x1b0
Feb 10 12:21:40 k8s-work1 kernel: entry_SYSCALL_64_after_hwframe+0x5c/0xc1
Feb 10 12:21:40 k8s-work1 kernel: RIP: 0033:0x55cf9a53e13b
Feb 10 12:21:40 k8s-work1 kernel: Code: Bad RIP value.
Feb 10 12:21:40 k8s-work1 kernel: RSP: 002b:000000c252fea778 EFLAGS: 00000212 ORIG_RAX: 00000000000000a6
Feb 10 12:21:40 k8s-work1 kernel: RAX: 0000000000000000 RBX: 000000c000070800 RCX: 000055cf9a53e13b
Feb 10 12:21:40 k8s-work1 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 000000c2d000c3f0
Feb 10 12:21:40 k8s-work1 kernel: RBP: 000000c252fea7d0 R08: 0000000000000000 R09: 0000000000000000
Feb 10 12:21:40 k8s-work1 kernel: R10: 0000000000000000 R11: 0000000000000212 R12: 0000000000000000
Feb 10 12:21:40 k8s-work1 kernel: R13: 0000000000000001 R14: 000000000000000a R15: ffffffffffffffff

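The log shows that the dockerd process is in the `D` state (uninterruptible sleep), which indicates that it is waiting for an I/O operation. The call trace shows a sync of the overlay filesystem (`ovl_sync_fs` calling `sync_filesystem` and `sync_inodes_sb`) on the unmount path (`cleanup_mnt`, `deactivate_super`, `generic_shutdown_super`), and `ORIG_RAX` is `0xa6`, the `umount2` system call on x86_64. A first guess is that some operation on the overlay filesystem did not complete in time and left dockerd blocked.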
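To see how often this happens and which processes are affected, the hung-task warnings can be pulled out of the log. The following is a minimal sketch, assuming the message format shown above; the function name `find_hung_tasks` is a hypothetical helper introduced for illustration.

```python
import re

# Matches the kernel hung-task warning, for example:
#   "INFO: task dockerd:1443 blocked for more than 368 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return (command, pid, seconds) for every hung-task warning found."""
    found = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            found.append(
                (match["comm"], int(match["pid"]), int(match["secs"]))
            )
    return found


if __name__ == "__main__":
    with open("/var/log/messages", encoding="utf-8", errors="replace") as fh:
        for comm, pid, secs in find_hung_tasks(fh):
            print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

Run against the excerpt above, this would report dockerd with PID 1443 and a 368-second threshold.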

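Further down in the log, the system repeatedly times out while connecting to the NFS server, so the hang is suspected to be caused by an NFS failure.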

/var/log/messages
Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, still trying
Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, still trying
Feb 10 12:21:40 k8s-work1 kernel: nfs: server 172.31.88.9 not responding, timed out
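To judge whether the NFS timeouts line up with the hangs, it helps to count the messages per server. The following is a minimal sketch, assuming the kernel message format shown above; the function name `count_nfs_timeouts` is a hypothetical helper introduced for illustration.

```python
import re
from collections import Counter

# Matches kernel NFS client warnings, for example:
#   "nfs: server 172.31.88.9 not responding, timed out"
NFS_TIMEOUT = re.compile(
    r"nfs: server (?P<server>\S+) not responding, "
    r"(?P<state>timed out|still trying)"
)


def count_nfs_timeouts(lines):
    """Count "not responding" messages per (server, state) pair."""
    counts = Counter()
    for line in lines:
        match = NFS_TIMEOUT.search(line)
        if match:
            counts[(match["server"], match["state"])] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/messages", encoding="utf-8", errors="replace") as fh:
        for (server, state), n in sorted(count_nfs_timeouts(fh).items()):
            print(f"{server}: {state} x{n}")
```

A large number of messages for a single server around the time of a hang would support the NFS hypothesis, though it does not prove it on its own.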