$ kubectl get pods
NAME                            READY   STATUS             RESTARTS       AGE
test-centos7-7cc5dc6987-jz486   0/1     CrashLoopBackOff   8 (111s ago)   17m
Check the Pod details
$ kubectl describe pod test-centos7-7cc5dc6987-jz486
...
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  18m                  default-scheduler  Successfully assigned default/test-centos7-7cc5dc6987-jz486 to ops-kubernetes3
  Normal   Pulled     16m (x5 over 18m)    kubelet            Container image "centos:centos7.9.2009" already present on machine
  Normal   Created    16m (x5 over 18m)    kubelet            Created container centos7
  Normal   Started    16m (x5 over 18m)    kubelet            Started container centos7
  Warning  BackOff    3m3s (x71 over 18m)  kubelet            Back-off restarting failed container
During troubleshooting you can also use kubectl describe pod to check the Pod's exit status code. In Kubernetes, a Pod's ExitCode is the status code returned when a container exits; it indicates the result of the container's execution, so that Kubernetes and related tooling can act on it. Some common ExitCode values and their meanings are listed below:
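Independently of interpreting the value, the exit code can be read directly from the Pod status with a jsonpath query. This is a hedged sketch; the pod name is taken from the example above and the container index assumes a single-container Pod:

# exit code of the last terminated state of the first container (empty if it never terminated)
$ kubectl get pod test-centos7-7cc5dc6987-jz486 \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'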
$ kubectl get pods
NAME                     READY   STATUS   RESTARTS   AGE
front-7df8ccc4c7-xhp6s   0/1     Error    0          5h42m
Inspect the Pod's details
$ kubectl describe pod front-7df8ccc4c7-xhp6s
...
Status:   Failed
Reason:   Evicted
Message:  The node was low on resource: ephemeral-storage. Container php was using 394, which exceeds its request of 0.
...
The key information here is Status: Failed and Reason: Evicted, with the specific cause being The node was low on resource: ephemeral-storage.
Check the kubelet logs on the node and search for the keywords evict or disk; they also show that filesystem usage on the node exceeded the threshold.
$ journalctl -u kubelet | grep -i -e disk -e evict
image_gc_manager.go:310] "Disk usage on image filesystem is over the high threshold, trying to free bytes down to the low threshold" usage=85 highThreshold=85 amountToFree=5122092236 lowThreshold=80
eviction_manager.go:349] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
eviction_manager.go:338] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
Possible cause:
From the information above, the Pod failure was caused by The node was low on resource: ephemeral-storage: the node ran short of ephemeral storage, was tainted, and the Pods on it were evicted (Evicted).
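A hedged set of checks on the affected node (paths assume a default kubelet/Docker layout): ephemeral storage is backed by the filesystem that holds /var/lib/kubelet and the container runtime's data directory, so checking and freeing space there usually clears the pressure. Declaring ephemeral-storage requests/limits on the Pod also lets the scheduler avoid nodes without enough space.

# how full is the filesystem backing ephemeral storage
$ df -h /var/lib/kubelet /var/lib/docker
# reclaim space if the runtime is Docker (removes all unused images/containers/volumes; review first)
$ docker system prune -a --volumes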
$ kubectl get pods
NAME                    READY   STATUS     RESTARTS   AGE
admin-cbb479556-j9qg2   0/1     Init:0/1   0          3m37s
Check the Pod's detailed description
$ kubectl describe pod admin-cbb479556-j9qg2
Events:
  Type     Reason       Age    From               Message
  ----     ------       ----   ----               -------
  Normal   Scheduled    3m41s  default-scheduler  Successfully assigned admin-cbb479556-j9qg2 to k8s-work2
  Warning  FailedMount  99s    kubelet            Unable to attach or mount volumes: unmounted volumes=[logs], unattached volumes=[wwwroot kube-api-access-z8745 logs]: timed out waiting for the condition
  Warning  FailedMount  42s    kubelet            MountVolume.SetUp failed for volume "uat-nfs-pv" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 34.230.1.1:/data/NFSDataHome /var/lib/kubelet/pods/9d9a4807-706c-4369-b8be-b5727ee6aa8f/volumes/kubernetes.io~nfs/uat-nfs-pv
Output: mount.nfs: Connection timed out
According to the Events output, MountVolume.SetUp failed for volume "uat-nfs-pv" : mount failed: exit status 32 shows that mounting the volume failed. The output also includes the exact mount command and arguments used (mount -t nfs 34.230.1.1:/data/NFSDataHome /var/lib/kubelet/pods/9d9a4807-706c-4369-b8be-b5727ee6aa8f/volumes/kubernetes.io~nfs/uat-nfs-pv) and the error the command returned (mount.nfs: Connection timed out).
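Since the mount itself timed out, a reasonable next step is to verify NFS connectivity from the node. A hedged sketch; the server address comes from the event above and the mount point is an arbitrary test directory:

# check that the NFS export is reachable and exported to this node
$ showmount -e 34.230.1.1
# try a manual mount to reproduce the failure outside of kubelet
$ mkdir -p /mnt/nfs-test && mount -t nfs 34.230.1.1:/data/NFSDataHome /mnt/nfs-test
# if this also times out, check the network path to the NFS server (port 2049) and any firewall rules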
Events:
  Type     Reason                    Age                  From               Message
  ----     ------                    ----                 ----               -------
  Normal   Scheduled                 5m24s                default-scheduler  Successfully assigned prometheus/node-exporter-nxz9v to master
  Warning  FailedCreatePodContainer  1s (x26 over 5m24s)  kubelet            unable to ensure pod container exists: failed to create container for [kubepods besteffort pode526f19a-57d6-417c-ba5a-fb0f232d31c6] : dbus: connection closed by user
The error message reads unable to ensure pod container exists: failed to create container for [kubepods besteffort pode526f19a-57d6-417c-ba5a-fb0f232d31c6] : dbus: connection closed by user
The kubelet logs show the same error
$ journalctl -u kubelet
master kubelet[1160]: E0707 14:40:55.036424 1160 qos_container_manager_linux.go:328] "Failed to update QoS cgroup configuration" err="dbus: connection closed by user"
master kubelet[1160]: I0707 14:40:55.036455 1160 qos_container_manager_linux.go:138] "Failed to reserve QoS requests" err="dbus: connection closed by user"
master kubelet[1160]: E0707 14:41:00.263041 1160 qos_container_manager_linux.go:328] "Failed to update QoS cgroup configuration" err="dbus: connection closed by user"
master kubelet[1160]: E0707 14:41:00.263152 1160 pod_workers.go:190] "Error syncing pod, skipping" err="failed to ensure that the pod: 0cdaf660-bb6a-40ee-99ae-21dff3b55411 cgroups exist and are correctly applied: failed to create container for [kubepods besteffort pod0cdaf660-bb6a-40ee-99ae-21dff3b55411] : dbus: connection closed by user" pod="prometheus/node-exporter-rcd8x" podUID=0cdaf660-bb6a-40ee-99ae-21dff3b55411
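This error typically appears when kubelet's connection to the system D-Bus/systemd was dropped (for example after dbus or systemd was restarted or upgraded on the node) and kubelet keeps trying to use the dead connection. A hedged remediation sketch; restarting kubelet is usually enough, and checking dbus first confirms it is healthy (the service may be named dbus or dbus-broker depending on the distribution):

# on the affected node
$ systemctl status dbus
$ systemctl restart kubelet
# the FailedCreatePodContainer events should stop once kubelet reconnects to dbus
$ journalctl -u kubelet -f | grep -i dbus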
# kubectl describe pod -n cattle-system cattle-cluster-agent-7d766b5476-hsq45
...
  FailedCreatePodSandBox  82s (x4 over 85s)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "2d58156e838349a79da91e0a6d8bccdec0e62c5f5c9ca6a1c30af6186d6253b1" network for pod "cattle-cluster-agent-7d766b5476-hsq45": networkPlugin cni failed to set up pod "cattle-cluster-agent-7d766b5476-hsq45_cattle-system" network: failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24
The key information is failed to set bridge addr: "cni0" already has an IP address different from 10.244.2.1/24.
Checking the node's IP configuration shows that the flannel.1 subnet and the cni0 subnet do not match, most likely because flannel picked up an incorrect configuration; in this case the node recovered after a reboot. (A manual fix without rebooting is sketched after the output below.)
# ip add
4: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default
    link/ether b2:b1:12:2d:8c:66 brd ff:ff:ff:ff:ff:ff
    inet 10.244.2.0/32 scope global flannel.1
       valid_lft forever preferred_lft forever
    inet6 fe80::b0b1:12ff:fe2d:8c66/64 scope link
       valid_lft forever preferred_lft forever
5: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UP group default qlen 1000
    link/ether ca:88:b1:51:0f:02 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.1/24 brd 10.244.2.255 scope global cni0
       valid_lft forever preferred_lft forever
    inet6 fe80::c888:b1ff:fe51:f02/64 scope link
       valid_lft forever preferred_lft forever
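If rebooting is not an option, a common alternative (hedged; verify the subnet flannel expects on this node first, the file path assumes flannel defaults) is to delete the stale cni0 bridge and let the CNI plugin recreate it with the correct address:

# confirm the subnet flannel has been assigned on this node
$ cat /run/flannel/subnet.env
# remove the mis-addressed bridge; CNI recreates it on the next pod sandbox creation
$ ip link set cni0 down
$ ip link delete cni0
$ systemctl restart kubelet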
PodInitializing
A newly deployed Pod stays in the PodInitializing state
# kubectl get pods
ops   ops-admin-5656d7bb64-mpqz5   0/2   PodInitializing   0   2m55s
# kubectl logs -n kube-system kube-scheduler-fm-k8s-c1-master1 | tail -n 20
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Pod: failed to list *v1.Pod: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.PersistentVolumeClaim: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolumeClaim: failed to list *v1.PersistentVolumeClaim: Unauthorized
leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.StatefulSet: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StatefulSet: failed to list *v1.StatefulSet: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.ReplicaSet: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.ReplicaSet: failed to list *v1.ReplicaSet: Unauthorized
leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.CSINode: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.CSINode: failed to list *v1.CSINode: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.StorageClass: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.StorageClass: failed to list *v1.StorageClass: Unauthorized
leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Unauthorized
leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.Service: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Unauthorized
reflector.go:324] vendor/k8s.io/client-go/informers/factory.go:134: failed to list *v1.PersistentVolume: Unauthorized
reflector.go:138] vendor/k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.PersistentVolume: failed to list *v1.PersistentVolume: Unauthorized
leaderelection.go:330] error retrieving resource lock kube-system/kube-scheduler: Unauthorized
# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Dec 07, 2024 06:05 UTC   364d            ca                      no
apiserver                  Dec 07, 2024 07:17 UTC   364d            ca                      no
apiserver-etcd-client      Dec 07, 2024 07:15 UTC   364d            etcd-ca                 no
apiserver-kubelet-client   Dec 07, 2024 07:15 UTC   364d            ca                      no
controller-manager.conf    Dec 07, 2024 06:05 UTC   364d            ca                      no
etcd-healthcheck-client    Dec 07, 2024 07:15 UTC   364d            etcd-ca                 no
etcd-peer                  Dec 07, 2024 07:15 UTC   364d            etcd-ca                 no
etcd-server                Dec 07, 2024 07:15 UTC   364d            etcd-ca                 no
front-proxy-client         Dec 07, 2024 07:15 UTC   364d            front-proxy-ca          no
scheduler.conf             Dec 07, 2024 06:05 UTC   364d            ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Dec 03, 2032 09:50 UTC   8y              no
etcd-ca                 Dec 05, 2033 07:15 UTC   9y              no
front-proxy-ca          Dec 05, 2033 07:15 UTC   9y              no
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error". Reconnecting...
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-08T08:40:26Z is after 2023-12-06T09:58:58Z, verifying certificate SN=4790061324473323615, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-08T08:40:26Z is after 2023-12-06T09:58:58Z]"
Checking the etcd logs shows that a certificate file cannot be found: open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory
{"level":"warn","ts":"2023-12-08T07:54:23.780Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"172.31.21.3:30426","server-name":"","error":"open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory"}
{"level":"warn","ts":"2023-12-08T07:54:24.195Z","caller":"embed/config_logging.go:169","msg":"rejected connection","remote-addr":"172.31.19.164:28650","server-name":"","error":"open /etc/kubernetes/pki/etcd/peer.crt: no such file or directory"}
In stacked high-availability mode, the etcd component mounts the Master node's /etc/kubernetes/pki/etcd/ directory for its certificate files when it starts; the exact configuration can be found in the static Pod manifests under /etc/kubernetes/manifests/.
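To confirm which certificate files etcd expects, the certificate flags and hostPath mounts in the static Pod manifest can be inspected on the control-plane node. A hedged sketch; the paths are the kubeadm defaults:

# show the certificate-related flags and hostPath volumes of the etcd static Pod
$ grep -E 'cert|key|hostPath|path:' /etc/kubernetes/manifests/etcd.yaml
# verify that the files the manifest points to actually exist
$ ls -l /etc/kubernetes/pki/etcd/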
PLEG is not healthy: pleg was last seen active 10m13.755045415s ago
$ kubectl get nodes
NAME          STATUS     ROLES           AGE   VERSION
k8s-master1   Ready      control-plane   14d   v1.24.7
k8s-master2   Ready      control-plane   14d   v1.24.7
k8s-master3   Ready      control-plane   14d   v1.24.7
k8s-work1     NotReady   <none>          14d   v1.24.7
k8s-work2     Ready      <none>          14d   v1.24.7
Check the node details
$ kubectl describe node k8s-work1
...
Conditions:
  Ready   False   Tue, 15 Nov 2022 10:14:49 +0800   Tue, 15 Nov 2022 10:07:39 +0800   KubeletNotReady   PLEG is not healthy: pleg was last seen active 10m13.755045415s ago; threshold is 3m0s
Cause
A node that turns NotReady with this reason (PLEG is not healthy: pleg was last seen active ***h**m***s ago;) is usually overloaded.
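PLEG (the Pod Lifecycle Event Generator) depends on the container runtime answering relist calls in time, so these calls time out when the node or the runtime is overloaded. A hedged set of checks on the node; the runtime is assumed to be Docker, consistent with the rest of this article:

# overall load and memory pressure
$ uptime && free -m
# number of containers the runtime has to relist (a very large number slows PLEG down)
$ docker ps -a | wc -l
# is the container runtime itself responsive?
$ time docker info > /dev/null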
container runtime is down, container runtime not ready
Troubleshooting process:
While checking the distribution of Pods in the cluster, it was found that almost all Pods on one node had been rescheduled to other nodes, even though the node's status was already Ready at the time of the check. The following analyzes this situation.
Determine the approximate time window of the problem
The time when the Pods were started on the other nodes gives a rough idea of when the node went bad, which narrows the time range to examine. In this example the problem occurred around Nov 25 04:49:00.
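One way to read those start times is to sort Pods by .status.startTime; a hedged sketch:

# list pods with node placement, sorted by start time, to see when they were rescheduled
$ kubectl get pods -A -o wide --sort-by=.status.startTime
# or print the start time of a single pod explicitly
$ kubectl get pod <pod-name> -o jsonpath='{.status.startTime}'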
Check the kubelet logs
Using the inferred time window, check the kubelet logs on the problem node
$ journalctl -u kubelet --since "2022-11-25 4:40" | grep -v -e "failed to get fsstats" -e "invalid bearer token" | more
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.153132 17604 generic.go:205] "GenericPLEG: Unable to retrieve pods" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.375524 17604 remote_runtime.go:356] "ListPodSandbox with filter from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" filter="&PodSandboxFilter{Id:,State:&PodSandboxStateValue{State:SANDBOX_READY,},LabelSelector:map[string]string{},}"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.375559 17604 kuberuntime_sandbox.go:292] "Failed to list pod sandboxes" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.375578 17604 kubelet_pods.go:1153] "Error listing containers" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.375589 17604 kubelet.go:2162] "Failed cleaning pods" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.375603 17604 kubelet.go:2166] "Housekeeping took longer than 15s" err="housekeeping took too long" seconds=119.005290203
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.476011 17604 kubelet.go:2010] "Skipping pod synchronization" err="container runtime is down"
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.507861 17604 remote_runtime.go:680] "ExecSync cmd from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" containerID="5cd867ce2a52311e79a20a113c7cedd2a233b3a52b556065b479f2dd11a14eac" cmd=[wget --no-check-certificate --spider -q http://localhost:8088/health]
Nov 25 04:49:00 k8s-work2 kubelet[17604]: E1125 04:49:00.676271 17604 kubelet.go:2010] "Skipping pod synchronization" err="container runtime is down"
Nov 25 04:49:01 k8s-work2 kubelet[17604]: E1125 04:49:01.076918 17604 kubelet.go:2010] "Skipping pod synchronization" err="container runtime is down"
Nov 25 04:49:01 k8s-work2 kubelet[17604]: E1125 04:49:01.178942 17604 kubelet.go:2359] "Container runtime not ready" runtimeReady="RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: operation timeout: context deadline exceeded"
Nov 25 04:49:01 k8s-work2 kubelet[17604]: E1125 04:49:01.878007 17604 kubelet.go:2010] "Skipping pod synchronization" err="[container runtime is down, container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: operation timeout: context deadline exceeded]"
Nov 25 04:49:03 k8s-work2 kubelet[17604]: E1125 04:49:03.329558 17604 remote_runtime.go:536] "ListContainers with filter from runtime service failed" err="rpc error: code = Unknown desc = operation timeout: context deadline exceeded" filter="&ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},}"
Nov 25 04:49:03 k8s-work2 kubelet[17604]: E1125 04:49:03.329585 17604 container_log_manager.go:183] "Failed to rotate container logs" err="failed to list containers: rpc error: code = Unknown desc = operation timeout: context deadline exceeded"
Nov 25 04:49:09 k8s-work2 kubelet[17604]: E1125 04:49:09.485356 17604 remote_runtime.go:168] "Version from runtime service failed" err="rpc error: code = Unknown desc = failed to get docker version: operation timeout: context deadline exceeded"
Nov 25 04:49:09 k8s-work2 kubelet[17604]: I1125 04:49:09.485486 17604 setters.go:532] "Node became not ready" node="k8s-work2" condition={Type:Ready Status:False LastHeartbeatTime:2022-11-25 04:49:09.485445614 +0800 CST m=+227600.229789769 LastTransitionTime:2022-11-25 04:49:09.485445614 +0800 CST m=+227600.229789769 Reason:KubeletNotReady Message:[container runtime is down, container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: operation timeout: context deadline exceeded]}
The key entries in the logs above are:
"Skipping pod synchronization" err="container runtime is down"
setters.go:532] "Node became not ready", Reason:KubeletNotReady Message:[container runtime is down, container runtime not ready: RuntimeReady=false reason:DockerDaemonNotReady message:docker: failed to get docker version: operation timeout: context deadline exceeded]}
These entries show that the node became NotReady because container runtime is down, container runtime not ready; in this example the container runtime is docker.
Check the docker service logs
Using the timestamps from the logs above, check the docker service logs
journalctl -u docker --since "2022-11-25 04:0" | more
Nov 25 04:49:06 k8s-work2 dockerd[15611]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Nov 25 04:49:06 k8s-work2 dockerd[15611]: time="2022-11-25T04:49:06.410127201+08:00" level=error msg="Handler for GET /v1.40/containers/5cd867ce2a52311e79a20a113c7cedd2a233b3a52b556065b479f2dd11a14eac/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 25 04:49:06 k8s-work2 dockerd[15611]: time="2022-11-25T04:49:06.410342223+08:00" level=error msg="Handler for GET /v1.40/containers/41e0dfe97b87c2b8ae941653fa8adbf93bf9358d91e967646e4549ab71b2f004/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 25 04:49:06 k8s-work2 dockerd[15611]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Nov 25 04:49:06 k8s-work2 dockerd[15611]: time="2022-11-25T04:49:06.414773158+08:00" level=error msg="Handler for GET /v1.40/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 25 04:49:06 k8s-work2 dockerd[15611]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Nov 25 04:49:06 k8s-work2 dockerd[15611]: time="2022-11-25T04:49:06.416474238+08:00" level=error msg="Handler for GET /v1.40/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
Nov 25 04:49:06 k8s-work2 dockerd[15611]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.WriteJSON (httputils_write_json.go:11)
Nov 25 04:49:06 k8s-work2 dockerd[15611]: time="2022-11-25T04:49:06.422844592+08:00" level=error msg="Handler for GET /v1.40/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
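The broken pipe errors show that dockerd was still running but responding too slowly, so kubelet's requests over /var/run/docker.sock timed out and gave up. A hedged set of follow-up checks on the node (iostat requires the sysstat package):

# daemon state and recent restarts
$ systemctl status docker
# how long dockerd takes to answer a trivial API call; a hang here confirms the runtime stall
$ time docker info > /dev/null
# heavy disk I/O or an overloaded node often causes such stalls
$ iostat -x 1 5
$ dmesg -T | tail -n 50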
$ kubectl get nodes
The connection to the server kube-apiserver:6443 was refused - did you specify the right host or port?
Checking port 6443, on which the API Server listens, shows that the port is not listening.
Check the status of the API Server container
$ docker ps -a | grep api
81688b9cbe45   1f38c0b6a9d1   "kube-apiserver --ad…"   14 seconds ago   Exited (1) 13 seconds ago   k8s_kube-apiserver_kube-apiserver-k8s-uat-master1.kube-system_c8a87f4921623c7bff57f5662ea486cc_25
The container status is Exited; check the container logs
$ docker logs 81688b9cbe45
I1116 07:43:53.775588 1 server.go:558] external host was not specified, using 172.31.30.123
I1116 07:43:53.776035 1 server.go:158] Version: v1.24.7
I1116 07:43:53.776057 1 server.go:160] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
E1116 07:43:53.776298 1 run.go:74] "command failed" err="open /etc/kubernetes/pki/apiserver.crt: no such file or directory"
The log shows err="open /etc/kubernetes/pki/apiserver.crt: no such file or directory", so check the file /etc/kubernetes/pki/apiserver.crt
$ ls /etc/kubernetes/pki/apiserver.crt
ls: cannot access /etc/kubernetes/pki/apiserver.crt: No such file or directory
$ kubeadm init phase certs apiserver \
    --apiserver-advertise-address 10.150.0.21 \
    --apiserver-cert-extra-sans 10.96.0.1 \
    --apiserver-cert-extra-sans 34.150.1.1
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [k8s-master kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.150.0.21 34.150.1.1]
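After regenerating the certificate it is worth confirming the SANs and letting the static kube-apiserver Pod pick the file up; a hedged sketch:

# confirm the serving certificate now exists and lists the expected names and IPs
$ openssl x509 -noout -text -in /etc/kubernetes/pki/apiserver.crt | grep -A1 "Subject Alternative Name"
# kubelet restarts the static Pod on its own; restarting kubelet forces a prompt reload
$ systemctl restart kubelet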
context deadline exceeded
kube-apiserver fails to start; check the logs of the kube-apiserver container
# kubectl get nodes
The connection to the server kube-apiserver:6443 was refused - did you specify the right host or port?
# docker logs -f f39205f67e71
server.go:558] external host was not specified, using 172.31.29.250
server.go:158] Version: v1.24.7
server.go:160] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
shared_informer.go:255] Waiting for caches to sync for node_authorizer
plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
run.go:74] "command failed" err="context deadline exceeded"
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
# cd /etc/kubernetes/
# openssl x509 -text -in apiserver.crt
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 2154708302505735210 (0x1de70fc8f570742a)
        Signature Algorithm: sha256WithRSAEncryption
        Issuer: CN=kubernetes
        Validity
            Not Before: Nov 1 01:11:01 2022 GMT
            Not After : Nov 1 01:24:48 2023 GMT
        Subject: CN=kube-apiserver

# kubeadm certs check-expiration
[check-expiration] Reading configuration from the cluster...
[check-expiration] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[check-expiration] Error reading configuration from the Cluster. Falling back to default configuration

CERTIFICATE                EXPIRES                  RESIDUAL TIME   CERTIFICATE AUTHORITY   EXTERNALLY MANAGED
admin.conf                 Nov 01, 2023 01:24 UTC   <invalid>       ca                      no
apiserver                  Nov 01, 2023 01:24 UTC   <invalid>       ca                      no
apiserver-etcd-client      Nov 01, 2023 01:24 UTC   <invalid>       etcd-ca                 no
apiserver-kubelet-client   Nov 01, 2023 01:24 UTC   <invalid>       ca                      no
controller-manager.conf    Nov 01, 2023 01:24 UTC   <invalid>       ca                      no
etcd-healthcheck-client    Nov 01, 2023 01:24 UTC   <invalid>       etcd-ca                 no
etcd-peer                  Nov 01, 2023 01:24 UTC   <invalid>       etcd-ca                 no
etcd-server                Nov 01, 2023 01:24 UTC   <invalid>       etcd-ca                 no
front-proxy-client         Nov 01, 2023 01:24 UTC   <invalid>       front-proxy-ca          no
scheduler.conf             Nov 01, 2023 01:24 UTC   <invalid>       ca                      no

CERTIFICATE AUTHORITY   EXPIRES                  RESIDUAL TIME   EXTERNALLY MANAGED
ca                      Oct 29, 2032 01:11 UTC   8y              no
etcd-ca                 Oct 29, 2032 01:11 UTC   8y              no
front-proxy-ca          Oct 29, 2032 01:11 UTC   8y              no
# tar -cf /etc/kubernetes/kubernetes.20231115.tar /etc/kubernetes/kubernetes
# kubeadm certs renew all
[renew] Reading configuration from the cluster...
[renew] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[renew] Error reading configuration from the Cluster. Falling back to default configuration

certificate embedded in the kubeconfig file for the admin to use and for kubeadm itself renewed
certificate for serving the Kubernetes API renewed
certificate the apiserver uses to access etcd renewed
certificate for the API server to connect to kubelet renewed
certificate embedded in the kubeconfig file for the controller manager to use renewed
certificate for liveness probes to healthcheck etcd renewed
certificate for etcd nodes to communicate with each other renewed
certificate for serving etcd renewed
certificate for the front proxy client renewed
certificate embedded in the kubeconfig file for the scheduler manager to use renewed

Done renewing certificates. You must restart the kube-apiserver, kube-controller-manager, kube-scheduler and etcd, so that they can use the new certificates.

# systemctl restart kubelet
# kubectl get nodes
error: You must be logged in to the server (Unauthorized)
# export KUBECONFIG=/etc/kubernetes/admin.conf
# kubectl get nodes
NAME          STATUS     ROLES           AGE    VERSION
k8s-master1   Ready      control-plane   379d   v1.24.7
k8s-master2   Ready      control-plane   379d   v1.24.7
k8s-master3   Ready      control-plane   379d   v1.24.7
k8s-work1     NotReady   <none>          379d   v1.24.7
k8s-work2     Ready      <none>          379d   v1.24.7
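Because kubeadm certs renew all also rewrites admin.conf, any kubeconfig copied earlier (for example ~/.kube/config) still holds the old, now-rejected client certificate, which is why kubectl returned Unauthorized until KUBECONFIG was pointed at the fresh file. A hedged way to make the fix permanent:

# replace the stale kubeconfig with the renewed one
$ cp /etc/kubernetes/admin.conf ~/.kube/config
$ kubectl get nodes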
$ journalctl -f -u kubelet
Jul 13 14:18:43 k8s-node-5 kubelet[28572]: E0713 14:18:43.328660 28572 server.go:292] "Failed to run kubelet" err="failed to run Kubelet: misconfiguration: kubelet cgroup driver: \"systemd\" is different from docker cgroup driver: \"cgroupfs\""
# journalctl -f -u kubelet
kubelet[22771]: E0908 14:10:08.325316 22771 server.go:292] "Failed to run kubelet" err="failed to run Kubelet: misconfiguration: kubelet cgroup driver: \"cgroupfs\" is different from docker cgroup driver: \"systemd\""
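Kubelet refuses to start when its cgroup driver differs from the container runtime's, so the fix is to make both sides agree, normally on systemd. A hedged sketch for the Docker side (the path is the Docker default; merge with any existing keys and back up the file first):

# /etc/docker/daemon.json should contain, among any existing settings:
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
$ systemctl restart docker && systemctl restart kubelet
# verify the driver Docker now reports
$ docker info | grep -i cgroup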
# kubectl get componentstatuses
Warning: v1 ComponentStatus is deprecated in v1.19+
NAME                 STATUS    MESSAGE             ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health":"true"}
sa.key is used to sign new Service Account JWT tokens. It is normally generated and managed securely by the cluster administrator or by automation such as kubeadm.
sa.pub (not a TLS certificate, but the public half of the key pair) is used by the Kubernetes API server to verify the validity of those tokens; it works in tandem with the private key.
In Kubernetes, the Service Account keys are a public/private key pair used to sign and verify Service Account tokens. These tokens allow Pods to authenticate and be authorized against the Kubernetes API server under a specific Service Account identity.
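To confirm that sa.key and sa.pub are actually a matching pair (for example after restoring certificates), the public key derived from the private key can be compared against sa.pub. A hedged sketch using openssl, assuming the kubeadm default of an RSA key pair:

# derive the public key from sa.key and diff it against sa.pub; no output means they match
$ diff <(openssl rsa -in /etc/kubernetes/pki/sa.key -pubout 2>/dev/null) /etc/kubernetes/pki/sa.pub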
apiserver-kubelet-client
This certificate/key pair secures the API server's requests to the kubelet on each node in the cluster. The API server uses it when communicating with kubelets for operations such as starting Pods and fetching node status.
apiserver-kubelet-client.crt
apiserver-kubelet-client.key
etcd cluster certificates
etcd CA certificate
The etcd CA certificate is used to sign the etcd-related certificates, such as the etcd server certificate, etcd client certificates, and peer certificates.
etcd/ca.crt
etcd/ca.key
apiserver-etcd-client
This certificate/key pair is used for encrypted communication when the Kubernetes API server talks to the etcd database
apiserver-etcd-client.crt
apiserver-etcd-client.key
Found multiple CRI endpoints on the host
Regenerating the cluster certificates with kubeadm fails with an error; the steps are as follows
# sudo kubeadm init phase certs all
Found multiple CRI endpoints on the host. Please define which one do you wish to use by setting the 'criSocket' field in the kubeadm configuration file: unix:///var/run/containerd/containerd.sock, unix:///var/run/cri-dockerd.sock
To see the stack trace of this error execute with --v=5 or higher
# sudo kubeadm init phase certs all --config=./kubeadm-config.yaml
W1208 14:19:51.376721 13737 common.go:84] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta2". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
W1208 14:19:51.376972 13737 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
I1208 14:19:51.517814 13737 version.go:255] remote version is much newer: v1.28.4; falling back to: stable-1.24
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Using existing ca certificate authority
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [fm-k8s-c1-master1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.31.26.116]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [fm-k8s-c1-master1 localhost] and IPs [172.31.26.116 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [fm-k8s-c1-master1 localhost] and IPs [172.31.26.116 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Using the existing "sa" key
kubeadm fails to renew the apiserver certificate
Renewing the kube-apiserver certificate with the following command reports an error
# kubeadm init phase certs apiserver --apiserver-advertise-address --apiserver-cert-extra-sans kubernetes --apiserver-cert-extra-sans kubernetes.default --apiserver-cert-extra-sans kubernetes.default.svc --config=/home/username/kubeadm-config.yaml
can not mix '--config' with arguments [apiserver-advertise-address apiserver-cert-extra-sans]
To see the stack trace of this error execute with --v=5 or higher
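kubeadm does not allow mixing --config with per-flag overrides: either pass everything on the command line without --config, or move the advertise address and extra SANs into the kubeadm ClusterConfiguration (apiServer.certSANs, localAPIEndpoint.advertiseAddress) and pass only --config. A hedged sketch of the flag-only variant; the address is a placeholder and the existing cert/key are moved aside so kubeadm regenerates them instead of reusing them:

# back up the current cert/key so kubeadm generates new ones
$ mv /etc/kubernetes/pki/apiserver.crt /etc/kubernetes/pki/apiserver.crt.bak
$ mv /etc/kubernetes/pki/apiserver.key /etc/kubernetes/pki/apiserver.key.bak
$ kubeadm init phase certs apiserver \
    --apiserver-advertise-address <apiserver-ip> \
    --apiserver-cert-extra-sans kubernetes \
    --apiserver-cert-extra-sans kubernetes.default \
    --apiserver-cert-extra-sans kubernetes.default.svc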
# docker ps -a
CONTAINER ID   IMAGE          COMMAND                  CREATED              STATUS                          PORTS   NAMES
cf4bf9f2c4a3   9efa6dff568f   "kube-scheduler --au…"   About a minute ago   Exited (1) About a minute ago           k8s_kube-scheduler_kube-scheduler-k8s-master3_kube-system_a3a06a8f4bb3d9a7a753421061337314_808
27e06d290dbb   9e2bfc195de6   "kube-controller-man…"   2 minutes ago        Exited (1) 2 minutes ago                k8s_kube-controller-manager_kube-controller-manager-k8s-master3_kube-system_1d62164acfdda6946d09aa8255b4b191_808
b0aa1a2e24ee   c7cbaca6e63b   "kube-apiserver --ad…"   3 minutes ago        Exited (1) 3 minutes ago                k8s_kube-apiserver_kube-apiserver-k8s-master3_kube-system_fd413ffd28d3bcce4b1330c38307ebe2_791
# docker logs b0aa1a2e24ee
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error". Reconnecting...
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-08T07:21:21Z is after 2023-12-06T09:58:58Z, verifying certificate SN=1505375741374655454, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-08T07:21:21Z is after 2023-12-06T09:58:58Z]"
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: internal error". Reconnecting...
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-08T07:21:22Z is after 2023-12-06T09:58:58Z, verifying certificate SN=1505375741374655454, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-08T07:21:22Z is after 2023-12-06T09:58:58Z]"
# docker logs etcd
{"logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"651e0623614c0f76 is starting a new election at term 13"}
{"logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"651e0623614c0f76 became pre-candidate at term 13"}
{"logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"651e0623614c0f76 received MsgPreVoteResp from 651e0623614c0f76 at term 13"}
{"logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"651e0623614c0f76 [logterm: 13, index: 216516485] sent MsgPreVote request to 71e91a3cb0d95be8 at term 13"}
{"logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"651e0623614c0f76 [logterm: 13, index: 216516485] sent MsgPreVote request to d0ca64fcbfb25318 at term 13"}
{"caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"d0ca64fcbfb25318","rtt":"0s","error":"x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
{"caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"d0ca64fcbfb25318","rtt":"0s","error":"x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
{"caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_RAFT_MESSAGE","remote-peer-id":"71e91a3cb0d95be8","rtt":"0s","error":"x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
{"caller":"rafthttp/probing_status.go:68","msg":"prober detected unhealthy status","round-tripper-name":"ROUND_TRIPPER_SNAPSHOT","remote-peer-id":"71e91a3cb0d95be8","rtt":"0s","error":"x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\")"}
Checking the etcd container logs shows a certificate verification error: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"etcd-ca\"). This points to a problem with the certificates used by the etcd cluster: the etcd members fail to verify each other's certificates, which typically happens when the nodes use different CA certificates or are misconfigured.
Normally, in an etcd cluster with --client-cert-auth=true and --peer-client-cert-auth=true enabled, the etcd peers communicate over HTTPS and verify each other's client certificates, which requires the certificates to be signed by the same CA. Based on the log messages, the problem should lie with the certificates inside the etcd cluster.
Following that lead, first compare the etcd CA certificates across the nodes; it turns out the CA certificates on the 3 etcd nodes are different.
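A quick way to compare the CA certificates across members is to compare their fingerprints. A hedged sketch; run it on each etcd/control-plane node (the path is the kubeadm default) and the output must be identical everywhere:

# fingerprint of the etcd CA on this node
$ openssl x509 -noout -fingerprint -sha256 -in /etc/kubernetes/pki/etcd/ca.crt
# or simply hash the file
$ md5sum /etc/kubernetes/pki/etcd/ca.crt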
# kubectl logs -n kube-system kube-apiserver-k8s-master3
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate". Reconnecting...
Run the following command on the master3 node to regenerate the client certificate /etc/kubernetes/pki/apiserver-etcd-client.crt that kube-apiserver uses when connecting to etcd. The existing /etc/kubernetes/pki/apiserver-etcd-client.crt and /etc/kubernetes/pki/apiserver-etcd-client.key must be deleted first, otherwise the renewal fails with: error execution phase certs/apiserver-etcd-client: [certs] certificate apiserver-etcd-client not signed by CA certificate etcd/ca: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "etcd-ca")
# kubeadm init phase certs apiserver-etcd-client --config=/root/kubeadm-config.yaml
W1211 15:06:57.911002 1401 common.go:84] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta2". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
W1211 15:06:57.911751 1401 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
I1211 15:06:58.238799 1401 version.go:255] remote version is much newer: v1.28.4; falling back to: stable-1.24
[certs] Generating "apiserver-etcd-client" certificate and key
# kubectl logs -n kube-system kube-controller-manager-k8s-master3
I1211 07:04:37.643782 1 serving.go:348] Generated self-signed cert in-memory
failed to create listener: failed to listen on 0.0.0.0:10257: listen tcp 0.0.0.0:10257: bind: address already in use

# kubectl logs -n kube-system kube-scheduler-k8s-master3
I1211 07:10:21.304329 1 serving.go:348] Generated self-signed cert in-memory
E1211 07:10:21.304587 1 run.go:74] "command failed" err="failed to create listener: failed to listen on 0.0.0.0:10259: listen tcp 0.0.0.0:10259: bind: address already in use"
After renewing the certificates, kube-apiserver reports certificate-expired errors
After the cluster certificates were renewed, kube-apiserver is unhealthy; checking its logs shows certificate-expired errors
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z, verifying certificate SN=2750116196247444292, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z]"
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z, verifying certificate SN=2750116196247444292, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z]"
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z, verifying certificate SN=2750116196247444292, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z]"
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z, verifying certificate SN=2750116196247444292, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z]"
authentication.go:63] "Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z, verifying certificate SN=2750116196247444292, SKID=, AKID=08:39:2B:D0:14:00:F4:7F:3F:58:26:36:32:BA:F8:0E:0E:B4:D4:83 failed: x509: certificate has expired or is not yet valid: current time 2023-12-11T07:35:22Z is after 2023-12-06T09:50:35Z]"
authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
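These entries mean the API server itself is up but is still receiving requests from clients that present the old, expired certificate or invalidated service-account tokens, for example kubeconfig files that were not refreshed or components that have not been restarted since the renewal. A hedged set of follow-up checks; the kubelet certificate path assumes the default client-certificate rotation setup:

# list the expiry of every certificate kubeadm manages
$ kubeadm certs check-expiration
# check credentials outside kubeadm's scope, e.g. the kubelet client certificate
$ openssl x509 -noout -enddate -in /var/lib/kubelet/pki/kubelet-client-current.pem
# restart kubelet (and thereby the static control-plane Pods) so renewed credentials are reloaded
$ systemctl restart kubelet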