Kubernetes Metrics Server collects resource usage metrics from the kubelets and exposes them to the Kubernetes API Server through the Metrics API, where they are consumed by the HPA (Horizontal Pod Autoscaler) and the VPA (Vertical Pod Autoscaler). kubectl top also uses the Metrics API. [1]
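Once the Metrics Server is healthy, the data behind kubectl top can also be read directly from the aggregated Metrics API. A quick sanity check looks like this (a sketch; the jq filter is only for readability and assumes jq is installed in your shell):

$ kubectl top node
$ kubectl top pod -n kube-system
# raw NodeMetrics objects served through the aggregated metrics.k8s.io API
$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes" | jq '.items[].usage'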
Installing Kubernetes Metrics Server
Before installing Kubernetes Metrics Server, the aggregation layer must be enabled on the kube-apiserver, together with authentication and authorization. [3]
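As a hedged sketch of what that prerequisite looks like: on a kubeadm cluster the aggregation-layer flags are normally already present in the kube-apiserver static Pod manifest (the paths below are kubeadm defaults and may differ in other setups):

$ grep -E 'requestheader|proxy-client' /etc/kubernetes/manifests/kube-apiserver.yaml
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key

If these flags are missing, the aggregated metrics.k8s.io API cannot work no matter how healthy the Metrics Server Pod itself is.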
$ kubectl get pods -A | grep metrics
kube-system   metrics-server-5cdf47479d-rwtd6   0/1   Running   0   5m53s
Check the Pod logs:
$ kubectl logs -n kube-system metrics-server-5cdf47479d-rwtd6
1 scraper.go:140] "Failed to scrape node" err="Get \"https://172.31.21.3:10250/metrics/resource\": x509: cannot validate certificate for 172.31.21.3 because it doesn't contain any IP SANs" node="k8smaster3"
1 scraper.go:140] "Failed to scrape node" err="Get \"https://172.31.26.116:10250/metrics/resource\": x509: cannot validate certificate for 172.31.26.116 because it doesn't contain any IP SANs" node="k8smaster1"
1 scraper.go:140] "Failed to scrape node" err="Get \"https://172.31.19.164:10250/metrics/resource\": x509: cannot validate certificate for 172.31.19.164 because it doesn't contain any IP SANs" node="k8smaster2"
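The referenced deployment steps are not reproduced here, but this class of x509 error is usually handled in one of two ways (both are sketches, not necessarily what was done in this cluster): regenerate the kubelet serving certificates so they contain the node IPs as SANs, or change how the Metrics Server scrapes the kubelet:

# Option A (test/lab clusters): skip verification of the kubelet serving certificate
$ kubectl -n kube-system patch deployment metrics-server --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'

# Option B: scrape nodes by hostname instead of by InternalIP
$ kubectl -n kube-system patch deployment metrics-server --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-preferred-address-types=Hostname"}]'

Scraping by hostname only works if the node names are resolvable from inside the Pod, which is exactly what goes wrong next.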
After deploying by following the reference deployment steps, the Metrics Server Pod remained in a not-ready state. Check the Pod logs:
$ kubectl logs -n kube-system metrics-server-5cdf47479d-rwtd6
"Failed to scrape node" err="Get \"https://k8smaster1:10250/metrics/resource\": dial tcp: lookup k8smaster1 on 10.96.0.10:53: no such host" node="k8smaster1"
"Failed probe" probe="metric-storage-ready" err="no metrics to serve"
"Failed to scrape node" err="Get \"https://k8smaster1:10250/metrics/resource\": dial tcp: lookup k8smaster1 on 10.96.0.10:53: no such host" node="k8smaster1"
"Failed to scrape node" err="Get \"https://k8smaster3:10250/metrics/resource\": dial tcp: lookup k8smaster3 on 10.96.0.10:53: no such host" node="k8smaster3"
"Failed to scrape node" err="Get \"https://k8sworker1:10250/metrics/resource\": dial tcp: lookup k8sworker1 on 10.96.0.10:53: no such host" node="k8sworker1"
"Failed to scrape node" err="Get \"https://k8sworker2:10250/metrics/resource\": dial tcp: lookup k8sworker2 on 10.96.0.10:53: no such host" node="k8sworker2"
"Failed to scrape node" err="Get \"https://k8smaster2:10250/metrics/resource\": dial tcp: lookup k8smaster2 on 10.96.0.10:53: no such host" node="k8smaster2"
"Failed probe" probe="metric-storage-ready" err="no metrics to serve"
I0524 09:32:43.033797 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
The logs point to a hostname resolution problem: the node hostnames cannot be resolved by the cluster DNS (10.96.0.10). This can be worked around by manually adding the corresponding host entries to the Metrics Server Pod.
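One way to add those entries is hostAliases on the Metrics Server Deployment, which writes them into the Pod's /etc/hosts. A sketch using the master IPs taken from the earlier log (the worker nodes have to be added the same way, and the list must cover every node in the cluster):

$ kubectl -n kube-system patch deployment metrics-server --patch '
spec:
  template:
    spec:
      hostAliases:
      - ip: "172.31.26.116"
        hostnames: ["k8smaster1"]
      - ip: "172.31.19.164"
        hostnames: ["k8smaster2"]
      - ip: "172.31.21.3"
        hostnames: ["k8smaster3"]'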
After redeploying, log in to a centos7 container and install the tools needed for testing. Based on the log messages, first test whether the Pod can connect to the kubelet on the nodes:
$ curl -v 172.31.16.124:10250
* About to connect() to 172.31.16.124 port 10250 (#0)
*   Trying 172.31.16.124...
* No route to host
* Failed connect to 172.31.16.124:10250; No route to host
* Closing connection 0
curl: (7) Failed connect to 172.31.16.124:10250; No route to host

$ curl -v 172.31.22.159:10250
* About to connect() to 172.31.22.159 port 10250 (#0)
*   Trying 172.31.22.159...
* Connected to 172.31.22.159 (172.31.22.159) port 10250 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.31.22.159:10250
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 400 Bad Request
<
Client sent an HTTP request to an HTTPS server.
* Closing connection 0
From the tests above, the Metrics Server Pod cannot connect to the kubelet (172.31.16.124:10250) on its own node, k8s-worker1 (172.31.16.124), while it can connect to the kubelets on the other nodes. This narrows down the cause of the problem.
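The 400 response from 172.31.22.159 is actually the expected answer from a healthy kubelet: port 10250 serves HTTPS only, so a plain-HTTP curl proves TCP reachability and nothing more. A slightly closer probe of what the Metrics Server actually requests would look like this (a sketch; it assumes curl is available in the container and that the mounted ServiceAccount token is authorized to read kubelet metrics, which it may not be):

$ TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
$ curl -sk -H "Authorization: Bearer ${TOKEN}" https://172.31.22.159:10250/metrics/resource | head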
Given that the Metrics Server Pod only fails to reach the kubelet of its own node and can reach the kubelets of the other nodes, walking through the network path shows the following: when the Pod accesses a kubelet on another node, its packets are SNATed to the host node's egress IP before they leave the node, so the source IP seen by the remote kubelet is the host node's IP. When the Pod accesses the kubelet on its own node, however, the source IP remains the Pod's IP and the destination IP is the node's IP. The suspicion, then, is that the iptables rules on the cluster nodes allow the node IPs to reach the kubelet but do not allow the Pod IPs to do so. To verify this, add a rule to iptables on node k8s-worker1 that allows the Metrics Server Pod's IP, and test:
iptables -I INPUT 7 -s 10.244.4.138 -j ACCEPT
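The rule above whitelists a single Pod IP, which stops working as soon as the Pod is rescheduled and gets a new address. A more durable sketch (10.244.0.0/16 is assumed to be this cluster's Pod CIDR, and the rule has to be inserted on every node, ahead of whichever rule is currently rejecting the traffic) is to allow the whole Pod network to reach the kubelet port:

# inspect which INPUT rule is rejecting the traffic
$ iptables -L INPUT -n --line-numbers
# allow the entire Pod CIDR to reach the kubelet port
$ iptables -I INPUT -s 10.244.0.0/16 -p tcp --dport 10250 -j ACCEPT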
Test connectivity to the kubelet again: it now connects. Check kubectl top node again: metrics for all nodes are returned.
$ curl -v 172.31.16.124:10250
* About to connect() to 172.31.16.124 port 10250 (#0)
*   Trying 172.31.16.124...
* Connected to 172.31.16.124 (172.31.16.124) port 10250 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 172.31.16.124:10250
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 400 Bad Request
<
Client sent an HTTP request to an HTTPS server.
* Closing connection 0

$ kubectl top node
NAME                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
fm-k8s-c1-master1   286m         7%     3242Mi          42%
fm-k8s-c1-master2   150m         3%     3262Mi          43%
fm-k8s-c1-master3   251m         6%     3247Mi          42%
fm-k8s-c1-worker1   166m         1%     4317Mi          13%
fm-k8s-c1-worker2   2013m        12%    21684Mi         70%