


























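Author: 尹正杰 (Yin Zhengjie)
Copyright notice: original work, reproduction not permitted! Violations will be pursued legally.
1. Check the BGP peer status on each node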
[root@ansible99 ~]# dk ansible -i /etc/kubeasz/clusters/yinzhengjie-k8s/hosts all -m shell -a 'calicoctl node status'
10.0.0.233 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.231 | node specific | up | 09:52:22 | Established |
| 10.0.0.232 | node specific | up | 09:52:22 | Established |
| 10.0.0.77 | node specific | up | 09:52:22 | Established |
| 172.20.0.1 | node specific | start | 09:52:20 | Connect |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.231 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.232 | node specific | up | 09:52:21 | Established |
| 10.0.0.233 | node specific | up | 09:52:21 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.232 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.231 | node specific | up | 09:52:22 | Established |
| 10.0.0.233 | node specific | up | 09:52:22 | Established |
| 10.0.0.77 | node specific | up | 09:52:22 | Established |
| 172.20.0.1 | node specific | start | 09:52:20 | Connect |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.66 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+---------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+---------+
| 10.0.0.232 | node specific | start | 09:52:19 | Connect |
| 10.0.0.233 | node specific | start | 09:52:19 | Connect |
+--------------+---------------+-------+----------+---------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.77 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.232 | node specific | up | 09:52:22 | Established |
| 10.0.0.233 | node specific | up | 09:52:22 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
[root@ansible99 ~]#
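With five nodes reporting at once, the rows that are not Established are easy to overlook. The following is a minimal sketch for filtering them, assuming the table layout printed by `calicoctl node status` above; the function name `bgp_unhealthy_peers` is made up for illustration.

```shell
#!/usr/bin/env bash
# Print "<peer> <state> <info>" for every BGP peer row whose INFO column
# is not "Established". Reads `calicoctl node status` output from stdin.
# bgp_unhealthy_peers is a hypothetical helper name, not part of calicoctl.
bgp_unhealthy_peers() {
  awk -F'|' '
    /^\|/ {
      peer = $2; state = $4; info = $6
      # Trim the padding spaces around each table cell.
      gsub(/^[ \t]+|[ \t]+$/, "", peer)
      gsub(/^[ \t]+|[ \t]+$/, "", state)
      gsub(/^[ \t]+|[ \t]+$/, "", info)
      # Skip the header row: only rows whose first cell is an IPv4 address count.
      if (peer ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && info != "Established")
        print peer, state, info
    }'
}
```

On a node it could be used as `calicoctl node status | bgp_unhealthy_peers`; no output means every listed peer is Established.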
2. Fault description
I found that the worker232 and worker233 nodes have a 'mynet0' network interface with the IP address '172.20.0.1'.
1. Check the local BGP peer status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.231 | node specific | up | 09:52:22 | Established |
| 10.0.0.233 | node specific | up | 09:52:22 | Established |
| 10.0.0.77 | node specific | up | 09:52:22 | Established |
| 172.20.0.1 | node specific | start | 09:52:20 | Connect |
+--------------+---------------+-------+----------+-------------+
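2. Analysis
The peer 172.20.0.1 is stuck in the TCP connection phase: it stays in Connect and never reaches Established, so it is an abnormal neighbor.
This address comes from a virtual network interface that was created automatically when I deployed the cluster, as shown below: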
[root@worker66 ~]# ip a
...
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:fb:13:4a brd ff:ff:ff:ff:ff:ff
altname enp2s1
altname ens33
inet 10.0.0.66/24 brd 10.0.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fefb:134a/64 scope link
valid_lft forever preferred_lft forever
...
4: mynet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether a2:43:73:de:11:4b brd ff:ff:ff:ff:ff:ff
inet 172.20.0.1/16 brd 172.20.255.255 scope global mynet0
valid_lft forever preferred_lft forever
inet6 fe80::a043:73ff:fede:114b/64 scope link
valid_lft forever preferred_lft forever
...
[root@worker66 ~]#
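Instead of scanning the full `ip a` listing by eye, the owner of a suspicious address can be looked up directly. This is a minimal sketch, assuming the one-line-per-address format produced by `ip -o -4 addr show`; `iface_for_ip` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the interface name that carries the given IPv4 address.
# Expects `ip -o -4 addr show` output on stdin, where each line looks like:
#   4: mynet0    inet 172.20.0.1/16 brd 172.20.255.255 scope global mynet0 ...
# iface_for_ip is a hypothetical helper name.
iface_for_ip() {
  local ip="$1"
  awk -v ip="$ip" '
    $3 == "inet" {
      split($4, a, "/")            # strip the prefix length, e.g. "/16"
      if (a[1] == ip) print $2     # second field is the interface name
    }'
}
```

For the address seen in the peer table, `ip -o -4 addr show | iface_for_ip 172.20.0.1` would be the call to run on each node.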
1. Delete the bridge device
[root@worker66 ~]# ifconfig mynet0
mynet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 172.20.0.1 netmask 255.255.0.0 broadcast 172.20.255.255
inet6 fe80::a043:73ff:fede:114b prefixlen 64 scopeid 0x20<link>
ether a2:43:73:de:11:4b txqueuelen 1000 (Ethernet)
RX packets 12010 bytes 1401177 (1.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 12231 bytes 9498334 (9.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[root@worker66 ~]#
[root@worker66 ~]# ip link set dev mynet0 down # bring the interface down first
[root@worker66 ~]#
[root@worker66 ~]# ip link delete mynet0 # delete the bridge device
[root@worker66 ~]#
[root@worker66 ~]# ifconfig mynet0
mynet0: error fetching interface information: Device not found
[root@worker66 ~]#
2. Check the BGP peer status again [back to normal]
[root@ansible99 ~]# dk ansible -i /etc/kubeasz/clusters/yinzhengjie-k8s/hosts all -m shell -a 'calicoctl node status'
10.0.0.232 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.231 | node specific | up | 09:52:22 | Established |
| 10.0.0.233 | node specific | up | 09:52:22 | Established |
| 10.0.0.77 | node specific | up | 09:52:22 | Established |
| 10.0.0.66 | node specific | up | 11:24:09 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.231 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.232 | node specific | up | 09:52:21 | Established |
| 10.0.0.233 | node specific | up | 09:52:21 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.77 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.232 | node specific | up | 09:52:22 | Established |
| 10.0.0.233 | node specific | up | 09:52:22 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.233 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.231 | node specific | up | 09:52:22 | Established |
| 10.0.0.232 | node specific | up | 09:52:22 | Established |
| 10.0.0.77 | node specific | up | 09:52:22 | Established |
| 10.0.0.66 | node specific | up | 11:24:09 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
10.0.0.66 | CHANGED | rc=0 >>
Calico process is running.
IPv4 BGP status
+--------------+---------------+-------+----------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+--------------+---------------+-------+----------+-------------+
| 10.0.0.232 | node specific | up | 11:24:09 | Established |
| 10.0.0.233 | node specific | up | 11:24:09 | Established |
+--------------+---------------+-------+----------+-------------+
IPv6 BGP status
No IPv6 peers found.
[root@ansible99 ~]#

[root@ansible99 ~]# ./ezdown -D
2026-06-01 01:19:53 [ezdown:717] INFO Action begin: download_all
2026-06-01 01:19:53 [ezdown:162] INFO downloading docker binaries, arch:x86_64, version:20.10.24
--2026-06-01 01:19:53-- https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/static/stable/x86_64/docker-20.10.24.tgz
Resolving mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)... 101.6.15.130, 2402:f000:1:400::2
Connecting to mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)|101.6.15.130|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2026-06-01 01:19:53 ERROR 403: Forbidden.
2026-06-01 01:19:53 [ezdown:164] ERROR downloading docker failed
[root@ansible99 ~]#
Downloading the docker-20.10.24.tgz package failed, so download it manually:
[root@ansible99 ~]# wget https://mirrors.tuna.tsinghua.edu.cn/docker-ce/linux/static/stable/x86_64/docker-20.10.24.tgz -O /etc/kubeasz/down/docker-20.10.24.tgz
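Note that `wget -O` creates the target file even when the request fails, so after another 403 an empty file would be left at that path. Below is a minimal sketch of a sanity check to run before invoking `./ezdown -D` again; `is_valid_tgz` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Return 0 only if the file exists, is non-empty, and can be listed
# as a gzip-compressed tar archive.
# is_valid_tgz is a hypothetical helper name.
is_valid_tgz() {
  local f="$1"
  [ -s "$f" ] && tar -tzf "$f" >/dev/null 2>&1
}
```

For example: `is_valid_tgz /etc/kubeasz/down/docker-20.10.24.tgz || rm -f /etc/kubeasz/down/docker-20.10.24.tgz`, so that a broken download is removed rather than reused.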

TASK [calico : 配置 calico DaemonSet yaml文件] **************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: If you are using a module and expect the file to exist on the remote, see the remote_src option
fatal: [10.0.0.231]: FAILED! => {"changed": false, "msg": "Could not find or access 'calico-v3.31.yaml.j2'\nSearched in:\n\t/etc/kubeasz/roles/calico/templates/calico-v3.31.yaml.j2\n\t/etc/kubeasz/roles/calico/calico-v3.31.yaml.j2\n\t/etc/kubeasz/roles/calico/tasks/templates/calico-v3.31.yaml.j2\n\t/etc/kubeasz/roles/calico/tasks/calico-v3.31.yaml.j2\n\t/etc/kubeasz/playbooks/templates/calico-v3.31.yaml.j2\n\t/etc/kubeasz/playbooks/calico-v3.31.yaml.j2 on the Ansible Controller.\nIf you are using a module and expect the file to exist on the remote, see the remote_src option"}
NO MORE HOSTS LEFT ******************************************************************************************************************************
According to the error, the template 'calico-v3.31.yaml.j2' is missing.
1. Download the template
[root@ansible99 ~]# cd /etc/kubeasz/roles/calico/templates
[root@ansible99 templates]#
[root@ansible99 templates]#
[root@ansible99 templates]# wget https://raw.githubusercontent.com/projectcalico/calico/v3.31.5/manifests/calico.yaml -O calico-v3.31.yaml.j2
2. Reinstall the network plugin.
[root@ansible99 ~]# dk ezctl setup yinzhengjie-k8s 06
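One caveat with this workaround: the upstream calico.yaml is a static manifest, while the file name it is saved under is treated by Ansible as a Jinja2 template. If the file contains no Jinja2 expressions, it is rendered unchanged and no cluster-specific settings are substituted into it. The sketch below lists template expressions in a file so the downloaded manifest can be compared with a template that kubeasz ships; `jinja_lines` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print (with line numbers) every line of a file that contains a Jinja2
# expression "{{ ... }}" or statement "{% ... %}".
# Exit status is non-zero when the file contains none.
# jinja_lines is a hypothetical helper name.
jinja_lines() {
  grep -nF -e '{{' -e '{%' "$1"
}
```

Running `jinja_lines calico-v3.31.yaml.j2` in the templates directory, and then the same on one of the other templates already present there, shows which variables the stock templates rely on and the downloaded manifest does not provide.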

...
3.10: Pulling from easzlab/pause
61d9e957431b: Pulling fs layer
Get "https://registry-1.docker.io/v2/": dial tcp 128.242.245.93:443: i/o timeout
2026-06-02 12:33:57 [ezdown:556] ERROR download easzlab/pause:3.10 failed!
2026-06-02 12:33:57 [ezdown:718] ERROR Action failed: download_all
[root123@vm1 ~]#
The image pull timed out, probably because of network jitter or an unreachable registry; it needs to be retried a few times.
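Retrying by hand gets tedious, so it can be scripted. This is a minimal sketch of a generic retry wrapper; the function name `retry` and the attempt and delay values are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Run a command until it succeeds or the attempt limit is reached.
#   $1 = maximum number of attempts, $2 = seconds to sleep between attempts,
#   remaining arguments = the command to run.
# retry is a hypothetical helper name.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1                     # give up after the last allowed attempt
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For the download above this would be something like `retry 5 10 ./ezdown -D`, which gives up after five failed runs.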
Solutions:
- 1. Retry a few times;
- 2. If the image still cannot be pulled, configure a proxy (VPN) for Docker; the result looks like this:
[root@k8s-cluster241 ~]# systemctl cat docker
...
[Service]
...
# The key change is the following 3 lines; you only need to replace the proxy address with your own.
Environment="HTTP_PROXY=http://10.0.0.1:7890"
Environment="HTTPS_PROXY=http://10.0.0.1:7890"
Environment="NO_PROXY=localhost,127.0.0.1,easzlab.io.local,.docker.internal,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
ExecStart=/opt/kube/bin/dockerd
...
[root@k8s-cluster241 ~]#
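The output above shows the unit after the change. One way to get there without editing the unit file itself is a systemd drop-in, which is the approach Docker's documentation describes for daemon proxies. The following is a minimal sketch, assuming the same proxy address and NO_PROXY list as above; `write_docker_proxy_dropin` and the file name `http-proxy.conf` are illustrative choices.

```shell
#!/usr/bin/env bash
# Write a systemd drop-in that sets proxy variables for the Docker daemon.
#   $1 = proxy URL, $2 = drop-in directory (defaults to Docker's unit drop-in dir).
# write_docker_proxy_dropin is a hypothetical helper name.
write_docker_proxy_dropin() {
  local proxy="$1"
  local dir="${2:-/etc/systemd/system/docker.service.d}"
  mkdir -p "$dir" || return 1
  cat > "$dir/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$proxy"
Environment="HTTPS_PROXY=$proxy"
Environment="NO_PROXY=localhost,127.0.0.1,easzlab.io.local,.docker.internal,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16"
EOF
}

# Usage (as root), then reload systemd and restart Docker:
#   write_docker_proxy_dropin http://10.0.0.1:7890
#   systemctl daemon-reload && systemctl restart docker
#   systemctl show --property=Environment docker
```

The last command prints the environment systemd will pass to dockerd, which confirms that the proxy settings were picked up.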

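As the screenshot above shows, 80 k8s practitioners could not get a rolling update of a k8s cluster to work, so I have to vent a little.
Praise:
First, credit to Kubeasz for its features. It really does provide cluster management: I tested k8s cluster deployment, scale-out, scale-in, upgrade, backup, and restore, and all of them are supported.
Shortcomings:
While k8s cluster nodes (including but not limited to master, worker, and etcd nodes) are being scaled out or in, the k8s cluster is unavailable. There is no rolling update at all; services are stopped in batches instead!
Also, the version I tested is k8s 1.33.11, while upstream has already released k8s 1.33.12, and as of today Kubeasz still has not been updated to support the latest version.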
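For contrast, a rolling procedure touches one node at a time and only moves on once that node is back in service. The sketch below illustrates the idea for worker nodes only (control-plane and etcd members need their own procedure); it is not how Kubeasz or any other tool implements it. `rolling_node_update`, the per-node update command, and the `KUBECTL` override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Update worker nodes one at a time so the rest of the cluster keeps serving.
#   $1 = HYPOTHETICAL per-node update command (it receives the node name),
#   remaining arguments = node names.
# KUBECTL may be set to override the kubectl binary (an assumption for testing).
rolling_node_update() {
  local update_cmd="$1"
  shift
  local kubectl="${KUBECTL:-kubectl}"
  local node
  for node in "$@"; do
    # drain cordons the node and evicts its pods before anything is stopped
    "$kubectl" drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    # run the actual update for this single node
    "$update_cmd" "$node" || return 1
    # make the node schedulable again before touching the next one
    "$kubectl" uncordon "$node" || return 1
  done
}
```

Any failure stops the loop, so at most one node is out of service when something goes wrong.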
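Comment:
Kubeasz is a good idea and the thinking behind it is sound, but the logic of some specific features still needs careful review and optimization.
Production recommendation:
Not recommended! It is still a good tool for learning environments. If you already use it in production, I suggest reworking the playbooks behind the upgrade, backup, restore, scale-out, and scale-in features.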

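For my production environment of 2000+ GPU k8s nodes, I am not considering Kubeasz for now. In the near future I plan to evaluate kubespray, an open-source tool from the k8s community.
Kubespray (formerly Kargo): an Ansible-based K8s deployment tool maintained under kubernetes-sigs. It deploys production-grade HA clusters natively, supports bare metal, virtualization, and public cloud, and is a popular choice for enterprises building their own clusters.
Reference: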
https://github.com/kubernetes-sigs/kubespray