Ceph 3-node cluster, VM I/O freeze after node reboot/update
invalid@exam · 2026-06-12 · via Proxmox Support Forum

Hi everyone,

I am investigating an issue with a 3-node Proxmox/Ceph cluster and would like to ask if anyone has seen a similar failure mode before.

Environment

  • 3 Proxmox nodes
  • PVE: 9.2.2
  • Ceph version: 19.2.3
  • Ceph network: 10.0.50.0/24
  • public_network and cluster_network are currently on the same network
  • Ceph interface: bond1
  • bond1 is 2 × 10G LACP / 802.3ad
  • MTU 9000 on the Proxmox side
  • NICs: Broadcom BCM57412 NetXtreme-E 10GbE, driver: bnxt_en
  • Switch: Huawei S6730-H48X6C stack
  • Pools are size=3, min_size=2

The current Linux bonding status on all three nodes looks clean from an LACP point of view:

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
LACP active: on
LACP rate: slow
Number of ports: 2
Both slaves: 10000 Mbps/full
Aggregator ID: same on both slaves
Actor/Partner Churn State: none
Link Failure Count: 0
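
To compare the per-member state across nodes instead of reading it by eye, the bonding status can be parsed. This is a minimal Python sketch, assuming the usual `/proc/net/bonding/<bond>` layout. The sample text is illustrative only: the interface names come from the LLDP listing later in this post, but the aggregator ID is a placeholder, not captured output from these nodes.

```python
import re

# Illustrative sample shaped like /proc/net/bonding/bond1; values are placeholders.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up

Slave Interface: enp194s0f0np0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none

Slave Interface: enp194s0f1np1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: none
"""

WANTED = ("MII Status", "Speed", "Link Failure Count", "Aggregator ID",
          "Actor Churn State", "Partner Churn State")


def parse_slaves(text):
    """Return {slave_name: {field: value}} for each 'Slave Interface' section."""
    slaves = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*([^:]+):\s*(.*)$", line)
        if not m:
            continue
        key, value = m.group(1).strip(), m.group(2).strip()
        if key == "Slave Interface":
            current = slaves.setdefault(value, {})
        elif current is not None and key in WANTED:
            current[key] = value
    return slaves


def lacp_problems(slaves):
    """List deviations from a clean LACP bundle (empty list means clean)."""
    problems = []
    if len({s.get("Aggregator ID") for s in slaves.values()}) > 1:
        problems.append("members are in different aggregators")
    for name, s in slaves.items():
        if s.get("MII Status") != "up":
            problems.append(f"{name}: link not up")
        if s.get("Link Failure Count", "0") != "0":
            problems.append(f"{name}: link failures recorded")
        for side in ("Actor Churn State", "Partner Churn State"):
            if s.get(side, "none") != "none":
                problems.append(f"{name}: {side} = {s[side]}")
    return problems


slaves = parse_slaves(SAMPLE)
print(lacp_problems(slaves) or "bond looks clean")
```

A clean result here only says that LACP negotiated correctly. It says nothing about how traffic is distributed across the two members.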


# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class nvme
device 1 osd.1 class nvme
device 2 osd.2 class nvme
device 3 osd.3 class nvme
device 4 osd.4 class nvme
device 5 osd.5 class nvme
device 6 osd.6 class nvme
device 7 osd.7 class nvme
device 8 osd.8 class nvme
device 9 osd.9 class nvme
device 10 osd.10 class nvme
device 11 osd.11 class nvme

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

# buckets
host server1 {
    id -3 # do not change unnecessarily
    id -4 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.0 weight 6.98630
    item osd.1 weight 6.98630
    item osd.2 weight 6.98630
    item osd.3 weight 6.98630
}
host server2 {
    id -5 # do not change unnecessarily
    id -6 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.4 weight 6.98630
    item osd.5 weight 6.98630
    item osd.6 weight 6.98630
    item osd.7 weight 6.98630
}
host server3 {
    id -7 # do not change unnecessarily
    id -8 class nvme # do not change unnecessarily
    # weight 27.94519
    alg straw2
    hash 0 # rjenkins1
    item osd.8 weight 6.98630
    item osd.9 weight 6.98630
    item osd.10 weight 6.98630
    item osd.11 weight 6.98630
}
root default {
    id -1 # do not change unnecessarily
    id -2 class nvme # do not change unnecessarily
    # weight 83.83557
    alg straw2
    hash 0 # rjenkins1
    item server1 weight 27.94519
    item server2 weight 27.94519
    item server3 weight 27.94519
}

# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
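
As a reading aid for the pool settings and the CRUSH rule: with size=3, three hosts, and `chooseleaf ... type host`, each placement group keeps one replica per host. The Python sketch below is a deliberately simplified model of the min_size=2 check. It assumes all OSDs on a reachable host are usable and ignores peering and recovery state, so it is not a simulation of Ceph itself.

```python
SIZE = 3       # replicas per placement group (from the post)
MIN_SIZE = 2   # replicas that must be available for a PG to accept I/O
HOSTS = ("server1", "server2", "server3")


def replicas_available(hosts_up):
    """One replica per host, so available replicas = usable hosts (capped at SIZE)."""
    return min(SIZE, len(set(hosts_up) & set(HOSTS)))


def pg_serves_io(hosts_up):
    """True if a PG still meets min_size given which hosts have usable OSDs."""
    return replicas_available(hosts_up) >= MIN_SIZE


for up in (HOSTS, ("server1", "server3"), ("server3",)):
    state = "I/O continues" if pg_serves_io(up) else "I/O blocks"
    print(sorted(up), "->", state)
```

In this model a single rebooting node leaves two replicas and I/O continues, while any situation in which two hosts are effectively unusable at the same time blocks I/O.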


What happened

During maintenance we upgraded/rebooted the nodes one after another.

The order was roughly:

  1. server1 was upgraded/rebooted first.
  2. server2 was upgraded/rebooted next.
  3. The outage happened shortly after server2 came back.
  4. server3 had not yet been upgraded/rebooted at the time the outage started.

After server2 rejoined, Ceph/RBD I/O became stuck. VMs froze. Logs showed Ceph OSD heartbeat problems between server1 and server2.

Examples from the syslog:

heartbeat_check: no reply from 10.0.50.11
heartbeat_check: no reply from 10.0.50.10
slow ops

Eventually all hosts were rebooted and the cluster recovered.

Interesting observations

1. One LACP member per server has carried almost no traffic for at least a year

We checked interface graphs for the physical Ceph NICs.

For each node, one of the two LACP member interfaces has carried essentially no useful traffic for at least a year. The graph shows an average of only a few hundred bit/s with occasional small kbit/s spikes, which looks like LACP/LLDP control traffic only.

So although the Ceph network is physically 2 × 10G per node, it appears to have been effectively using only one 10G member per node.

The Linux bonding policy is currently: Transmit Hash Policy: layer2 (0)
This may explain the poor distribution, because with only three Ceph nodes there are very few MAC pairs. However, I am surprised that the pattern is so consistent over such a long time.
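
To illustrate why so few MAC pairs can pin traffic to one member, here is a minimal Python sketch of the layer2 formula as given in the Linux bonding documentation (source MAC[5] XOR destination MAC[5] XOR packet type ID, modulo slave count). The MAC addresses are hypothetical, not taken from this cluster, and the real kernel implementation may differ in detail between versions.

```python
ETH_P_IP = 0x0800  # EtherType for IPv4


def layer2_slave(src_mac, dst_mac, slave_count, packet_type=ETH_P_IP):
    """Slave index per the documented layer2 formula:
    (src MAC[5] XOR dst MAC[5] XOR packet type ID) modulo slave count."""
    src_last = int(src_mac.split(":")[5], 16)
    dst_last = int(dst_mac.split(":")[5], 16)
    return (src_last ^ dst_last ^ packet_type) % slave_count


# Hypothetical bond MAC addresses (locally administered), NOT from the post.
BOND_MACS = {
    "server1": "02:00:00:00:00:10",
    "server2": "02:00:00:00:00:20",
    "server3": "02:00:00:00:00:30",
}

used = set()
for src, src_mac in BOND_MACS.items():
    for dst, dst_mac in BOND_MACS.items():
        if src != dst:
            idx = layer2_slave(src_mac, dst_mac, slave_count=2)
            used.add(idx)
            print(f"{src} -> {dst}: slave {idx}")

print("slave indexes in use:", sorted(used))
```

With these made-up addresses every node pair maps to slave 0, so the second member would only carry control traffic. The result for a given pair never changes because the inputs are fixed MAC addresses, which would fit a pattern that stays the same for a year. The sketch covers only the Linux transmit direction; the member used for traffic from the switch to the server is chosen by the switch's own load-balancing setting.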

2. The active Ceph ports seem to be on the same stack member

LLDP shows that the Ceph NICs are connected like this:

server1:
enp194s0f0np0 -> XGigabitEthernet1/0/48 -> PortAggregID 21
enp194s0f1np1 -> XGigabitEthernet0/0/1 -> PortAggregID 21

server2:
enp194s0f0np0 -> XGigabitEthernet1/0/46 -> PortAggregID 22
enp194s0f1np1 -> XGigabitEthernet0/0/3 -> PortAggregID 22

server3:
enp194s0f0np0 -> XGigabitEthernet1/0/1 -> PortAggregID 23
enp194s0f1np1 -> XGigabitEthernet0/0/48 -> PortAggregID 23

So each server has one Ceph link on stack member 1 and one Ceph link on stack member 0.

Based on our monitoring, it looks like the useful Ceph traffic has historically been on the same side/member, while the other physical link is mostly idle.
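
The LLDP listing can be grouped by stack member to make the layout explicit. This Python sketch uses the port names exactly as listed in this section and reads the first number of each port name as the stack member ID, as the text does. Which NIC actually carries the traffic is not stated in the post, so the sketch simply reports both.

```python
import re
from collections import defaultdict

# LLDP neighbor mapping as listed in the post.
LLDP = {
    "server1": {"enp194s0f0np0": "XGigabitEthernet1/0/48",
                "enp194s0f1np1": "XGigabitEthernet0/0/1"},
    "server2": {"enp194s0f0np0": "XGigabitEthernet1/0/46",
                "enp194s0f1np1": "XGigabitEthernet0/0/3"},
    "server3": {"enp194s0f0np0": "XGigabitEthernet1/0/1",
                "enp194s0f1np1": "XGigabitEthernet0/0/48"},
}


def stack_member(port):
    """First number of the trailing x/y/z triple, read here as the stack member ID."""
    m = re.search(r"(\d+)/(\d+)/(\d+)$", port)
    if m is None:
        raise ValueError(f"unexpected port name: {port!r}")
    return int(m.group(1))


def nics_by_member(lldp):
    """Return {stack_member: [(server, nic), ...]}."""
    grouped = defaultdict(list)
    for server, nics in lldp.items():
        for nic, port in nics.items():
            grouped[stack_member(port)].append((server, nic))
    return dict(grouped)


def members_used_by(lldp, nic_name):
    """Stack members that a given NIC name lands on across all servers."""
    return {stack_member(nics[nic_name]) for nics in lldp.values()}


for member, nics in sorted(nics_by_member(LLDP).items()):
    print(f"stack member {member}: {nics}")
for nic in ("enp194s0f0np0", "enp194s0f1np1"):
    print(nic, "->", members_used_by(LLDP, nic))
```

On all three servers `enp194s0f0np0` lands on stack member 1 and `enp194s0f1np1` on stack member 0, so if the same NIC is the busy one on every node, all useful Ceph traffic passes through a single stack member.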

3. Switch output drops were reported, but the traffic graph does not show a clear spike at the outage time

The data center reported Huawei switch logs with output queue drops / congestion messages on one of the Ceph member ports.

However, when looking at traffic graphs around the outage time, we do not see a clear bandwidth spike. In fact, the interface traffic seems to drop to zero shortly after the problem starts.
[Attachment: 1781027803070.png]
This makes me unsure whether simple port congestion is really the root cause. It feels more like a temporary forwarding/blackhole/LACP/stack/NIC issue than just “the 10G link was overloaded”.

4. MTU

On the Proxmox side, bond1 is configured with MTU 9000. LLDP from the Huawei switch reports a maximum frame size (MFS) of 9216 on the relevant Ceph ports.
Jumbo ping tests with the DF bit set have worked without any problems.
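
The post does not state the payload size used for these tests. For reference, the largest ICMP payload that fits a given MTU over IPv4 without fragmentation is the MTU minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header. The Python sketch below does that arithmetic and builds an iputils `ping` command line; the target 10.0.50.11 is one of the addresses from the heartbeat log lines above.

```python
IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo request header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def ping_command(target, mtu, count=3):
    """iputils ping invocation with fragmentation prohibited (-M do)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(max_ping_payload(mtu)), target]


print(max_ping_payload(9000))                        # 8972
print(" ".join(ping_command("10.0.50.11", 9000)))
```

A test with a smaller payload than 8972 bytes would still pass on a path whose real limit is below 9000, so the exact size matters when interpreting the result.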

5. Broadcom bnxt_en messages

On boot, the Broadcom NICs show messages like:

hwrm_tunnel_dst_port_alloc failed. rc:-95
UDP tunnel port sync failed port 4789 type vxlan: -95

These appear on the Broadcom interfaces. Ceph itself is not using VXLAN, so I am not sure whether this is relevant or just an unrelated offload/firmware warning.
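
One way to read these messages is to treat the trailing number as a negated Linux errno value, which is an assumption about how the driver reports the failure. Under that assumption, -95 is EOPNOTSUPP ("Operation not supported"), and port 4789 is the IANA-assigned VXLAN port. The Python sketch below decodes the two lines quoted above; the errno table is a small hard-coded subset of Linux values, so the result does not depend on the platform running the script.

```python
import re

# Subset of Linux errno values, hard-coded on purpose (platform independent).
LINUX_ERRNO = {
    1: ("EPERM", "Operation not permitted"),
    16: ("EBUSY", "Device or resource busy"),
    22: ("EINVAL", "Invalid argument"),
    95: ("EOPNOTSUPP", "Operation not supported"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# The two messages quoted in the post.
LOG_LINES = [
    "hwrm_tunnel_dst_port_alloc failed. rc:-95",
    "UDP tunnel port sync failed port 4789 type vxlan: -95",
]


def decode(line):
    """Return (code, name, text) for a trailing negative return code, else None."""
    m = re.search(r"-(\d+)\s*$", line)
    if m is None:
        return None
    code = int(m.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in the table above"))
    return code, name, text


for line in LOG_LINES:
    print(line, "=>", decode(line))
```

If that reading is right, both lines say the NIC or firmware declined a VXLAN port offload request, which would be consistent with a harmless warning on a network that carries no VXLAN traffic.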

Flow control is disabled.

Questions

Has anyone seen a similar issue?

Any suggestions for specific counters or tests would be appreciated.