Troubleshooting OpenStack Instance I/O Errors: A Ceph Blocklist Case
Sajit Maharj · 2026-05-18 · via DEV Community

How stale Ceph RBD locks and blocklisted clients caused OpenStack VMs to fail after a datacenter power outage — and how we recovered them.

After the power restoration, the Ceph cluster reported HEALTH_OK and OpenStack services appeared operational. A test VM booted successfully. However, all pre-existing VMs failed to start, dropping into initramfs with I/O errors before reaching the root filesystem:

No init found. Try passing init= bootarg.

BusyBox v1.36.1 (Ubuntu 1:1.36.1-6ubuntu3.1) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)

The key observation: newly created VMs worked without issues. This indicated the problem was specific to the relationship between existing VMs and their storage, rather than network or storage infrastructure issues.

Root Cause Analysis

The issue stemmed from Ceph's RBD exclusive locking mechanism. This feature prevents simultaneous writes to the same image from multiple clients, avoiding data corruption. When a compute node connects to an RBD volume, it acquires an exclusive lock; when disconnected cleanly, it releases the lock.

During the power outage, compute nodes lost power without disconnecting cleanly, so their locks were never released. Afterwards, the blocklist contained the client instances that had been running on those nodes:

$ ceph osd blocklist ls
10.88.10.91:0/3853293677 2026-05-06T08:59:47.102488+0000
10.88.10.90:0/316670229 2026-05-07T00:26:11.581329+0000
10.88.10.90:0/3783311129 2026-05-07T00:26:11.581329+0000
...
listed 14 entries

Ceph blocklists clients that crash without releasing locks to prevent zombie processes from corrupting data. The old locks remained held by client IDs that no longer existed, creating a deadlock where VMs needed the locks to boot, but the locks were held by processes that would never release them.

Resolution

Verifying Lock State

We verified the theory by checking an affected volume's lock state:

$ rbd lock list --pool volumes --image volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be

There is 1 exclusive lock on this image.
Locker          ID                    Address
client.3406724  auto 135766063836400  10.88.10.91:0/3853293677

The address matched a blocklisted entry. The lock was held by a client that would not return to release it.
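
The same cross-check can be scripted. The following is a minimal sketch rather than the tooling used during the incident: `lock_addresses` and `is_blocklisted` are hypothetical helper names, and the parsing assumes the `rbd lock list` and `ceph osd blocklist ls` output formats shown in this post.

```shell
# Hypothetical helpers; they only parse text, so they can be tried on saved output.

# Print the Address column of every lock row in `rbd lock list` output (stdin).
# Assumes lock rows start with "client." and the address is the last field.
lock_addresses() {
  awk '/^client\./ {print $NF}'
}

# Succeed if address $1 is the first field of a line in the
# `ceph osd blocklist ls` output passed as $2.
is_blocklisted() {
  printf '%s\n' "$2" | awk -v a="$1" '$1 == a {found=1} END {exit !found}'
}

# Example against a live cluster (the image name is a placeholder):
#   bl=$(ceph osd blocklist ls 2>/dev/null)
#   rbd lock list volumes/<image> | lock_addresses | while read -r addr; do
#     is_blocklisted "$addr" "$bl" && echo "stale lock held by $addr"
#   done
```

Keeping the parsing separate from the `ceph` and `rbd` calls makes it possible to try the helpers on captured output before pointing them at a production cluster.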

Removing Stale Locks

The rbd lock remove command takes the lock ID and the locker as positional arguments. The lock ID contains a space, so it must be quoted:

rbd lock remove volumes/volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be \
  "auto 135766063836400" "client.3406724"

Verification:

$ rbd lock list --pool volumes --image volume-48ed0d20-f065-4536-b3f2-eac5f3abc5be
No locks on this image.

The VM rebooted successfully.

Bulk Resolution

For multiple affected volumes, we used the following script:

# Note: this removes the lock on every volume that has one, stale or not.
for vol in $(rbd ls volumes); do
  # Query the lock table once per volume.
  locks=$(rbd lock list "volumes/$vol" 2>/dev/null)
  if echo "$locks" | grep -q "client"; then
    echo "Removing lock on: $vol"
    # Row 3 of the output is the lock: locker first, then the two-word lock ID.
    lock_id=$(echo "$locks" | awk 'NR==3{print $2" "$3}')
    locker=$(echo "$locks" | awk 'NR==3{print $1}')
    rbd lock remove "volumes/$vol" "$lock_id" "$locker"
    echo "Done: $vol"
  fi
done
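
The loop above removes every lock it finds. A more conservative variant would touch only locks whose holder is on the blocklist, and would print the commands for review instead of running them. The sketch below is one way to do that, not the script used during the incident: `stale_lock_commands` is a hypothetical helper, and it assumes automatic exclusive locks whose ID has the two-word form `auto <number>` shown earlier.

```shell
# Hypothetical helper: reads `rbd lock list` output on stdin and prints one
# `rbd lock remove` command per lock whose holder address is blocklisted.
#   $1 = image spec (pool/image)
#   $2 = output of `ceph osd blocklist ls`
stale_lock_commands() {
  spec=$1
  blocklist=$2
  # Lock rows start with "client."; fields: locker, lock ID (two words), address.
  awk '/^client\./' | while read -r locker id1 id2 addr; do
    # Skip locks whose holder is not blocklisted (it may still be alive).
    printf '%s\n' "$blocklist" |
      awk -v a="$addr" '$1 == a {f=1} END {exit !f}' || continue
    printf 'rbd lock remove %s "%s %s" "%s"\n' "$spec" "$id1" "$id2" "$locker"
  done
}

# Dry run over a pool; review the printed commands before executing them:
#   bl=$(ceph osd blocklist ls 2>/dev/null)
#   for vol in $(rbd ls volumes); do
#     rbd lock list "volumes/$vol" 2>/dev/null |
#       stale_lock_commands "volumes/$vol" "$bl"
#   done
```

Printing the commands rather than executing them leaves a record of what will change and gives a chance to spot a lock that belongs to a running VM.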

Then hard rebooted all affected VMs:

# Note: this loops over every server in every project, not only the affected ones.
for vm in $(openstack server list --all-projects -f value -c ID); do
  name=$(openstack server show "$vm" -f value -c name)
  status=$(openstack server show "$vm" -f value -c status)
  echo "Rebooting: $name ($vm) - Current status: $status"
  openstack server reboot --hard "$vm"
done

All VMs recovered.

Clearing Blocklist Entries

After confirming all locks were released and VMs were healthy, we cleared the blocklist entries:

ceph osd blocklist rm 10.88.10.90
ceph osd blocklist rm 10.88.10.91
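
`ceph osd blocklist ls` reports each entry with its port and nonce. If removing by bare IP leaves entries behind on your Ceph release, one option is to pass each address exactly as listed. The sketch below generates those commands for a given host; `blocklist_rm_commands` is a hypothetical helper and it assumes IPv4 entries in the `IP:port/nonce` form shown earlier.

```shell
# Hypothetical helper: reads `ceph osd blocklist ls` output on stdin and prints
# a `ceph osd blocklist rm` command for each entry whose IP equals $1.
blocklist_rm_commands() {
  awk -v ip="$1" '
    {
      split($1, parts, ":")   # entry looks like IP:port/nonce (IPv4 assumed)
      if (parts[1] == ip) print "ceph osd blocklist rm " $1
    }'
}

# Dry run for one compute node; review the output before executing it:
#   ceph osd blocklist ls 2>/dev/null | blocklist_rm_commands 10.88.10.90
```

Matching on the exact IP avoids accidentally selecting a different host whose address shares a prefix, such as 10.88.10.9 and 10.88.10.90.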

Important: Only perform this step after confirming crashed nodes will not return with stale state. Reconnecting zombie processes while another client holds the lock risks data corruption.

Prevention Measures

Granting OpenStack Blocklist Capabilities

OpenStack requires specific Ceph capabilities to manage blocklist entries automatically. Without allow command "osd blocklist" in its monitor capabilities, Nova cannot clear stale entries.

Step 1: Check current capabilities

ceph auth get client.openstack

Step 2: Add blocklist capability

# First, save existing OSD caps
ceph auth get client.openstack -o /tmp/openstack.keyring

# Then update caps. `ceph auth caps` replaces all existing caps for the entity,
# so restate the OSD caps too (adjust pool names and OSD caps for your environment)
ceph auth caps client.openstack \
  mon 'allow r, allow command "osd blocklist"' \
  osd 'allow class-read object_prefix rbd_children, allow rwx pool=images, allow rwx pool=volumes, allow rwx pool=vms, allow rwx pool=backups'

Note: Adjust pool names according to your environment (e.g., vms, volumes, images).

Step 3: Verify the update

ceph auth get client.openstack
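
The verification can also be automated, for example as a check in a deployment pipeline. The sketch below is an illustration under stated assumptions: `has_blocklist_cap` is a hypothetical helper, it assumes the keyring-style output of `ceph auth get` with a `caps mon = "..."` line, and it also accepts `profile rbd`, which the Ceph documentation describes as including the ability to blocklist other clients.

```shell
# Hypothetical helper: reads `ceph auth get <entity>` output on stdin and
# succeeds if the mon caps line appears to grant blocklist management.
has_blocklist_cap() {
  awk '
    # Accept either the explicit command cap or the rbd profile (assumption:
    # the rbd profile covers blocklisting on your Ceph release).
    /caps mon/ && (/osd blocklist/ || /profile rbd/) {ok=1}
    END {exit !ok}'
}

# Example:
#   ceph auth get client.openstack 2>/dev/null | has_blocklist_cap \
#     && echo "blocklist cap present" \
#     || echo "blocklist cap missing"
```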

Nova Configuration Tuning

The following settings were added to nova.conf on compute nodes:

[libvirt]
hw_disk_discard = unmap
disk_cachemodes = network=writeback
rbd_io_timeout = 30

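The rbd_io_timeout setting is intended to give the RBD client additional time to recover from transient issues rather than failing I/O immediately. Option names vary between Nova releases, so check each one against the configuration reference for your version.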

Key Takeaways

  1. Ceph's blocklist mechanism protects data from split-brain scenarios. The issue arises from unclean shutdowns leaving orphaned locks behind.

  2. New VMs working while existing VMs fail is a diagnostic indicator. After an outage, this pattern points to stale locks and blocklist entries, and recognizing it saves time otherwise spent investigating network or OSD problems.

  3. Proactively grant blocklist permissions to the OpenStack Ceph client. The allow command "osd blocklist" capability enables automatic recovery without manual intervention.

  4. The rbd lock remove syntax requires positional arguments with quoted strings. The --locker flag is not available in many versions. Use the format:

rbd lock remove <pool>/<image> "<lock_id>" "<locker>"

  5. Include lock-related failure scenarios in disaster recovery testing. Standard monitoring and backup verification may not catch this failure mode.
