The Fire That Reached the Backups: The OVHcloud Strasbourg Data-Centre Fire, 2021
Vivian Voss · 2026-05-27 · via DEV Community

Night scene outside a data centre. A tall, multi-storey server building stands fully ablaze, flames driving upward through its core and breaking from the upper floors, thick black smoke rolling into a dark sky. Fire engines with blue lights wait on wet ground at the base. In the foreground, a young developer with a pink cat-ear headset stands small against the scale of it, watching the building burn.

Tales from the Bare Metal — Episode 05

In the early hours of 10 March 2021, a fire began in a power room in Strasbourg. By morning an entire data centre had been destroyed and a second badly damaged. Around 3.6 million websites went offline. For a great many of those customers the sites came back within days. For some, they never came back at all, because the only copy of their data had been in the building that burned. The data loss is not the lesson of this episode. The lesson is that a backup can be complete, valid, restorable, and still worthless, if it shares a failure domain with the thing it is backing up.

The Incident

Shortly before 01:00 on 10 March 2021, fire broke out in a power room at OVHcloud's Strasbourg site, known as SBG. The site comprised several buildings. SBG2 was destroyed entirely. SBG1 was badly damaged, several of its rooms lost. SBG3 and SBG4 did not burn but were powered down while the site was made safe, because the power infrastructure was gone.

The scale of the dependent estate became clear within hours. According to figures cited in the official investigation, roughly 3.6 million websites, corresponding to around 464,000 domain names, were unavailable at the height of the crisis, close to 18 per cent of the active IP addresses OVH had assigned over the preceding fortnight. Game servers, government sites, e-commerce shops and countless small businesses went dark together. OVHcloud's founder communicated openly and frequently through the days that followed, and the company moved quickly to rebuild and to ship replacement capacity. But for customers whose only copy of their data lived on the SBG site, no amount of openness brought the data back. It was gone.

The Diagnosis

The fire started in the power room. The French Bureau of Investigation and Analysis on Industrial Risks (BEA-RI) published its report in June 2022. The report records high humidity readings near one of the power inverters in the hour before the fire began, and discusses the inverters as a likely origin, but it deliberately stops short of asserting a single definitive cause. That hedge is worth respecting: the precise ignition is not known with certainty, and inventing one would be dishonest.

What is understood, and what matters more for the lesson, is why a fire in one power room became the loss of a building. Three design facts compounded.

First, the cooling. SBG2 was built in 2011 using a tower design with free cooling, sometimes called auto-ventilation: rather than mechanical chillers, the building let the waste heat of the servers rise and vent at the top, drawing cooler outside air in at the bottom. As an energy strategy this is genuinely elegant and genuinely efficient. As a fire behaviour, a tall shaft with a strong natural updraught is, in the words that have followed the incident, rather like a chimney. The same airflow that cooled the servers fed and lifted the fire.

Second, the construction. The floors were wooden, rated to resist fire for about an hour. An hour is a long time at a desk and a short time against a fed fire in a ventilated tower.

Third, suppression. OVH had chosen not to fit any of the five buildings on the Strasbourg site with an automatic fire-extinguishing system. There was detection, there was human response and there was the fire brigade, but no gas or water system that triggers on its own in the room of origin in the first minutes, which are the minutes that decide whether a fire stays in one rack or takes a building.

None of these, on its own, is the villain. Together they meant that an event in one power room had very little standing between it and the whole structure.

The Context

The hard part of this story is not OVH's building. It is the customers' assumption, because that assumption is nearly universal and it is the part that travels.

The first condition is a mental model. We say "it is in the data centre" or "it is in the cloud" and we hear "it is safe". The phrase abstracts away the physical fact: a specific building, in a specific town, with specific walls and a specific power room. Almost nobody, choosing where their backup lives, pictures the building. The abstraction that makes cloud convenient is the same abstraction that hides the failure domain.

The second condition is the shape of the tools. A hosting panel offers a backup option, often a cheap one, and the nearest and cheapest option is frequently storage in the same data centre, sometimes the same building. The interface presents "backup" as a feature you switch on, not as a question about geography. So customers switched it on, in good faith, and their primary and their backup came to sit inside one failure domain, chosen by default rather than by decision. The word "backup" did all the reassuring; the location did all the risk.

The third condition is ownership. The location of a backup is rarely anyone's explicit, written requirement. It is a setting, a default, a checkbox during provisioning, and checkboxes have no owner. Restore-testing, the subject of this series' first episode, at least tends to land on someone's plate eventually. "Is our backup in a different failure domain from our primary?" is a question that, in a great many organisations, no one has ever been assigned to answer.

And all of it was reasonable at the time it was decided. The free-cooling tower was a real efficiency innovation that saved real energy for a decade. The single-site backup was a real saving that worked perfectly every day the building did not burn. These were not careless choices. They were ordinary trade-offs whose hidden assumption, that the failure domains do not overlap, was simply never tested until a fire tested it for everyone at once.

The Principle

The rule that answers this is older than the cloud, and it is three numbers: 3-2-1. Keep three copies of your data, on at least two different kinds of media, with at least one of them off-site. The number that does the work here is the one: off-site. That does not mean a different rack or a different room. It means a different failure domain, far enough away that a fire, a flood or a power surge at the primary cannot reach it.
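The rule is simple enough to check mechanically. What follows is a minimal sketch, not a tool: it assumes you can write down, for each copy, what kind of media it sits on and which failure domain it lives in. The DataCopy type, the check_3_2_1 function and every name in the example inventory are hypothetical, invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    """One copy of a dataset. All values used below are hypothetical."""
    name: str
    media: str           # e.g. "nvme", "object-storage", "tape"
    failure_domain: str  # label for the site or region one event can destroy


def check_3_2_1(copies, primary):
    """Return the parts of the 3-2-1 rule that this set of copies violates."""
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need 3")
    if len({c.media for c in copies}) < 2:
        problems.append("all copies are on one kind of media")
    if all(c.failure_domain == primary.failure_domain for c in copies):
        problems.append("no copy outside the failure domain of the primary")
    return problems


# Three copies, three kinds of media, one site: the arrangement that burns together.
primary = DataCopy("web-db", "nvme", "site-a")
same_site = [
    primary,
    DataCopy("panel-backup", "object-storage", "site-a"),
    DataCopy("nas-copy", "hdd", "site-a"),
]
print(check_3_2_1(same_site, primary))
# ['no copy outside the failure domain of the primary']
```

The inventory passes the first two numbers and fails the third, which is the case this episode is about. The check is only as honest as the failure_domain labels you give it; deciding what counts as one domain remains a human judgement.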

The first episode of this series gave you the first commandment of backups: thou shalt not trust a backup thou hast not restored. This episode gives you the second, and they are not the same: thou shalt not keep that backup in the building thou art backing up. A backup you have diligently restore-tested every week is still not a backup if it burns in the same fire as the original. Restorability and separation are two independent axes, and you need both. GitLab, in episode one, had the separation and lacked the restorability. OVH's unluckier customers had neither guaranteed.

In the unixoid tradition the mechanics are unglamorous and well-proven. On FreeBSD, take a ZFS snapshot and zfs send it over SSH to a pool in another building, another region, or another provider entirely; a cron job and a receiving pool are the whole apparatus, and the stream is incremental after the first run. With restic or borg, back up to object storage in a different region, encrypted, deduplicated, with the repository somewhere the primary's misfortune cannot follow. The tooling is not the hard part and never was. The hard part is the decision to put the second copy somewhere the first copy's bad day cannot reach, and then to verify, with a restore, that it is really there.
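To make the ZFS half of that concrete, here is a small sketch that only builds the command lines for one replication run and does not execute them. It assumes the standard zfs snapshot, zfs send -i and zfs receive subcommands; the dataset names, snapshot names and SSH target are placeholders, and the function name is hypothetical.

```python
import shlex


def zfs_replication_commands(dataset, prev_snap, new_snap, ssh_target, remote_dataset):
    """Build (not run) the commands for one off-site replication.

    prev_snap is None for the first, full send; afterwards the stream
    is incremental from prev_snap to new_snap.
    """
    snapshot = ["zfs", "snapshot", f"{dataset}@{new_snap}"]

    send = ["zfs", "send"]
    if prev_snap is not None:
        send += ["-i", f"{dataset}@{prev_snap}"]
    send.append(f"{dataset}@{new_snap}")

    # The receiving pool lives on another machine, ideally in another failure domain.
    receive = ["ssh", ssh_target, "zfs", "receive", remote_dataset]

    pipeline = f"{shlex.join(send)} | {shlex.join(receive)}"
    return shlex.join(snapshot), pipeline


snap_cmd, pipe_cmd = zfs_replication_commands(
    "tank/data", "2021-03-08", "2021-03-09",
    "backup@offsite.example.net", "vault/data",
)
print(snap_cmd)
# zfs snapshot tank/data@2021-03-09
print(pipe_cmd)
# zfs send -i tank/data@2021-03-08 tank/data@2021-03-09 | ssh backup@offsite.example.net zfs receive vault/data
```

An incremental send only works if the receiving side still holds the previous snapshot, so a real job would also need retention handling and error checking. Neither is shown here, and nor is the restore that proves the copy is usable.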

Where It Travels

The OVH customers were not unusually careless. The same failure domain hides in nearly every modern stack, wearing the local vocabulary.

On AWS, the trap is the word "zone". Multi-AZ feels like redundancy, and against the failure of a host, a rack or a single facility it is. But an Availability Zone is one or more data centres, all the zones of a region sit within roughly 100 km of one another, and the region is the unit of geographic separation. A database replicated across AZs survives a host fault and may not survive a regional event; the off-site copy is cross-region replication (S3 CRR, cross-region snapshots), and it is a separate, deliberate setting.

On Azure, the distinction is the storage redundancy tier: locally-redundant storage (LRS) keeps the copies in one data centre, while geo-redundant storage (GRS) places a copy in a paired region hundreds of kilometres away. The cheaper tier is the one that shares the postcode.

On Google Cloud, multi-region buckets and cross-region backups serve the same role, and the same default-versus-decision applies.

In Kubernetes, the cluster is the failure domain people forget. Velero backups and etcd snapshots that live on the same cluster, or in object storage in the same region, are a second copy in one place. Ship them off-cluster and off-region.

On-premises, the rule is at its most physical and most ignored. The backup NAS in the same server room as the production servers is not a backup; it is a second copy awaiting the same flood, the same power surge, the same fire. The unfashionable tape, written weekly and carried to a drawer across town or a safe-deposit box, has quietly saved more organisations than any cloud panel's backup toggle.

The shape is identical everywhere: a copy that shares a failure domain with the original is redundancy in name only. It survives the failures that do not matter much and dies in the one that does.
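The same question can be asked in every one of those vocabularies once a location is written as a path from coarse to fine. The sketch below is one illustrative way to model it, under the assumption that provider, region, site and room are the levels that matter to you. The Location type, both functions and all the example values are hypothetical.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    """Where a copy physically lives, from coarse to fine. Values are illustrative."""
    provider: str
    region: str
    site: str   # availability zone, data centre or building
    room: str


LEVELS = ("provider", "region", "site", "room")


def shared_failure_domain(primary: Location, backup: Location) -> Optional[str]:
    """Return the finest level at which both locations still coincide, or None."""
    shared = None
    for level, a, b in zip(LEVELS, primary, backup):
        if a != b:
            break
        shared = level
    return shared


def is_off_site(primary: Location, backup: Location) -> bool:
    """Policy assumed here: sharing a region, or anything finer, is not off-site."""
    return shared_failure_domain(primary, backup) in (None, "provider")


primary = Location("hoster-a", "city-1", "building-2", "room-1")
panel_backup = Location("hoster-a", "city-1", "building-2", "room-3")
other_region = Location("hoster-a", "city-9", "building-1", "room-1")

print(shared_failure_domain(primary, panel_backup))  # site
print(is_off_site(primary, panel_backup))            # False
print(is_off_site(primary, other_region))            # True
```

Where to draw the line is a policy choice. This sketch treats a shared region as too close, which matches the cross-region advice above, but a team worried about a provider-wide outage might reasonably demand that nothing is shared at all.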

Coda

OVHcloud rebuilt and changed its designs, and the industry spent a fortnight reading think-pieces about fire suppression and free cooling. Both are worth reading. But the durable lesson of 10 March 2021 is not about cooling towers or wooden floors, which are OVH's to fix. It is about a sentence every team can check this afternoon without a single phone call to a vendor: where, physically, is our backup, and could the thing that kills our primary kill it too?

Redundancy that shares a postcode is decoration. The fire does not read your architecture diagram. It reads the floor plan.

Read the full article on vivianvoss.net →


By Vivian Voss, System Architect and Software Developer. Follow me on LinkedIn for daily technical writing.