Post Mortem: The Ugly, the Bad & the Good
Cloudflare Team · 2012-02-25 · via The Cloudflare Blog

2012-02-24


Last night was not our finest hour. Around 07:30 GMT, we were finishing up a push of a new DNS infrastructure. The core purpose of this update is to make DNS updates even faster. Previously, it took about a minute for a change to your DNS settings to propagate to all our infrastructure; with the new DNS update, it is almost instant. Keeping that in mind is key to understanding what went wrong.

Making an update to the DNS requires changing underlying code deep in our system and taking servers offline while we do so. We scheduled the update for the quietest time on our network, which is around 07:00 GMT (around 11:00pm in San Francisco). The code had been running smoothly in our test environment and one data center for the last week so we were feeling pretty good. And, in fact, the push of the DNS update went smoothly and was ahead of schedule.

The Ugly

When the update was complete in 10 of our 14 data centers, we got word of a minor issue that was affecting some data getting pushed from the primary DNS database. In the process of diagnosing the minor issue, the primary DNS database was deleted. The new DNS system did its job and rapidly propagated the deletion across the 10 data centers where the update was live. The result was that, around 07:30 GMT, a recursive DNS resolver that looked up a domain and hit one of those 10 data centers would receive an invalid result. That meant those sites went offline, and it was entirely our fault.

The Bad

The DNS database is regularly backed up, but it took us about 5 minutes to recognize the issue, retrieve the backup, and push it to production. Our new DNS infrastructure pushed the update out to most of the data centers immediately, but because it was such a large update it took a few minutes to rebuild. In most places, new DNS requests were answered correctly again after a window of bad results lasting less than 10 minutes.

Unfortunately, DNS is a series of interconnected caches, many of which are not in our control. If you accessed a page during the issue, your ISP's recursive DNS likely cached the bad result. Most DNS providers don't make it easy to flush their cache (unlike a recursive provider such as OpenDNS, which does), so the outage lasted longer for people who were already seeing an issue. Generally, recursive DNS caches had flushed within 30 minutes, and by 08:00 GMT sites were back online.
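To make the caching effect concrete, here is a minimal sketch of a recursive resolver's cache. It is a toy model, not how any particular resolver is implemented: the `RecursiveCache` class, the `authoritative` callback, the 300-second TTL, and the `NXDOMAIN` and `192.0.2.1` answers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CachedAnswer:
    value: str          # the answer the resolver stored (good or bad)
    expires_at: float   # time (in seconds) at which the entry stops being served


class RecursiveCache:
    """Toy model of a recursive resolver's cache (hypothetical, for illustration)."""

    def __init__(self, authoritative):
        # authoritative: callable taking a name and returning (value, ttl_seconds)
        self.authoritative = authoritative
        self.entries = {}

    def resolve(self, name, now):
        entry = self.entries.get(name)
        if entry is not None and now < entry.expires_at:
            # Still within the TTL: serve the cached answer without re-asking.
            return entry.value
        value, ttl = self.authoritative(name)
        self.entries[name] = CachedAnswer(value, now + ttl)
        return value


# Assumed scenario: the authoritative side is broken at t=0 and fixed shortly after.
state = {"answer": "NXDOMAIN"}


def authoritative(name):
    return state["answer"], 300  # assumed TTL of 300 seconds


cache = RecursiveCache(authoritative)
first = cache.resolve("example.com", now=0)      # bad answer is fetched and cached
state["answer"] = "192.0.2.1"                    # authoritative data is restored
during = cache.resolve("example.com", now=120)   # cache still serves the bad answer
after = cache.resolve("example.com", now=300)    # TTL expired, fresh answer fetched
```

In this model the authoritative data is correct again well before `now=120`, yet the visitor behind this cache keeps seeing the bad answer until the entry expires. Only visitors whose resolver asked during the bad window are affected, which matches the observation that the outage was extended for people who were already seeing an issue.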

Two data centers did not apply all of the corrected DNS file updates. We are still investigating why, but our speculation is that, because the update affected a large number of records, the systems choked on the initial attempt at the updates. Requests that hit those data centers returned bad results for some sites until about 08:10 GMT. Some visitors in Europe and Asia would have seen a longer period of downtime on some sites as a result. Our system has multiple layers of redundancy, including at the data center level, so we removed the two data centers from rotation as soon as we recognized the issue, and affected visitors once again saw correct DNS results.

Two last problems exacerbated things. First, as is normal for us, we were dealing with two mid-sized DDoS attacks directed at some of our customers at the time. There was nothing abnormal about that, but having two fewer data centers in rotation made us less effective at stopping them and caused a small handful of 500 errors. The impact of those, however, was minimal (less than 0.001% of traffic for around a 12 minute period). Second, there were some DNS entries in our system for TLDs like co.nz that shouldn't have been there. While it wasn't a validated DNS zone record, the way that the DNS update was pushed caused a handful of records that fell under these TLDs to also see an extended outage. When we got reports of this, we identified the issue and removed the problematic entries.

The Good

There's not a ton of good in this incident itself. While the system status is green now, we will memorialize the incident on our system status page. I, along with the rest of the team, apologize for the problem to anyone who experienced it. We've built a system that is resilient to most attacks, but a mistake on our part can still cause a significant issue. This is the second significant period of network-wide downtime we've had. The first was more than a year ago and was also caused by an error we made ourselves. Any period of downtime is unacceptable to us and, again, we sincerely apologize.

Going forward, we've already added several layers of safeguards to prevent this, or a similar incident, from occurring. CloudFlare's technical systems are designed to learn over time, and that same ethos is in our team itself. While this incident was ugly, I was proud to see almost the entire engineering, ops, and support teams online into the wee hours helping customers sort out issues and building the safeguards to prevent an issue like this in the future.

What I had planned to write a blog post about this morning was our new DNS infrastructure, so I will end with a bit more detail on that. As described above, one of the main benefits is that DNS updates are even faster than before. In the past, DNS files were replicated every minute or so. Now changes are pushed instantly to our entire network. While that wasn't a great thing last night, in general we believe it is a big benefit to our publishers and makes us the fastest updating global authoritative DNS in the world.
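The trade-off between periodic and instant replication can be sketched with a toy model. This is not CloudFlare's actual replication code: the `Primary` and `Edge` classes, the `push` flag, and the `sync()` tick are hypothetical names invented for this example.

```python
class Edge:
    """A data center's local copy of the DNS records (hypothetical model)."""

    def __init__(self):
        self.records = {}


class Primary:
    """Primary DNS database that replicates to edges, either on a timer or on every change."""

    def __init__(self, edges, push):
        self.records = {}
        self.edges = edges
        self.push = push  # True: replicate on every change; False: wait for sync()

    def _replicate(self):
        for edge in self.edges:
            edge.records = dict(self.records)

    def set(self, name, value):
        self.records[name] = value
        if self.push:
            self._replicate()

    def wipe(self):
        # Models an accidental deletion of the primary database.
        self.records.clear()
        if self.push:
            self._replicate()

    def sync(self):
        # Periodic replication tick (e.g. once a minute in the old model).
        self._replicate()


def run(push):
    edge = Edge()
    primary = Primary([edge], push=push)
    primary.set("example.com", "192.0.2.1")
    primary.sync()   # both models have converged at this point
    primary.wipe()   # the mistake happens before the next periodic tick
    return dict(edge.records)


periodic_view = run(push=False)  # edge still answers from the last replicated copy
push_view = run(push=True)       # edge has already received the empty data set
```

In the periodic model the edge keeps serving the last good copy until the next tick, which leaves a short window to catch a mistake. In the push model good changes and bad changes alike arrive immediately, so the speed that benefits publishers also removes that window.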

The update to the DNS systems also includes hardening against some of the new breed of DNS-directed DDoS attacks we've begun to see. Going forward, this will help us provide even better protection against larger and larger attacks. Our goal is to stay ahead of the attackers and ensure that everyone on CloudFlare has state-of-the-art protection against attacks.

I apologize again to those of you who experienced downtime as a result of our mistake. We will learn from it and continue to build redundancy and resiliency into CloudFlare in order to earn your trust.

Post Mortem, DNS
