A Post Mortem on this Morning's Incident
Cloudflare Team · 2016-06-21 · via The Cloudflare Blog


We would like to share more details with our customers and readers on the Internet outages that occurred this morning and earlier in the week, and what we are doing to prevent these from happening again.

June 17 incident

On June 17, at 08:32 UTC, our systems detected significant packet loss between multiple destinations on the backbone network of one of our major transit providers, Telia Carrier. While our engineers were analysing the incident, the loss became intermittent and finally disappeared.

Figure: Packet loss on Telia Carrier (AS1299)

Today’s incident

Today, June 20, at 12:10 UTC, our systems again detected massive packet loss on the backbone network of one of our major transit providers: Telia Carrier.

Figure: Packet loss on Telia Carrier (AS1299)

Typically, transit providers are very reliable and transport all of our packets from one point of the globe to the other without loss. That’s what we pay them for. In this case, our packets (and those of other Telia customers) were being dropped.

While Internet users usually take it for granted that they can reach any destination in the world from their homes and businesses, the reality is harsher than that. Our planet is big, and the Internet pipes are not always reliable. Fortunately, most Internet traffic is carried over TCP, which retransmits lost packets. That is especially useful on lossy links. In most cases, you won’t notice these packets being lost and retransmitted. However, when the loss is too significant, as was the case this morning, your browser can’t do much.
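To see why retransmission hides light loss but not heavy loss, here is a rough sketch. It assumes every transmission is lost independently with the same probability, which is a simplification: real loss tends to be bursty, and TCP also reduces its sending rate when it sees loss. The function names are illustrative only.

```python
def delivery_probability(loss_rate: float, attempts: int) -> float:
    """Probability that a packet arrives within `attempts` tries.

    Assumes each try is lost independently with probability `loss_rate`.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    return 1.0 - loss_rate ** attempts


def expected_transmissions(loss_rate: float) -> float:
    """Mean number of sends until one copy arrives (geometric distribution)."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    return 1.0 / (1.0 - loss_rate)


# Compare a healthy link, a degraded link, and a badly broken one.
for loss_rate in (0.01, 0.10, 0.50):
    print(
        f"loss={loss_rate:.0%} "
        f"delivered within 3 tries={delivery_probability(loss_rate, 3):.4f} "
        f"expected sends={expected_transmissions(loss_rate):.2f}"
    )
```

Under this model, each lost packet also costs at least one retransmission timeout, so at high loss rates the delay adds up even when the data eventually arrives.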

Our systems caught the problem instantly and recorded it. Here is an animated map of the packet loss being detected during the event:

Figure: CloudFlare detects packet loss (denoted by thickness)

Because transit providers are usually reliable, they tend to fix their problems rather quickly. In this case, that did not happen and we had to take our ports down with Telia at 12:30 UTC. Because we are interconnected with most Tier 1 providers, we are able to shift traffic away from one problematic provider and let others, who are performing better, take care of transporting our packets.

Impact on our customers

We saw a big increase in 522 errors. A 522 HTTP error indicates that our servers are unable to reach the origin servers of our customers. You can see the spike and the breakdown here:

Figure: Spike in 522 errors across PoPs (in reaching origin servers)
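A per-PoP breakdown like the one in the figure can be approximated by computing the share of responses with status 522 at each location. The sketch below is only an illustration: the `(pop, status)` record format, the sample data, and the PoP labels are made up, and it does not describe our actual logging pipeline.

```python
from collections import Counter


def rate_522_by_pop(records):
    """Return {pop: share of responses that were 522} from (pop, status) pairs."""
    totals = Counter()
    errors = Counter()
    for pop, status in records:
        totals[pop] += 1
        if status == 522:
            errors[pop] += 1
    return {pop: errors[pop] / totals[pop] for pop in totals}


# Hypothetical sample: one PoP struggling to reach origins, one healthy.
sample = [
    ("AMS", 200), ("AMS", 522), ("AMS", 522), ("AMS", 200),
    ("JNB", 200), ("JNB", 200),
]
rates = rate_522_by_pop(sample)
for pop, rate in sorted(rates.items()):
    print(f"{pop}: {rate:.0%} of responses were 522")
```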

On our communication

Communicating during this kind of incident is both crucial and difficult. Our customers understandably expect prompt, accurate information and want the impact to stop as soon as possible. In today’s incident, we identified weaknesses in our communication: the scope of the incident was incorrectly identified as affecting Europe only, and our response time was not adequate. We want to reassure you that we are taking all the steps needed to improve our communication, including the implementation of automated detection and mitigation systems that can react much more quickly than any human operator. We already have such systems in place for our smaller data centers and are actively testing their accuracy and efficacy before turning them on for larger PoPs.

Taking down an important transit provider globally is not an easy decision, and many cautious steps have to be taken before doing it. The Internet and its associated protocols form a community based on mutual trust. Any weak link can cause the entire chain to fail, and keeping it working requires collaboration and cooperation from all parties.

We know how important it is to communicate on our status page. We heard from our customers and took the necessary steps to improve on our communication. Our support team is working on improvements in how we update our status page and reviewing the content for accuracy as well as transparency.

Building a resilient network

Even as CloudFlare has grown to become critical Internet infrastructure sitting in front of four million Internet-facing applications, investing in greater resiliency continues to be a work in progress. We pursue it through a combination of greater interconnection, automated mitigation, and increased failover capacity.

We fill our cache from the public Internet using multiple transit providers (such as Telia), and deliver traffic to local eyeball networks using transit providers and multiple peering relationships. To the extent possible, we maintain buffer capacity with our providers to allow them to bear the impact of potential failures on multiple other networks. Spreading out traffic across providers allows for diversity and reduces the impact of potential outages from our upstream providers. Even so, today’s incident impacted a significant fraction of traffic that relied on the Telia backbone.
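The idea of keeping buffer capacity so that other providers can absorb a failure can be checked with a simple "lose any one provider" calculation. The sketch below is hypothetical: the provider names and Gbps figures are invented, and it assumes traffic from a failed provider can move freely to any other provider, which real routing and geography do not always allow.

```python
def survives_single_failure(capacity_gbps, traffic_gbps):
    """For each provider, report whether the others could absorb its traffic.

    Simplification: a failed provider's traffic can move to any other provider.
    """
    result = {}
    for failed in traffic_gbps:
        # Spare room left on every provider except the one that failed.
        spare = sum(
            capacity_gbps[p] - traffic_gbps[p]
            for p in traffic_gbps
            if p != failed
        )
        result[failed] = spare >= traffic_gbps[failed]
    return result


# Invented numbers purely for illustration.
capacity = {"provider_a": 100, "provider_b": 100, "provider_c": 40}
traffic = {"provider_a": 80, "provider_b": 60, "provider_c": 10}
headroom = survives_single_failure(capacity, traffic)
for provider, ok in headroom.items():
    print(f"losing {provider}: {'absorbed' if ok else 'not enough headroom'}")
```

A `False` entry marks a provider whose loss could not be fully absorbed at current load, which is where extra capacity or interconnection would help most.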

Figure: Traffic switching from one provider to the other after we reroute

Where possible, we try to fail over traffic to a redundant provider or data center while keeping traffic within the same country.

BGP is the protocol used to route packets between autonomous networks on the Internet. While it does a great job of keeping interconnections alive, it has no built-in mechanism to detect packet loss or performance issues on a path.

We have been building a mechanism that augments BGP to proactively detect packet loss and move traffic away from providers experiencing it. Because this system is currently activated only for our smallest and most remote locations, it did not trigger during this morning’s incident. We plan to extend the capability to all our PoPs within the next two weeks, switching from a manual reaction to an automatic one. For example, in this screenshot, you can see our PoP in Johannesburg being automatically removed from our network because of problems detected when connecting to origin servers:

Figure: Johannesburg PoP gracefully fails over to nearest PoP
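The loss-based failover described above can be sketched as a small monitor that watches probe results for one provider and decides when to drain traffic from it. This is not our production system: the class name, window size, and thresholds are assumptions chosen for illustration, and the action taken once a provider is drained (for example, changing what is announced over BGP) is left out. Two separate thresholds are used so that the decision does not flap when loss hovers around a single cutoff.

```python
from collections import deque


class LossMonitor:
    """Tracks probe results for one provider and decides when to drain it."""

    def __init__(self, window=20, drain_above=0.10, restore_below=0.02):
        self.results = deque(maxlen=window)  # True means the probe was lost
        self.drain_above = drain_above       # drain when loss exceeds this
        self.restore_below = restore_below   # restore when loss falls below this
        self.drained = False

    def loss_rate(self):
        """Share of lost probes in the current window (0.0 if empty)."""
        return sum(self.results) / len(self.results) if self.results else 0.0

    def record(self, lost):
        """Add one probe result and return whether the provider is drained."""
        self.results.append(bool(lost))
        # Only decide once the window is full, to avoid reacting to noise.
        if len(self.results) == self.results.maxlen:
            rate = self.loss_rate()
            if not self.drained and rate > self.drain_above:
                self.drained = True
            elif self.drained and rate < self.restore_below:
                self.drained = False
        return self.drained


# Hypothetical probe sequence: healthy, a burst of loss, then recovery.
monitor = LossMonitor(window=10)
probes = [False] * 10 + [True] * 3 + [False] * 10
states = [monitor.record(lost) for lost in probes]
print("drained at probe", states.index(True))
```

In this run the provider is drained on the second lost probe and restored only after the window contains no lost probes at all, so recovery is deliberately slower than detection.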

Summary

We understand how critical our infrastructure is for our customers’ businesses, and so we will continue to move towards completely automated systems to deal with this type of incident. Our goal is to minimize disruptions and outages for our customers regardless of the origin of the issue.

Tags: Post Mortem, Reliability, Traffic, BGP, Vulnerabilities, Security
