This is strictly a violation of the TCP specification


Cloudflare Team · 2016-08-12 · via The Cloudflare Blog

I was asked to debug another weird issue on our network. Apparently every now and then a connection going through CloudFlare would time out with an HTTP 522 error.

CC BY 2.0 image by Chris Combe

A 522 error on CloudFlare indicates a connection issue between our edge server and the origin server. Most often the blame lies with the origin server: it is slow, offline or experiencing high packet loss. Less often the problem is on our side.

In the case I was debugging it was neither. The internet connectivity between CloudFlare and origin was perfect. No packet loss, flat latency. So why did we see a 522 error?

The root cause of this issue was pretty complex. After a lot of debugging we identified an important symptom: sometimes, once in thousands of runs, our test program failed to establish a connection between two daemons on the same machine. To be precise, an NGINX instance was trying to establish a TCP connection to our internal acceleration service on localhost. This failed with a timeout error.

Once we knew what to look for we were able to reproduce this with good old netcat. After a couple of dozen runs this is what we saw:

$ nc 127.0.0.1 5000  -v
nc: connect to 127.0.0.1 port 5000 (tcp) failed: Connection timed out

The view from strace:

socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(5000), sin_addr=inet_addr("127.0.0.1")}, 16) = -110 ETIMEDOUT

netcat calls connect() to establish a connection to localhost. This takes a long time and eventually fails with an ETIMEDOUT error. Tcpdump confirms that connect() did send SYN packets over loopback but never received any SYN+ACKs:

$ sudo tcpdump -ni lo port 5000 -ttttt -S
00:00:02.405887 IP 127.0.0.12.59220 > 127.0.0.1.5000: Flags [S], seq 220451580, win 43690, options [mss 65495,sackOK,TS val 15971607 ecr 0,nop,wscale 7], length 0
00:00:03.406625 IP 127.0.0.12.59220 > 127.0.0.1.5000: Flags [S], seq 220451580, win 43690, options [mss 65495,sackOK,TS val 15971857 ecr 0,nop,wscale 7], length 0
... 5 more ...

Hold on. What just happened here?

Well, we called connect() to localhost and it timed out. The SYN packets went off over loopback to localhost but were never answered.

Loopback congestion

CC BY 2.0 image by akj1706

The first thought is about Internet stability. Maybe the SYN packets were lost? A little-known fact is that the loopback interface itself cannot lose packets or become congested. The loopback works magically: when an application sends a packet to it, the packet is delivered to the appropriate target immediately, still within the handling of the send syscall. There is no buffering over loopback. Calling send over loopback runs iptables and the network stack delivery mechanisms, and places the packet in the appropriate queue of the target application. As long as the target application has some space in its buffers, packet loss over loopback is not possible.

Maybe the listening application misbehaved?

Under normal circumstances connections to localhost are not supposed to time out. There is one corner case when this may happen though - when the listening application does not call accept() fast enough.

When that happens, the default behavior is to drop new SYN packets: if the listening socket has a full accept queue, incoming SYNs are discarded. The intention is to cause push-back, to slow down the rate of incoming connections. The peers should eventually re-send their SYN packets, and hopefully by that time the accept queue will have free space again. This behavior is controlled by the tcp_abort_on_overflow sysctl.

But this accept queue overflow did not happen in our case. Our listening application had an empty accept queue. We checked this with the ss command:

$ ss -n4lt 'sport = :5000'
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
LISTEN     0      128                 *:5000               *:*

The Send-Q column shows the backlog (the accept queue size) given to the listen() syscall, 128 in our case. The Recv-Q column reports the number of outstanding connections in the accept queue: zero.
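To make the reading of those two columns concrete, here is a minimal Python sketch that extracts them from a LISTEN row. The function name `parse_listen_queue` is hypothetical, and the interpretation of the columns applies only to sockets in the LISTEN state.

```python
def parse_listen_queue(ss_row):
    """Return (accept_queue_length, backlog) from one LISTEN row of `ss -lt`.

    For listening sockets, Recv-Q is the number of connections waiting in the
    accept queue and Send-Q is the backlog passed to listen().
    """
    fields = ss_row.split()
    if len(fields) < 3 or fields[0] != "LISTEN":
        raise ValueError(f"not a LISTEN row: {ss_row!r}")
    return int(fields[1]), int(fields[2])


# Row copied from the ss output above.
row = "LISTEN     0      128                 *:5000               *:*"
queued, backlog = parse_listen_queue(row)
print(f"accept queue: {queued} of {backlog} slots in use")
```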

The problem

To recap: we are establishing connections to localhost. Most of them work fine but sometimes the connect() syscall times out. The SYN packets are being sent over loopback. Because it's loopback they are being delivered to the listening socket. The listening socket accept queue is empty, but we see no SYN+ACKs.

Further investigation revealed something peculiar. We noticed hundreds of CLOSE_WAIT sockets:

$ ss -n4t | head
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36599
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36467
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36154
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36412
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36536
...
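Counting states is easier than eyeballing a long listing. The following is a small Python sketch that tallies sockets per state from the text printed by ss. The helper name `tally_states` is hypothetical, and the sample rows are the first three from the output above.

```python
from collections import Counter


def tally_states(ss_output):
    """Count sockets per TCP state in the text printed by `ss -n4t`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row and truncation markers such as "...".
        if len(fields) < 5 or fields[0] == "State":
            continue
        counts[fields[0]] += 1
    return counts


sample = """\
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36599
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36467
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36154
...
"""
print(tally_states(sample).most_common())
```

A CLOSE-WAIT count that only ever grows is the signal to look for.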

What is CLOSE_WAIT anyway?

CC BY 2.0 image by DaveBleasdale

Citing the Red Hat docs:

CLOSE_WAIT - Indicates that the server has received the first FIN signal from the client and the connection is in the process of being closed. This means the socket is waiting for the application to execute close(). A socket can be in CLOSE_WAIT state indefinitely until the application closes it. Faulty scenarios would be like a file descriptor leak: server not executing close() on sockets leading to pile up of CLOSE_WAIT sockets.

This makes sense. Indeed, we were able to confirm the listening application leaks sockets. Hurray, good progress!

The leaking sockets don't explain everything though.

By default a Linux process can usually open up to 1,024 file descriptors. If our application did run out of file descriptors, the accept syscall would return the EMFILE error. If the application further mishandled this error case, this could result in losing incoming SYN packets: failed accept calls do not dequeue a socket from the accept queue, so the queue is never drained, keeps growing and eventually overflows. An overflowing accept queue could result in dropped SYN packets and failing connection attempts.
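For illustration, this is roughly what handling that error case could look like. It is a Python sketch under assumptions, not the code of the application discussed here: `is_fd_exhaustion` and `accept_with_backoff` are hypothetical helpers, and the pause and retry count are arbitrary.

```python
import errno
import time


def is_fd_exhaustion(exc):
    """True if accept() failed because no file descriptor could be allocated."""
    return exc.errno in (errno.EMFILE, errno.ENFILE)


def accept_with_backoff(listener, pause=0.1, attempts=50):
    """Accept one connection, backing off while descriptors are exhausted.

    A failed accept() leaves the pending connection in the accept queue, so
    retrying after a pause is safe. Silently ignoring the error is what lets
    the queue fill up.
    """
    for _ in range(attempts):
        try:
            return listener.accept()
        except OSError as exc:
            if not is_fd_exhaustion(exc):
                raise
            time.sleep(pause)  # give the process time to close() something
    raise OSError(errno.EMFILE, "still out of file descriptors")
```

Backing off does not fix a leak, but it keeps the failure visible instead of turning it into silent connection timeouts.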

But this is not what happened here. Our application hasn't run out of file descriptors yet. This can be verified by counting the entries in the /proc/<pid>/fd directory:

$ ls /proc/` pidof listener `/fd | wc -l
517

517 file descriptors are comfortably far from the 1,024 file descriptor limit. Also, we earlier showed with ss that the accept queue is empty. So why did our connections time out?
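The same check as the ls | wc -l pipeline can be scripted. Below is a minimal Python sketch. The helper names and the `proc_root` parameter are hypothetical, and the 1,024 default simply mirrors the figure used in the text rather than the limit actually in force.

```python
import os


def count_fds(pid, proc_root="/proc"):
    """Count open file descriptors of a process by listing <proc_root>/<pid>/fd."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def fd_headroom(pid, limit=1024, proc_root="/proc"):
    """Descriptors left before `limit` is reached.

    `limit` defaults to the 1,024 figure used in the text; the limit really
    applied to a process is listed as "Max open files" in /proc/<pid>/limits.
    """
    return limit - count_fds(pid, proc_root)


# Demo on the current process; only meaningful where /proc exists (Linux).
if os.path.isdir("/proc/self/fd"):
    me = os.getpid()
    print(count_fds(me), "descriptors open,", fd_headroom(me), "to spare")
```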

What really happens

The root cause of the problem is definitely our application leaking sockets. The symptoms though, the connection timing out, are still unexplained.

Time to raise the curtain of doubt. Here is what happens.

The listening application leaks sockets; they are stuck in the CLOSE_WAIT TCP state forever. These sockets look like (127.0.0.1:5000, 127.0.0.1:some-port). The client socket at the other end of the connection is (127.0.0.1:some-port, 127.0.0.1:5000), and is properly closed and cleaned up.

When the client application quits, the (127.0.0.1:some-port, 127.0.0.1:5000) socket enters the FIN_WAIT_1 state and then quickly transitions to FIN_WAIT_2. The FIN_WAIT_2 state should move on to TIME_WAIT once the client receives a FIN packet from the server, but this never happens because the server never calls close(). The FIN_WAIT_2 state eventually times out. On Linux the timeout is 60 seconds by default, controlled by the net.ipv4.tcp_fin_timeout sysctl.
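To check the timeout on a given machine, the sysctl can be read through procfs. This is a small Python sketch, assuming a Linux host. `read_sysctl` is a hypothetical helper, and its dot-to-slash mapping is only good for simple names like the ones mentioned in this post.

```python
import os


def read_sysctl(name, root="/proc/sys"):
    """Read an integer sysctl such as "net.ipv4.tcp_fin_timeout" via procfs."""
    path = os.path.join(root, *name.split("."))
    with open(path) as fh:
        return int(fh.read().strip())


try:
    timeout = read_sysctl("net.ipv4.tcp_fin_timeout")
    print("FIN_WAIT_2 timeout:", timeout, "seconds")
except (OSError, ValueError):
    print("sysctl not readable here (not Linux, or access is restricted)")
```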

This is where the problem starts. The (127.0.0.1:5000, 127.0.0.1:some-port) socket is still in the CLOSE_WAIT state, while (127.0.0.1:some-port, 127.0.0.1:5000) has been cleaned up, so its source port is free to be reused. When a new connection picks the same source port, the result is a total mess. The new client socket cannot advance from the SYN_SENT state, while the old server socket is stuck in CLOSE_WAIT. The SYN_SENT socket will eventually give up, failing with ETIMEDOUT.
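The collision can be pictured with a toy model. The Python sketch below is not kernel code and all names in it are made up. It only encodes one assumption: an incoming segment is matched against existing sockets by its exact address pair before the listening socket gets a chance to see it.

```python
class ToyTcpDemux:
    """Toy model of how an incoming SYN is matched to a socket.

    Not kernel code: it only captures that an exact (local, remote) match
    is tried before falling back to a listening socket.
    """

    def __init__(self):
        self.sockets = {}       # (local_addr, remote_addr) -> TCP state
        self.listeners = set()  # local addresses in LISTEN

    def on_syn(self, src, dst):
        key = (dst, src)  # the receiving side's view of the flow
        state = self.sockets.get(key)
        if state is not None:
            # The address pair is already taken, so the listener never sees the SYN.
            return f"no SYN+ACK ({state} socket owns this flow)"
        if dst in self.listeners:
            self.sockets[key] = "SYN_RECV"
            return "SYN+ACK"
        return "RST"


server, client = ("127.0.0.1", 5000), ("127.0.0.1", 36613)
stack = ToyTcpDemux()
stack.listeners.add(server)
stack.sockets[(server, client)] = "CLOSE_WAIT"  # leaked by the listener

print(stack.on_syn(client, server))                # colliding source port
print(stack.on_syn(("127.0.0.1", 36614), server))  # fresh source port
```

Only the client that happens to draw the source port of a leaked socket is affected, which matches how rarely the timeouts showed up.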

How to reproduce

It all starts with a listening application that leaks sockets and forgets to call close(). This kind of bug does happen in complex applications. An example of such buggy code is available here. When you run it nothing will happen initially. ss will show a usual listening socket:

$ go build listener.go && ./listener &
$ ss -n4tpl 'sport = :5000'
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
LISTEN     0      128                 *:5000               *:*      users:(("listener",81425,3))
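The linked listener source is not reproduced in this text. As a stand-in, here is a minimal Python sketch of the same kind of bug. It is not the original Go program. The function names are hypothetical, and it binds to 127.0.0.1 rather than to all addresses.

```python
import socket
import threading
import time


def leaky_accept_loop(listener, leaked):
    """Accept connections forever and never close them (the deliberate bug)."""
    while True:
        try:
            conn, _peer = listener.accept()
        except OSError:
            return  # the listening socket was closed
        leaked.append(conn)  # kept alive, never close()d: ends up in CLOSE_WAIT


def start_leaky_listener(port=0, host="127.0.0.1"):
    """Start the leaky listener in a background thread; port 0 picks a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(128)
    leaked = []
    threading.Thread(
        target=leaky_accept_loop, args=(listener, leaked), daemon=True
    ).start()
    return listener, leaked


# Bounded demo: one well-behaved client connects and then closes.
listener, leaked = start_leaky_listener()
client = socket.create_connection(listener.getsockname())
client.close()
time.sleep(0.2)  # crude wait for the accept thread to pick the connection up
print(len(leaked), "server-side socket(s) left open after the client closed")
```

To follow the ss walkthrough with this stand-in, call `start_leaky_listener(5000)` from an interactive session and keep that session running.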

Then we have a client application. The client behaves correctly - it establishes a connection and after a while it closes it. We can demonstrate this with nc:

$ nc -4 localhost 5000 &
$ ss -n4tp '( dport = :5000 or sport = :5000 )'
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
ESTAB      0      0           127.0.0.1:5000       127.0.0.1:36613  users:(("listener",81425,5))
ESTAB      0      0           127.0.0.1:36613      127.0.0.1:5000   users:(("nc",81456,3))

As you can see above, ss shows two TCP sockets, representing the two ends of the TCP connection. The client one is (127.0.0.1:36613, 127.0.0.1:5000), the server one (127.0.0.1:5000, 127.0.0.1:36613).

The next step is to gracefully close the client connection:

$ kill `pidof nc`

Now the connections enter TCP cleanup stages: FIN_WAIT_2 for the client connection, and CLOSE_WAIT for the server one (if you want to read more about these TCP states here's a recommended read):

$ ss -n4tp
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36613  users:(("listener",81425,5))
FIN-WAIT-2 0      0           127.0.0.1:36613      127.0.0.1:5000

After a while FIN_WAIT_2 will expire:

$ ss -n4tp
State      Recv-Q Send-Q  Local Address:Port    Peer Address:Port
CLOSE-WAIT 1      0           127.0.0.1:5000       127.0.0.1:36613  users:(("listener",81425,5))

But the CLOSE_WAIT socket stays! Since we have a leaked file descriptor in the listener program, the kernel is not allowed to move it to the LAST_ACK state. It is stuck in CLOSE_WAIT indefinitely. This stray CLOSE_WAIT would not be a problem if the same port pair were never reused. Unfortunately, reuse does happen, and that causes the problem.

To see this we need to launch hundreds of nc instances and hope the kernel will assign the colliding port number to one of them. The affected nc will be stuck in connect() for a while:

$ nc -v -4 localhost 5000 -w0
...

We can use ss to confirm that the ports indeed collide:

SYN-SENT   0  1   127.0.0.1:36613      127.0.0.1:5000   users:(("nc",89908,3))
CLOSE-WAIT 1  0   127.0.0.1:5000       127.0.0.1:36613  users:(("listener",81425,5))

In our example the kernel allocated the source address (127.0.0.1:36613) to the nc process. This TCP flow is okay to be used for a connection going to the listener application. But the listener will not be able to allocate a flow in the reverse direction, since (127.0.0.1:5000, 127.0.0.1:36613) from the previous connection is still in use and remains in the CLOSE_WAIT state.

The kernel gets confused. It retransmits the SYN packets, but they will never be answered since the other TCP socket is stuck in the CLOSE_WAIT state. Eventually our affected netcat will die with an unhappy ETIMEDOUT error message:

...
nc: connect to localhost port 5000 (tcp) failed: Connection timed out

If you want to reproduce this weird scenario consider running this script. It will greatly increase the probability of netcat hitting the conflicted socket:

$ for i in `seq 500`; do nc -v -4 -s 127.0.0.1 localhost 5000 -w0; done

A little-known fact is that the source port automatically assigned by the kernel is incremental, unless you select the source IP manually. In that case the source port is random. This bash script will create a minefield of CLOSE_WAIT sockets randomly distributed across the ephemeral port range.

Final words

If there's a moral to the story, it's to watch out for CLOSE_WAIT sockets. Their presence indicates leaking sockets, and with leaking sockets some incoming connections may time out. The presence of many FIN_WAIT_2 sockets says the problem is not on the current machine but on the remote end of the connection.

Furthermore, this bug shows that it is possible for the states of the two ends of a TCP connection to be at odds, even if the connection is over the loopback interface.

It seems that the design decisions made by the BSD Socket API have unexpected, long-lasting consequences. If you think about it: why exactly can a socket automatically expire out of the FIN_WAIT_2 state, but not move off CLOSE_WAIT after some grace time? This is very confusing... And it should be! The original TCP specification does not allow an automatic state transition out of the FIN_WAIT_2 state! According to the spec, FIN_WAIT_2 is supposed to persist until the application on the other side cleans up.

Let me leave you with the tcp(7) manpage describing the tcp_fin_timeout setting:

tcp_fin_timeout (integer; default: 60)
      This specifies how many seconds to wait for a final FIN packet
      before the socket is forcibly closed.  This is strictly a
      violation of the TCP specification, but required to prevent
      denial-of-service attacks.

I think now we understand why automatically closing FIN_WAIT_2 is strictly speaking a violation of the TCP specification.

Do you enjoy playing with low level networking bits? Are you interested in dealing with some of the largest DDoS attacks ever seen?

If so you should definitely have a look at the open positions in our London, San Francisco, Singapore, Champaign (IL) and Austin (TX) offices!
