Why does one NGINX worker take all the load?

The Cloudflare Blog

The day my ping took countermeasures Announcing Claude Compliance API support with Cloudflare CASB Announcing Claude Managed Agents on Cloudflare Project Glasswing: what Mythos showed us Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Browser Run: now running on Cloudflare Containers, it’s faster and more scalable When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug Building For The Future How Cloudflare responded to the “Copy Fail” Linux vulnerability When DNSSEC goes wrong: how we responded to the .de TLD outage Code Orange: Fail Small is complete. The result is a stronger Cloudflare network Introducing Dynamic Workflows: durable execution that follows the tenant Post-quantum encryption for Cloudflare IPsec is generally available Agents can now create Cloudflare accounts, buy domains, and deploy Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen Moving past bots vs. humans Building the agentic cloud: everything we launched during Agents Week 2026 The AI engineering stack we built internally — on the platform we ship Orchestrating AI Code Review at scale Introducing the Agent Readiness score. Check to see if your site is agent-ready Shared Dictionaries: compression that keeps up with the agentic web Redirects for AI Training enforces canonical content Unweight: how we compressed an LLM 22% without sacrificing quality Agents that remember: introducing Agent Memory Agents Week: network performance update Introducing Flagship: feature flags built for the age of AI Cloudflare’s AI Platform: an inference layer designed for agents Building the foundation for running extra-large language models AI Search: the search primitive for your agents Deploy Postgres and MySQL databases with PlanetScale + Workers Artifacts: versioned storage that speaks Git Email for agents - Cloudflare Email Service now in public beta Project Think: building the next generation of AI agents on Cloudflare Introducing Agent Lee - a new interface to the Cloudflare stack Register domains wherever you build: Cloudflare Registrar API now in beta Browser Run: give your agents a browser Rearchitecting the Workflows control plane for the agentic era Add voice to your agent Managed OAuth for Access: make internal apps agent-ready in one click Securing non-human identities: automated revocation, OAuth, and scoped permissions Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh Building a CLI for all of Cloudflare Durable Objects in Dynamic Workers: Give each AI-generated app its own database Agents have their own computers with Sandboxes GA Dynamic, identity-aware, and secure Sandbox auth Welcome to Agents Week 500 Tbps of capacity: 16 years of scaling our global network From bytecode to bytes- automated magic packet generation Cloudflare targets 2029 for full post-quantum security How we built Organizations to help enterprises manage Cloudflare at scale Why we're rethinking cache for the AI era Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver Introducing EmDash — the spiritual successor to WordPress that solves plugin security Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers Cloudflare Client-Side Security: smarter detection, now open to everyone How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams A one-line Kubernetes fix that saved 600 hours a year Sandboxing AI agents, 100x faster Inside Gen 13- how we built our most powerful server yet Launching Cloudflare’s Gen 13 servers- trading cache for cores for 2x edge compute performance Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 Introducing Custom Regions for precision data control Standing up for the open Internet- why we appealed Italy’s Piracy Shield fine From legacy architecture to Cloudflare One Announcing Cloudflare Account Abuse Protection: prevent fraudulent attacks from bots and humans Slashing agent token costs by 98% with RFC 9457-compliant error responses AI Security for Apps is now generally available Building a security overview dashboard for actionable insights Investigating multi-vector attacks in Log Explorer Translating risk insights into actionable protection: leveling up security posture with Cloudflare and Mastercard Fixing request smuggling vulnerabilities in Pingora OSS deployments Active defense: introducing a stateful vulnerability scanner for APIs Complexity is a choice. SASE migrations shouldn’t take years. From the endpoint to the prompt: a unified data security vision in Cloudflare One Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient A QUICker SASE client: re-building Proxy Mode How Automatic Return Routing solves IP overlap Always-on detections: eliminating the WAF “log versus block” trade-off Mind the gap: new tools for continuous enforcement from boot to login Stop reacting to breaches and start preventing them with User Risk Scoring Defeating the deepfake: stopping laptop farms and insider threats Moving from license plates to badges: the Gateway Authorization Proxy Evolving Cloudflare’s Threat Intelligence Platform: actionable, scalable, and ETL-less Introducing the 2026 Cloudflare Threat Report See risk, fix risk: introducing Remediation in Cloudflare CASB How Cloudy translates complex security into human action From reactive to proactive: closing the phishing gap with LLMs Modernizing with agile SASE: a Cloudflare One blog takeover Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey The truly programmable SASE platform Toxic combinations: when small signals add up to a security incident We deserve a better streams API for JavaScript The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages ASPA: making Internet routing more secure Bringing more transparency to post-quantum usage, encrypted messaging, and routing security How we rebuilt Next.js with AI in one week Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform Cloudflare outage on February 20, 2026

Cloudflare Team · 2017-10-23 · via The Cloudflare Blog

Scaling up TCP servers is usually straightforward. Most deployments start by using a single process setup. When the need arises more worker processes are added. This is a scalability model for many applications, including HTTP servers like Apache, NGINX or Lighttpd.

CC BY-SA 2.0 image by Paul Townsend

Increasing the number of worker processes is a great way to overcome a single CPU core bottleneck, but opens a whole new set of problems.

There are generally three ways of designing a TCP server with regard to performance:

(a) Single listen socket, single worker process.

(b) Single listen socket, multiple worker processes.

(a) Single listen socket, single worker process This is the simplest model, where processing is limited to a single CPU. A single worker process is doing both accept() calls to receive the new connections and processing of the requests themselves. This model is the preferred Lighttpd setup.

(b) Single listen socket, multiple worker process The new connections sit in a single kernel data structure (the listen socket). Multiple worker processes are doing both the accept() calls and processing of the requests. This model enables some spreading of the inbound connections across multiple CPUs. This is the standard model for NGINX.

(c) Separate listen socket for each worker process By using the SO_REUSEPORT socket option it's possible to create a dedicated kernel data structure (the listen socket) for each worker process. This can avoid listen socket contention, which isn't really an issue unless you run at Google scale. It can also help in better balancing the load. More on that later.

At Cloudflare we run NGINX, and we are most familiar with the (b) model. In this blog post we'll describe a specific problem with this model, but let's start from the beginning.

Spreading the accept() load

Not many people realize that there are two different ways of spreading the accept() new connection load across multiple processes. Consider these two code snippets. Let's call the first one blocking-accept. It's best described with this pseudo code:

sd = bind(('127.0.0.1', 1024))
for i in range(3):
    if os.fork () == 0:
        while True:
            cd, _ = sd.accept()
            cd.close()
            print 'worker %d' % (i,)

The idea is to share an accept queue across processes, by calling blocking accept() from multiple workers concurrently.

The second model should be called epoll-and-accept:

sd = bind(('127.0.0.1', 1024))
sd.setblocking(False)
for i in range(3):
    if os.fork () == 0:
        ed = select.epoll()
        ed.register(sd, EPOLLIN | EPOLLEXCLUSIVE)
        while True:
            ed.poll()
            cd, _ = sd.accept()
            cd.close()
            print 'worker %d' % (i,)

The intention is to have a dedicated epoll in each worker process. The worker will call non-blocking accept() only when epoll reports new connections. We can avoid the usual thundering-herd issue by using the EPOLLEXCLUSIVE flag.

(Full code is available here)

While these programs look similar, their behavior differs subtly[1] [2]. Let's see what happens when we establish a couple of connections to each:

$ ./blocking-accept.py &
$ for i in `seq 6`; do nc localhost 1024; done
worker 2
worker 1
worker 0
worker 2
worker 1
worker 0

$ ./epoll-and-accept.py &
$ for i in `seq 6`; do nc localhost 1024; done
worker 0
worker 0
worker 0
worker 0
worker 0
worker 0

The blocking-accept model distributed connections across all workers - each got exactly 2 connections. The epoll-and-accept model on the other hand forwarded all the connections to the first worker. The remaining workers got no traffic whatsoever.

It might catch you by surprise, but Linux does different load balancing in both cases.

In the first one Linux will do proper FIFO-like round robin load balancing. Each process waiting on accept() is added to a queue and they will be served connections in order.

In the epoll-and-accept the load balancing algorithm differs: Linux seems to choose the last added process, a LIFO-like behavior. The process added to the waiting queue most recently will get the new connection. This behavior causes the busiest process, the one that only just went back to event loop, to receive the majority of the new connections. Therefore, the busiest worker is likely to get most of the load.

In fact, this is what we see in NGINX. Here's a dump of one of our synthetic tests where one worker is taking most of the load, while others are relatively underutilized:

Notice the last worker got almost no load, while the busiest is using 30% of CPU.

SO_REUSEPORT to the rescue

Linux supports a feature to work around this balancing problem - the SO_REUSEPORT socket option. We explained this in the (c) model, where the incoming connections are split into multiple separate accept queues. Usually it's one dedicated queue for each worker process.

Since the accept queues are not shared, and Linux spreads the load by a simple hashing logic, each worker will get statistically the same number of incoming connections. This results in much better balancing of the load. Each worker gets roughly a similar amount of traffic to handle:

Here all the workers are handling the work, with the busiest at 13.2% CPU while the least busy uses 9.3%. Much better load distribution than before.

This is better, however the balancing of the load is not the end of the story. Splitting the accept queue worsens the latency distribution in some circumstances! It's best explained by The Engineer guy:

I call this problem - the Waitrose vs Tesco Superstore cashiers. The Waitrose "combined queue model" is better at reducing the maximum latency. A single clogged cashier will not significantly affect the latency of whole system. The remainder of the load will be spread across the other less busy cashiers. The shared queue will be drained relatively promptly. On the other hand the Tesco Superstore model - of separate queues to each cashier - will suffer from the large latency issue. If a single cashier gets blocked, all the traffic waiting in its queue will stall. The maximum latency will grow if any single queue gets stuck.

In a case of increased load the (b) single accept queue model, while is not balancing the load evenly, is better for latency. We can show this by running another synthetic benchmark. Here is the latency distribution for 100k relatively CPU-intensive HTTP requests, with HTTP keepalives disabled, concurrency set to 200 and running against a single-queue (b) NGINX:

$ ./benchhttp -n 100000 -c 200 -r target:8181 http://a.a/
        | cut -d " " -f 1
        | ./mmhistogram -t "Duration in ms (single queue)"
min:3.61 avg:30.39 med=30.28 max:72.65 dev:1.58 count:100000
Duration in ms (single queue):
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                   0
     2 |                                                   1
     4 |                                                   16
     8 |                                                   67
    16 |************************************************** 91760
    32 |                                              **** 8155
    64 |                                                   1

As you can see the latency is very predictable. The median is almost equal to the average and standard deviation is small.

Here is the same test run against the SO_REUSEPORT multi-queue NGINX setup (c):

$ ./benchhttp -n 100000 -c 200 -r target:8181 http://a.a/
        | cut -d " " -f 1
        | ./mmhistogram -t "Duration in ms (multiple queues)"
min:1.49 avg:31.37 med=24.67 max:144.55 dev:25.27 count:100000
Duration in ms (multiple queues):
 value |-------------------------------------------------- count
     0 |                                                   0
     1 |                                                 * 1023
     2 |                                         ********* 5321
     4 |                                 ***************** 9986
     8 |                  ******************************** 18443
    16 |    ********************************************** 25852
    32 |************************************************** 27949
    64 |                              ******************** 11368
   128 |                                                   58

The average is comparable, the median dropped, the max value significantly increased, and most importantly the deviation is now gigantic!

The latency distribution is all over the place - it's not something you want to have on a production server.

(Instructions to reproduce the test)

Take this benchmark with a grain of salt though. We try to generate substantial load in order to prove the point. Depending on your setup it might be possible to shield your server from excessive traffic and prevent it from entering this degraded-latency state[3].

Conclusion

Balancing the incoming connections across multiple application workers is far from a solved problem. The single queue approach (b) scales well and keeps the max latency in check, but due to the epoll LIFO behavior the worker processes won't be evenly load balanced.

For workloads that require even balancing it might be beneficial to use the SO_REUSEPORT pattern (c). Unfortunately in high load situations the latency distribution might degrade.

The best generic solution seem to be to change the standard epoll behavior from LIFO to FIFO. There have been attempts to address this in the past by Jason Baron from Akamai (1, 2, 3), but none had landed in mainline so far.

Dealing with the internals of Linux and NGINX sound interesting? Join our world famous team in London, Austin, San Francisco and our elite office in Warsaw, Poland.

Of course comparing blocking accept() with a full featured epoll() event loop is not fair. Epoll is more powerful and allows us to create rich event driven programs. Using blocking accept is rather cumbersome or just not useful at all. To make any sense, blocking accept programs would require careful multi-threading programming, with a dedicated thread per request. ↩︎
Another surprise lurking in the corner - using blocking accept() on Linux is technically incorrect! Alan Burlison pointed out that calling close() on listen socket that has blocking accepts() will not interrupt them. This can result in a buggy behavior - you may get a successful accept() on a listen socket that no longer exists. When in doubt - avoid using blocking accept() in multithreaded programs. The workaround is to call shutdown() first, but this is not POSIX compliant. It's a mess. ↩︎
There are a couple of things you need to take into account when using NGINX reuseport implementation. First make sure you run 1.13.6+ or have this patch applied. Then, remember that due to a defect in Linux TCP REUSEPORT implementation reducing number of REUSEPORT queues will cause some waiting TCP connections to be dropped. ↩︎

此内容由惯性聚合(RSS阅读器)自动聚合整理，仅供阅读参考。原文来自 — 版权归原作者所有。

推荐订阅源

The Cloudflare Blog

Spreading the accept() load

SO_REUSEPORT to the rescue

Conclusion