惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

爱范儿
爱范儿
E
Exploit-DB.com RSS Feed
Google DeepMind News
Google DeepMind News
F
Full Disclosure
D
Darknet – Hacking Tools, Hacker News & Cyber Security
T
ThreatConnect
Stack Overflow Blog
Stack Overflow Blog
Last Week in AI
Last Week in AI
Martin Fowler
Martin Fowler
G
GRAHAM CLULEY
C
Check Point Blog
T
Threatpost
I
Intezer
Spread Privacy
Spread Privacy
The Register - Security
The Register - Security
Project Zero
Project Zero
月光博客
月光博客
人人都是产品经理
人人都是产品经理
阮一峰的网络日志
阮一峰的网络日志
D
DataBreaches.Net
IT之家
IT之家
Malwarebytes
Malwarebytes
T
The Blog of Author Tim Ferriss
P
Privacy International News Feed
P
Palo Alto Networks Blog
T
The Exploit Database - CXSecurity.com
量子位
李成银的技术随笔
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
Cisco Talos Blog
Cisco Talos Blog
Know Your Adversary
Know Your Adversary
美团技术团队
The GitHub Blog
The GitHub Blog
T
Tor Project blog
M
MIT News - Artificial intelligence
奇客Solidot–传递最新科技情报
奇客Solidot–传递最新科技情报
Google Online Security Blog
Google Online Security Blog
P
Proofpoint News Feed
有赞技术团队
有赞技术团队
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
博客园 - 司徒正美
OSCHINA 社区最新新闻
OSCHINA 社区最新新闻
C
Comments on: Blog
T
Threat Research - Cisco Blogs
aimingoo的专栏
aimingoo的专栏
Security Latest
Security Latest
NISL@THU
NISL@THU
The Cloudflare Blog
H
Help Net Security
Recent Commits to openclaw:main
Recent Commits to openclaw:main

The Cloudflare Blog

The day my ping took countermeasures Announcing Claude Compliance API support with Cloudflare CASB Announcing Claude Managed Agents on Cloudflare Project Glasswing: what Mythos showed us Our billing pipeline was suddenly slow. The culprit was a hidden bottleneck in ClickHouse Browser Run: now running on Cloudflare Containers, it’s faster and more scalable When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug Building For The Future How Cloudflare responded to the “Copy Fail” Linux vulnerability When DNSSEC goes wrong: how we responded to the .de TLD outage Code Orange: Fail Small is complete. The result is a stronger Cloudflare network Introducing Dynamic Workflows: durable execution that follows the tenant Post-quantum encryption for Cloudflare IPsec is generally available Agents can now create Cloudflare accounts, buy domains, and deploy Shutdowns, power outages, and conflict: a review of Q1 2026 Internet disruptions Making Rust Workers reliable: panic and abort recovery in wasm‑bindgen Moving past bots vs. humans Building the agentic cloud: everything we launched during Agents Week 2026 The AI engineering stack we built internally — on the platform we ship Orchestrating AI Code Review at scale Introducing the Agent Readiness score. Check to see if your site is agent-ready Shared Dictionaries: compression that keeps up with the agentic web Redirects for AI Training enforces canonical content Unweight: how we compressed an LLM 22% without sacrificing quality Agents that remember: introducing Agent Memory Agents Week: network performance update Introducing Flagship: feature flags built for the age of AI Cloudflare’s AI Platform: an inference layer designed for agents Building the foundation for running extra-large language models AI Search: the search primitive for your agents Deploy Postgres and MySQL databases with PlanetScale + Workers Artifacts: versioned storage that speaks Git Email for agents - Cloudflare Email Service now in public beta Project Think: building the next generation of AI agents on Cloudflare Introducing Agent Lee - a new interface to the Cloudflare stack Register domains wherever you build: Cloudflare Registrar API now in beta Browser Run: give your agents a browser Rearchitecting the Workflows control plane for the agentic era Add voice to your agent Managed OAuth for Access: make internal apps agent-ready in one click Securing non-human identities: automated revocation, OAuth, and scoped permissions Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP Secure private networking for everyone: users, nodes, agents, Workers — introducing Cloudflare Mesh Building a CLI for all of Cloudflare Durable Objects in Dynamic Workers: Give each AI-generated app its own database Agents have their own computers with Sandboxes GA Dynamic, identity-aware, and secure Sandbox auth Welcome to Agents Week 500 Tbps of capacity: 16 years of scaling our global network From bytecode to bytes- automated magic packet generation Cloudflare targets 2029 for full post-quantum security How we built Organizations to help enterprises manage Cloudflare at scale Why we're rethinking cache for the AI era Our ongoing commitment to privacy for the 1.1.1.1 public DNS resolver Introducing EmDash — the spiritual successor to WordPress that solves plugin security Introducing Programmable Flow Protection: custom DDoS mitigation logic for Magic Transit customers Cloudflare Client-Side Security: smarter detection, now open to everyone How we use Abstract Syntax Trees (ASTs) to turn Workflows code into visual diagrams A one-line Kubernetes fix that saved 600 hours a year Sandboxing AI agents, 100x faster Inside Gen 13- how we built our most powerful server yet Launching Cloudflare’s Gen 13 servers- trading cache for cores for 2x edge compute performance Powering the agents: Workers AI now runs large models, starting with Kimi K2.5 Introducing Custom Regions for precision data control Standing up for the open Internet- why we appealed Italy’s Piracy Shield fine From legacy architecture to Cloudflare One Announcing Cloudflare Account Abuse Protection: prevent fraudulent attacks from bots and humans Slashing agent token costs by 98% with RFC 9457-compliant error responses AI Security for Apps is now generally available Building a security overview dashboard for actionable insights Investigating multi-vector attacks in Log Explorer Translating risk insights into actionable protection: leveling up security posture with Cloudflare and Mastercard Fixing request smuggling vulnerabilities in Pingora OSS deployments Active defense: introducing a stateful vulnerability scanner for APIs Complexity is a choice. SASE migrations shouldn’t take years. From the endpoint to the prompt: a unified data security vision in Cloudflare One Ending the "silent drop": how Dynamic Path MTU Discovery makes the Cloudflare One Client more resilient A QUICker SASE client: re-building Proxy Mode How Automatic Return Routing solves IP overlap Always-on detections: eliminating the WAF “log versus block” trade-off Mind the gap: new tools for continuous enforcement from boot to login Stop reacting to breaches and start preventing them with User Risk Scoring Defeating the deepfake: stopping laptop farms and insider threats Moving from license plates to badges: the Gateway Authorization Proxy Evolving Cloudflare’s Threat Intelligence Platform: actionable, scalable, and ETL-less Introducing the 2026 Cloudflare Threat Report See risk, fix risk: introducing Remediation in Cloudflare CASB How Cloudy translates complex security into human action From reactive to proactive: closing the phishing gap with LLMs Modernizing with agile SASE: a Cloudflare One blog takeover Beyond the blank slate: how Cloudflare accelerates your Zero Trust journey The truly programmable SASE platform Toxic combinations: when small signals add up to a security incident We deserve a better streams API for JavaScript The most-seen UI on the Internet? Redesigning Turnstile and Challenge Pages ASPA: making Internet routing more secure Bringing more transparency to post-quantum usage, encrypted messaging, and routing security How we rebuilt Next.js with AI in one week Cloudflare One is the first SASE offering modern post-quantum encryption across the full platform Cloudflare outage on February 20, 2026
io_submit: The epoll alternative you've never heard about
Cloudflare Team · 2019-01-04 · via The Cloudflare Blog

2019-01-04

6 min read

My curiosity was piqued by an LWN article about IOCB_CMD_POLL - A new kernel polling interface. It discusses an addition of a new polling mechanism to Linux AIO API, which was merged in 4.18 kernel. The whole idea is rather intriguing. The author of the patch is proposing to use the Linux AIO API with things like network sockets.

Hold on. The Linux AIO is designed for, well, Asynchronous disk IO! Disk files are not the same thing as network sockets! Is it even possible to use the Linux AIO API with network sockets in the first place?

The answer turns out to be a strong YES! In this article I'll explain how to use the strengths of Linux AIO API to write better and faster network servers.

But before we start, what is Linux AIO anyway?

Photo by Scott Schiller CC/BY/2.0

Introduction to Linux AIO

Linux AIO exposes asynchronous disk IO to userspace software.

Historically on Linux, all disk operations were blocking. Whether you did open(), read(), write() or fsync(), you could be sure your thread would stall if the needed data and meta-data was not ready in disk cache. This usually isn't a problem. If you do small amount of IO or have plenty of memory, the disk syscalls would gradually fill the cache and on average be rather fast.

The IO operation performance drops for IO-heavy workloads, like databases or caching web proxies. In such applications it would be tragic if a whole server stalled, just because some odd read() syscall had to wait for disk.

To work around this problem, applications use one of the three approaches:

(1) Use thread pools and offload blocking syscalls to worker threads. This is what glibc POSIX AIO (not to be confused with Linux AIO) wrapper does. (See: IBM's documentation). This is also what we ended up doing in our application at Cloudflare - we offloaded read() and open() calls to a thread pool.

(2) Pre-warm the disk cache with posix_fadvise(2) and hope for the best.

(3) Use Linux AIO with XFS file system, file opened with O_DIRECT, and avoid the undocumented pitfalls.

None of these methods is perfect. Even the Linux AIO if used carelessly, could still block in the io_submit() call. This was recently mentioned in another LWN article:

The Linux asynchronous I/O (AIO) layer tends to have many critics and few defenders, but most people at least expect it to actually be asynchronous. In truth, an AIO operation can block in the kernel for a number of reasons, making AIO difficult to use in situations where the calling thread truly cannot afford to block.

Now that we know what Linux AIO API doesn't do well, let's see where it shines.

Simplest Linux AIO program

To use Linux AIO you first need to define the 5 needed syscalls - glibc doesn't provide wrapper functions. To use Linux AIO we need:

(1) First call io_setup() to set up the aio_context data structure. Kernel will hand us an opaque pointer.

(2) Then we can call io_submit() to submit a vector of "I/O control blocks" struct iocb for processing.

(3) Finally, we can call io_getevents() to block and wait for a vector of struct io_event - completion notification of the iocb's.

There are 8 commands that can be submitted in an iocb. Two read, two write, two fsync variants and a POLL command introduced in 4.18 Kernel:

IOCB_CMD_PREAD = 0,
IOCB_CMD_PWRITE = 1,
IOCB_CMD_FSYNC = 2,
IOCB_CMD_FDSYNC = 3,
IOCB_CMD_POLL = 5,   /* from 4.18 */
IOCB_CMD_NOOP = 6,
IOCB_CMD_PREADV = 7,
IOCB_CMD_PWRITEV = 8,

The struct iocb passed to io_submit is large, and tuned for disk IO. Here's a simplified version:

struct iocb {
  __u64 data;           /* user data */
  ...
  __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
  ...
  __u32 aio_fildes;     /* file descriptor */
  __u64 aio_buf;        /* pointer to buffer */
  __u64 aio_nbytes;     /* buffer size */
...
}

The completion notification retrieved from io_getevents:

struct io_event {
  __u64  data;  /* user data */
  __u64  obj;   /* pointer to request iocb */
  __s64  res;   /* result code for this event */
  __s64  res2;  /* secondary result */
};

Let's try an example. Here's the simplest program reading /etc/passwd file with Linux AIO API:

fd = open("/etc/passwd", O_RDONLY);

aio_context_t ctx = 0;
r = io_setup(128, &ctx);

char buf[4096];
struct iocb cb = {.aio_fildes = fd,
                  .aio_lio_opcode = IOCB_CMD_PREAD,
                  .aio_buf = (uint64_t)buf,
                  .aio_nbytes = sizeof(buf)};
struct iocb *list_of_iocb[1] = {&cb};

r = io_submit(ctx, 1, list_of_iocb);

struct io_event events[1] = {{0}};
r = io_getevents(ctx, 1, 1, events, NULL);

bytes_read = events[0].res;
printf("read %lld bytes from /etc/passwd\n", bytes_read);

Full source is, as usual, on GitHub. Here's a strace of this program:

openat(AT_FDCWD, "/etc/passwd", O_RDONLY)
io_setup(128, [0x7f4fd60ea000])
io_submit(0x7f4fd60ea000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7ffc5ff703d0, aio_nbytes=4096, aio_offset=0}])
io_getevents(0x7f4fd60ea000, 1, 1, [{data=0, obj=0x7ffc5ff70390, res=2494, res2=0}], NULL)

This all worked fine! But the disk read was not asynchronous: the io_submit syscall blocked and did all the work! The io_getevents call finished instantly. We could try to make the disk read async, but this requires O_DIRECT flag which skips the caches.

Let's try to better illustrate the blocking nature of io_submit on normal files. Here's similar example, showing strace when reading large 1GiB block from /dev/zero:

io_submit(0x7fe1e800a000, 1, [{aio_lio_opcode=IOCB_CMD_PREAD, aio_fildes=3, aio_buf=0x7fe1a79f4000, aio_nbytes=1073741824, aio_offset=0}]) \
    = 1 <0.738380>
io_getevents(0x7fe1e800a000, 1, 1, [{data=0, obj=0x7fffb9588910, res=1073741824, res2=0}], NULL) \
    = 1 <0.000015>

The kernel spent 738ms in io_submit and only 15us in io_getevents. The kernel behaves the same way with network sockets - all the work is done in io_submit.

Photo by Helix84 CC/BY-SA/3.0

Linux AIO with sockets - batching

The implementation of io_submit is rather conservative. Unless the passed descriptor is O_DIRECT file, it will just block and perform the requested action. In case of network sockets it means:

  • For blocking sockets IOCB_CMD_PREAD will hang until a packet arrives.

  • For non-blocking sockets IOCB_CMD_PREAD will -11 (EAGAIN) return code.

These are exactly the same semantics as for vanilla read() syscall. It's fair to say that for network sockets io_submit is no smarter than good old read/write calls.

It's important to note the iocb requests passed to kernel are evaluated in-order sequentially.

While Linux AIO won't help with async operations, it can definitely be used for syscall batching!

If you have a web server needing to send and receive data from hundreds of network sockets, using io_submit can be a great idea. This would avoid having to call send and recv hundreds of times. This will improve performance - jumping back and forth from userspace to kernel is not free, especially since the meltdown and spectre mitigations.

One buffer

Multiple buffers

One file descriptor

read()

readv()

Many file descriptors

io_submit + IOCB_CMD_PREAD

io_submit + IOCB_CMD_PREADV

To illustrate the batching aspect of io_submit, let's create a small program forwarding data from one TCP socket to another. In simplest form, without Linux AIO, the program would have a trivial flow like this:

while True:
  d = sd1.read(4096)
  sd2.write(d)

We can express the same logic with Linux AIO. The code will look like this:

struct iocb cb[2] = {{.aio_fildes = sd2,
                      .aio_lio_opcode = IOCB_CMD_PWRITE,
                      .aio_buf = (uint64_t)&buf[0],
                      .aio_nbytes = 0},
                     {.aio_fildes = sd1,
                     .aio_lio_opcode = IOCB_CMD_PREAD,
                     .aio_buf = (uint64_t)&buf[0],
                     .aio_nbytes = BUF_SZ}};
struct iocb *list_of_iocb[2] = {&cb[0], &cb[1]};
while(1) {
  r = io_submit(ctx, 2, list_of_iocb);

  struct io_event events[2] = {};
  r = io_getevents(ctx, 2, 2, events, NULL);
  cb[0].aio_nbytes = events[1].res;
}

This code submits two jobs to io_submit. First, request to write data to sd2 then to read data from sd1. After the read is done, the code fixes up the write buffer size and loops again. The code does a cool trick - the first write is of size 0. We are doing so since we can fuse write+read in one io_submit (but not read+write). After a read is done we have to fix the write buffer size.

Is this code faster than the simple read/write version? Not yet. Both versions have two syscalls: read+write and io_submit+io_getevents. Fortunately, we can improve it.

Getting rid of io_getevents

When running io_setup(), the kernel allocates a couple of pages of memory for the process. This is how this memory block looks like in /proc//maps:

marek:~$ cat /proc/`pidof -s aio_passwd`/maps
...
7f7db8f60000-7f7db8f63000 rw-s 00000000 00:12 2314562     /[aio] (deleted)
...

The [aio] memory region (12KiB in my case) was allocated by the io_setup. This memory range is used a ring buffer storing the completion events. In most cases, there isn't any reason to call the real io_getevents syscall. The completion data can be easily retrieved from the ring buffer without the need of consulting the kernel. Here is a fixed version of the code:

int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                 struct io_event *events, struct timespec *timeout)
{
    int i = 0;

    struct aio_ring *ring = (struct aio_ring*)ctx;
    if (ring == NULL || ring->magic != AIO_RING_MAGIC) {
        goto do_syscall;
    }

    while (i < max_nr) {
        unsigned head = ring->head;
        if (head == ring->tail) {
            /* There are no more completions */
            break;
        } else {
            /* There is another completion to reap */
            events[i] = ring->events[head];
            read_barrier();
            ring->head = (head + 1) % ring->nr;
            i++;
        }
    }

    if (i == 0 && timeout != NULL && timeout->tv_sec == 0 && timeout->tv_nsec == 0) {
        /* Requested non blocking operation. */
        return 0;
    }

    if (i && i >= min_nr) {
        return i;
    }

do_syscall:
    return syscall(__NR_io_getevents, ctx, min_nr-i, max_nr-i, &events[i], timeout);
}

Here's full code. This ring buffer interface is poorly documented. I adapted this code from the axboe/fio project.

With this code fixing the io_getevents function, our Linux AIO version of the TCP proxy needs only one syscall per loop, and indeed is a tiny bit faster than the read+write code.

Photo by Train Photos CC/BY-SA/2.0

Epoll alternative

With the addition of IOCB_CMD_POLL in kernel 4.18, one could use io_submit also as select/poll/epoll equivalent. For example, here's some code waiting for data on a socket:

struct iocb cb = {.aio_fildes = sd,
                  .aio_lio_opcode = IOCB_CMD_POLL,
                  .aio_buf = POLLIN};
struct iocb *list_of_iocb[1] = {&cb};

r = io_submit(ctx, 1, list_of_iocb);
r = io_getevents(ctx, 1, 1, events, NULL);

Full code. Here's the strace view:

io_submit(0x7fe44bddd000, 1, [{aio_lio_opcode=IOCB_CMD_POLL, aio_fildes=3}]) \
    = 1 <0.000015>
io_getevents(0x7fe44bddd000, 1, 1, [{data=0, obj=0x7ffef65c11a8, res=1, res2=0}], NULL) \
    = 1 <1.000377>

As you can see this time the "async" part worked fine, the io_submit finished instantly and the io_getevents successfully blocked for 1s while awaiting data. This is pretty powerful and can be used instead of the epoll_wait() syscall.

Furthermore, normally dealing with epoll requires juggling epoll_ctl syscalls. Application developers go to great lengths to avoid calling this syscall too often. Just read the man page on EPOLLONESHOT and EPOLLET flags. Using io_submit for polling works around this whole complexity, and doesn't require any spurious syscalls. Just push your sockets to the iocb request vector, call io_submit exactly once and wait for completions. The API can't be simpler than this.

Summary

In this blog post we reviewed the Linux AIO API. While initially conceived to be a disk-only API, it seems to be working in the same way as normal read/write syscalls on network sockets. But as opposed to read/write io_submit allows syscall batching, potentially improving performance.

Since kernel 4.18 io_submit and io_getevents can be used to wait for events like POLLIN and POLLOUT on network sockets. This is great, and could be used as a replacement for epoll() in the event loop.

I can imagine a network server that could just be doing io_submit and io_getevents syscalls, as opposed to the usual mix of read, write, epoll_ctl and epoll_wait. With such design the syscall batching aspect of io_submit could really shine. Such a server would be meaningfully faster.

Sadly, even with recent Linux AIO API improvements, the larger discussion remains. Famously, Linus hates it:

AIO is a horrible ad-hoc design, with the main excuse being "other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it". But AIO was always really really ugly.

Over the years there had been multiple attempts on creating a better batching and async interfaces, unfortunately, lacking coherent vision. For example, recent addition of sendto(MSG_ZEROCOPY) allows for truly async transmission operations, but no batching. io_submit allows batching but not async. It's even worse than that - Linux currently has three ways of delivering async notifications - signals, io_getevents and MSG_ERRQUEUE.

Having said that I'm really excited to see the new developments which allow developing faster network servers. I'm jumping on the code to replace my rusty epoll event loops with io_submit!

LinuxAPIDevelopers