SOCKMAP - TCP splicing of the future
Cloudflare Team · 2019-02-18 · via The Cloudflare Blog

Recently we stumbled upon the holy grail for reverse proxies - a TCP socket splicing API. This caught our attention because, as you may know, we run a global network of reverse proxy services. Proper TCP socket splicing reduces the load on userspace processes and enables more efficient data forwarding. We realized that Linux Kernel's SOCKMAP infrastructure can be reused for this purpose. SOCKMAP is a very promising API and is likely to cause a tectonic shift in the architecture of data-heavy applications like software proxies.

[Image by Mustad Marine, public domain]

But let’s rewind a bit.

Birthing pains of L7 proxies

Transmitting large amounts of data from userspace is inefficient. Linux provides a couple of specialized syscalls that aim to address this problem. For example, the sendfile(2) syscall (which Linus doesn't like) can be used to speed up transferring large files from disk to a socket. Then there is splice(2), which traditional proxies use to forward data between two TCP sockets. Finally, vmsplice(2) can be used to stick a memory buffer into a pipe without copying, but it is very hard to use correctly.

Sadly, sendfile, splice and vmsplice are very specialized, synchronous and solve only one part of the problem - they avoid copying the data to userspace. They leave other efficiency issues unaddressed.

            between                   avoid user-space memory   zerocopy
sendfile    disk file --> socket      yes                       no
splice      pipe <--> socket          yes                       yes?
vmsplice    memory region --> pipe    no                        yes

Processes that forward large amounts of data face three problems:

  1. Syscall cost: making multiple syscalls for every forwarded packet is costly.

  2. Wakeup latency: the user-space process must be woken up often to forward the data. Depending on the scheduler, this may result in poor tail latency.

  3. Copying cost: copying data from kernel to userspace and then immediately back to the kernel is not free and adds up to a measurable cost.

Many tried

Forwarding data between TCP sockets is a common practice. It's needed for:

  • Transparent forward HTTP proxies, like Squid.

  • Reverse caching HTTP proxies, like Varnish or NGINX.

  • Load balancers, like HAProxy, Pen or Relayd.

Over the years there have been many attempts to reduce the cost of dumb data forwarding between TCP sockets on Linux. This issue is generally called “TCP splicing”, “L7 splicing”, or “Socket splicing”.

Let’s compare the usual ways of doing TCP splicing. To simplify the problem, instead of writing a rich Layer 7 TCP proxy, we'll write a trivial TCP echo server.

It's not a joke. An echo server can illustrate TCP socket splicing well. You know - "echo" basically splices the socket… with itself!

Naive: read write loop

The naive TCP echo server would look like:

while True:
    data = read(sd, 4096)
    if not data:        # empty read: the peer closed the connection
        break
    writeall(sd, data)

Nothing simpler. On a blocking socket this is a totally valid program, and will work just fine. For completeness, I prepared full code here.
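The loop above can be fleshed out into a runnable Python sketch. The names `echo_loop` and `demo` and the 4096-byte buffer are illustrative choices, not taken from the linked full code, and `socket.socketpair()` stands in for an accepted TCP connection so the sketch runs without a network.

```python
import socket
import threading


def echo_loop(sd: socket.socket, bufsize: int = 4096) -> int:
    """Echo everything received on sd back to sd; return bytes echoed."""
    total = 0
    while True:
        data = sd.recv(bufsize)   # syscall: copy kernel -> userspace
        if not data:              # empty read means the peer closed
            break
        sd.sendall(data)          # syscall(s): copy userspace -> kernel
        total += len(data)
    return total


def demo() -> bytes:
    """Run echo_loop on one end of a socketpair and talk to it from the other."""
    server, client = socket.socketpair()
    worker = threading.Thread(target=echo_loop, args=(server,))
    worker.start()
    msg = b"hello sockmap"
    client.sendall(msg)
    reply = b""
    while len(reply) < len(msg):
        reply += client.recv(4096)
    client.close()                # EOF makes echo_loop return
    worker.join()
    server.close()
    return reply
```

Every forwarded chunk costs one read and at least one write, and the payload crosses the kernel boundary twice, which is exactly the overhead the later approaches try to remove.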

Splice: specialized syscall

Linux has an amazing splice(2) syscall. It can tell the kernel to move data between a TCP buffer on a socket and a buffer on a pipe. The data remains in the buffers, on the kernel side. This solves the problem of needlessly having to copy the data between userspace and kernel-space. With the SPLICE_F_MOVE flag the kernel may be able to avoid copying the data at all!

Our program using splice() looks like:

pipe_rd, pipe_wr = pipe()
fcntl(pipe_rd, F_SETPIPE_SZ, 4096)

while True:
    n = splice(sd, pipe_wr, 4096)   # socket -> pipe
    if n == 0:                      # EOF
        break
    splice(pipe_rd, sd, n)          # pipe -> socket

We still need to wake up the userspace program and make two syscalls to forward any piece of data, but at least we avoid all the copying. Full source.
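For readers who want to try the same idea without writing C, here is a minimal sketch assuming Linux and Python 3.10 or newer, where `os.splice` wraps the splice(2) syscall. The function names, the default pipe size and the loopback demo are assumptions made for illustration and are not the post's linked source.

```python
import os
import socket
import threading


def splice_echo(sd: socket.socket, chunk: int = 4096) -> int:
    """Echo sd to itself through a pipe using splice(2); return bytes echoed."""
    pipe_rd, pipe_wr = os.pipe()
    total = 0
    try:
        while True:
            # socket receive buffer -> pipe, no copy into userspace
            n = os.splice(sd.fileno(), pipe_wr, chunk)
            if n == 0:            # EOF: the peer closed its sending side
                break
            total += n
            # pipe -> socket send buffer; loop because splice may move less
            while n > 0:
                n -= os.splice(pipe_rd, sd.fileno(), n)
    finally:
        os.close(pipe_rd)
        os.close(pipe_wr)
    return total


def demo() -> bytes:
    """Echo 10000 bytes over a loopback TCP connection."""
    listener = socket.create_server(("127.0.0.1", 0))
    client = socket.create_connection(listener.getsockname())
    server, _ = listener.accept()
    worker = threading.Thread(target=splice_echo, args=(server,))
    worker.start()
    msg = b"x" * 10000
    client.sendall(msg)
    reply = b""
    while len(reply) < len(msg):
        reply += client.recv(65536)
    client.close()                # EOF makes splice_echo return
    worker.join()
    server.close()
    listener.close()
    return reply
```

The payload never enters the Python process, but the process is still woken up and still issues two syscalls for every chunk.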

io_submit: Using Linux AIO API

In a previous blog post about io_submit() we proposed using the AIO interface with network sockets. Read the blog post for details, but here is the prepared program that has the echo server loop implemented with only a single syscall.

[Image by jrsnchzhrs, CC BY-ND 2.0]

SOCKMAP: The ultimate weapon

In recent years the Linux kernel introduced an eBPF virtual machine. With it, user-space programs can run specialized, non-Turing-complete bytecode in the kernel context. Nowadays, eBPF programs can be attached for dozens of use cases, ranging from packet filtering to policy enforcement.

Since kernel 4.14, Linux has had new eBPF machinery that can be used for socket splicing: SOCKMAP. It was created by John Fastabend at Cilium.io, exposing the Strparser interface to eBPF programs. Cilium uses SOCKMAP for Layer 7 policy enforcement, and all the logic it uses is embedded in an eBPF program. The API is not well documented, requires root and, in our experience, is slightly buggy. But it's very promising.

This is how to use SOCKMAP. SOCKMAP, or specifically "BPF_MAP_TYPE_SOCKMAP", is a type of eBPF map. This map is an "array": its indices are integers. All this is pretty standard. The magic is in the map values, which must be TCP socket descriptors.

This map is very special: it has two eBPF programs attached to it. You read that right: the eBPF programs are attached to a map, not to a socket, cgroup or network interface as usual. This is how you would set up SOCKMAP in a user program:

sock_map = bpf_create_map(BPF_MAP_TYPE_SOCKMAP, sizeof(int), sizeof(int), 2, 0)

prog_parser = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
prog_verdict = bpf_load_program(BPF_PROG_TYPE_SK_SKB, ...)
bpf_prog_attach(prog_parser, sock_map, BPF_SK_SKB_STREAM_PARSER)
bpf_prog_attach(prog_verdict, sock_map, BPF_SK_SKB_STREAM_VERDICT)

Ta-da! At this point we have an established sock_map eBPF map, with two eBPF programs attached: parser and verdict. The next step is to add a TCP socket descriptor to this map. Nothing simpler:

int idx = 0;
int val = sd;
bpf_map_update_elem(sock_map, &idx, &val, BPF_ANY);

At this point the magic happens. From now on, each time our socket sd receives a packet, prog_parser and prog_verdict are called. Their semantics are described in the strparser.txt and the introductory SOCKMAP commit. For simplicity, our trivial echo server only needs the minimal stubs. This is the eBPF code:

SEC("prog_parser")
int _prog_parser(struct __sk_buff *skb)
{
	return skb->len;
}

SEC("prog_verdict")
int _prog_verdict(struct __sk_buff *skb)
{
	uint32_t idx = 0;
	return bpf_sk_redirect_map(skb, &sock_map, idx, 0);
}

Side note: for the purposes of this test program, I wrote a minimal eBPF loader. It has no dependencies (neither bcc, libelf, nor libbpf) and can do basic relocations (like resolving the sock_map symbol mentioned above). See the code.

The call to bpf_sk_redirect_map does all the work. It tells the kernel: for the received packet, please oh please redirect it from the receive queue of some socket to the transmit queue of the socket living in sock_map under index 0. In our case, both are the same socket! Here we achieved exactly what the echo server is supposed to do, but purely in eBPF.

This technology has multiple benefits. First, the data is never copied to userspace. Secondly, we never need to wake up the userspace program. All the action is done in the kernel. Quite cool, isn't it?

We need one more piece of code, to hang the userspace program until the socket is closed. This is best done with good old poll(2):

/* Wait for the socket to close. Let SOCKMAP do the magic. */
struct pollfd fds[1] = {
    {.fd = sd, .events = POLLRDHUP},
};
poll(fds, 1, -1);

Full code.

The benchmarks

At this stage we have presented four simple TCP echo servers:

  • naive read-write loop

  • splice

  • io_submit

  • SOCKMAP

To recap, we are measuring the cost of three things:

  1. Syscall cost

  2. Wakeup latency, mostly visible as tail latency

  3. The cost of copying data

Theoretically, SOCKMAP should beat all the others:

                   syscall cost   waking up userspace   copying cost
read write loop    2 syscalls     yes                   2 copies
splice             2 syscalls     yes                   0 copy (?)
io_submit          1 syscall      yes                   2 copies
SOCKMAP            none           no                    0 copies

Show me the numbers

This is the part of the post where I show you the breathtaking numbers that clearly separate the different approaches. Sadly, benchmarking is hard, and well... SOCKMAP turned out to be the slowest. It's important to publish negative results, so here they are.

Our test rig was as follows:

  • Two bare-metal Xeon servers connected with a 25Gbps network.

  • Both have turbo-boost disabled, and the testing programs are CPU-pinned.

  • For better locality we localized RX and TX queues to one IRQ/CPU each.

  • The testing server runs a script that sends 10k batches of fixed-sized blocks of data. The script measures how long it takes for the echo server to return the traffic.

  • We do 10 separate runs for each measured echo-server program.

  • TCP: "cubic" and NONAGLE=1.

  • Both servers run the 4.14 kernel.
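The measurement script itself is not shown in the post. A minimal sketch of such a client might look like the following. The names `measure_echo`, `recv_exact` and `echo_forever` are hypothetical, and the in-process echo thread on a socketpair only stands in for the real rig, which used two machines.

```python
import socket
import threading
import time


def recv_exact(sd: socket.socket, size: int) -> bytes:
    """Read exactly size bytes from sd (fewer only if the peer closes early)."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sd.recv(min(65536, size - len(buf)))
        if not chunk:
            break
        buf += chunk
    return bytes(buf)


def measure_echo(sd: socket.socket, block_size: int, rounds: int) -> list:
    """Send `rounds` blocks of `block_size` bytes; return echo times in microseconds."""
    payload = b"\x00" * block_size
    timings = []
    for _ in range(rounds):
        start = time.perf_counter()
        # Send from a helper thread so a large block cannot deadlock against
        # the echo coming back while we are still writing.
        sender = threading.Thread(target=sd.sendall, args=(payload,))
        sender.start()
        echoed = recv_exact(sd, block_size)
        sender.join()
        timings.append((time.perf_counter() - start) * 1e6)
        if len(echoed) != block_size:
            raise ConnectionError("echo server closed the connection early")
    return timings


def echo_forever(sd: socket.socket) -> None:
    """Naive read/write echo used here as the system under test."""
    while True:
        data = sd.recv(65536)
        if not data:
            return
        sd.sendall(data)


def demo(block_size: int = 256 * 1024, rounds: int = 5) -> list:
    """Time a few echo rounds against an in-process echo thread."""
    server, client = socket.socketpair()
    worker = threading.Thread(target=echo_forever, args=(server,))
    worker.start()
    timings = measure_echo(client, block_size, rounds)
    client.close()
    worker.join()
    server.close()
    return timings
```

Each timing covers one full block going out and coming back, which is the quantity the charts in this post report.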

Our analysis of the experimental data identified some outliers. We think some of the worst times, manifested as long echo replies, were caused by unrelated factors such as network packet loss. In the charts presented we, perhaps controversially, skip the worst 1% of results in order to focus on what we think is the important data.
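Dropping the slowest 1% of samples before summarizing can be sketched as below. The helper names `trim_worst` and `summarize` are made up for this illustration, and the post does not say exactly how its own cut was implemented.

```python
import statistics


def trim_worst(samples, percent=1):
    """Return samples sorted ascending with the slowest `percent`% removed."""
    ordered = sorted(samples)
    drop = len(ordered) * percent // 100   # number of slowest samples to discard
    return ordered[:len(ordered) - drop]


def summarize(samples):
    """Min / median / mean / max of a list of latencies in microseconds."""
    return {
        "min": min(samples),
        "med": statistics.median(samples),
        "avg": statistics.fmean(samples),
        "max": max(samples),
    }
```

Calling `summarize(trim_worst(latencies))` next to `summarize(latencies)` shows how much a handful of stalled runs can move the mean and max while leaving the median almost untouched.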

Furthermore, we spotted a bug in SOCKMAP. Some of the runs were delayed by up to a whopping 64ms. Here is one of the tests:

Values min:236.00 avg:669.28 med=390.00 max:78039.00 dev:3267.75 count:2000000
Values:
 value |-------------------------------------------------- count
     1 |                                                   0
     2 |                                                   0
     4 |                                                   0
     8 |                                                   0
    16 |                                                   0
    32 |                                                   0
    64 |                                                   0
   128 |                                                   0
   256 |                                                   3531
   512 |************************************************** 1756052
  1024 |                                             ***** 208226
  2048 |                                                   18589
  4096 |                                                   2006
  8192 |                                                   9
 16384 |                                                   1
 32768 |                                                   0
 65536 |                                                   11585
131072 |                                                   1

The great majority of the echo runs (of 128KiB in this case) were finished in the 512us band, while a small fraction stalled for 65ms. This is pretty bad and makes comparison of SOCKMAP to other implementations pretty meaningless. This is a second reason why we are skipping 1% of worst results from all the runs - it makes SOCKMAP numbers way more usable. Sorry.
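The histogram above groups samples into power-of-two bands. A sketch of that bucketing follows, assuming each row label is the band's inclusive upper bound in microseconds. That reading is inferred from the reported minimum of 236 landing in the 256 row and the maximum of 78039 in the 131072 row. The function names are hypothetical.

```python
from collections import Counter


def bucket_for(value_us: float) -> int:
    """Smallest power of two that is >= value_us (the lowest bucket is 1)."""
    bucket = 1
    while bucket < value_us:
        bucket *= 2
    return bucket


def log2_histogram(samples):
    """Count samples per power-of-two bucket, keyed by the bucket's upper bound."""
    return Counter(bucket_for(s) for s in samples)
```

Under this reading the 65536 row holds every run that took between roughly 32.8ms and 65.5ms, so the 11585 stalled runs share a band rather than a single exact duration.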

2MiB blocks - throughput

The fastest of our programs was doing ~15Gbps over one flow, which seems to be a hardware limit. This is very visible in the first iteration, which shows the throughput of our echo programs.

This test shows the time to transmit and receive 2MiB blocks of data via our tested echo server. We repeat this 10k times, and run the test 10 times. After stripping the worst 1% of the numbers we get the following latency distribution:

[Chart: latency distribution for 2MiB blocks]

This chart shows that both the naive read+write and io_submit programs were able to achieve a 1500us mean round trip time when echoing 2MiB blocks.
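To relate the 1500us figure to link speed, a back-of-the-envelope conversion helps. The sketch below assumes the measured time covers the whole block being sent and echoed back, so the same payload crosses the wire once in each direction within that window. The function name is made up for this example and the result is an estimate, not a number from the benchmark.

```python
def throughput_gbps(block_bytes: int, round_trip_us: float) -> float:
    """Payload bits moved per second in one direction, expressed in Gbps."""
    bits = block_bytes * 8
    seconds = round_trip_us / 1e6
    return bits / seconds / 1e9


# 2MiB echoed in 1500us
estimate = throughput_gbps(2 * 1024 * 1024, 1500)
```

This works out to a little over 11 Gbps per direction for the mean, which fits with the fastest runs approaching the ~15Gbps mentioned earlier.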

Here we clearly see that splice and SOCKMAP are slower than the others. They were CPU-bound and unable to reach the line rate. We have raised the unusual splice performance problems in the past, but perhaps we should debug it one more time.

For each server we ran the tests twice: without and with the SO_BUSY_POLL setting. This setting should remove the "wakeup latency" and greatly reduce the jitter. The results show that the naive and io_submit tests are almost identical. This is perfect! BUSYPOLL does indeed reduce the deviation and latency, at the cost of more CPU usage. Notice that neither splice nor SOCKMAP is affected by this setting.

16KiB blocks - wakeup time

Our second run of tests was with much smaller data sizes, sending tiny 16KiB blocks at a time. This test should illustrate the "wakeup time" of the tested programs.

[Chart: latency distribution for 16KiB blocks]

In this test the non-BUSYPOLL runs of all the programs look quite similar (min and max values), with SOCKMAP being the exception. This is great: we can speculate that the wakeup time is comparable. Surprisingly, splice has a slightly better median time than the others. Perhaps this can be explained by CPU artifacts, like better CPU cache locality due to less data copying. SOCKMAP is, again, the slowest, with the worst max and median times. Boo.

Remember we truncated the worst 1% of the data - we artificially shortened the "max" values.

TL;DR

In this blog post we discussed the theoretical benefits of SOCKMAP. Sadly, we noticed it's not ready for prime time yet. We compared it against splice, which we noticed didn't benefit from BUSYPOLL and had disappointing performance. We noticed that the naive read/write loop and io_submit approaches have almost exactly the same performance characteristics and do benefit from BUSYPOLL to reduce jitter (wakeup time).

If you are piping data between TCP sockets, you should definitely take a look at SOCKMAP. While our benchmarks show it's not ready for prime time yet, with poor performance, high jitter and a couple of bugs, it's very promising. We are very excited about it. It's the first technology on Linux that truly allows the user-space process to offload TCP splicing to the kernel. It also has potential for much better performance than other approaches, ticking all the boxes of being async, kernel-only and totally avoiding needless copying of data.

This is not everything. SOCKMAP is able to pipe data across multiple sockets: you can imagine a full mesh of connections being able to send data to each other. Furthermore, it exposes the strparser API, which can be used to offload basic application framing. Combined with kTLS, it can also provide transparent encryption. There are even rumors of adding UDP support. The possibilities are endless.

Recently the kernel has been exploding with eBPF innovations. It seems like we've only just scratched the surface of the possibilities exposed by the modern eBPF interfaces.

Many thanks to Jakub Sitnicki for suggesting SOCKMAP in the first place, writing the proof of concept and now actually fixing the bugs we found. Go strong Warsaw office!
