When the conclusion comes first

A sponsored benchmark is just an experiment with a sponsor. So read it like one.

I read a lot of vendor benchmarks. Most are fine. They are marketing, sure, but the honest ones tell you what they tested, show their work, and let you decide if it maps to your world.

Then there are the other ones. The ones where you can feel the conclusion was written before the test was. You read the setup and realize the whole thing was arranged to arrive at a number that was decided in a meeting months earlier.

A new one landed in March 2026 that is a near-perfect teaching example, so I want to walk through it. Not to dunk on a product, but because the techniques in it show up everywhere, and once you can spot them here you can spot them in any report.

The report is Tolly #226104, published in March 2026 and commissioned by F5. It compares F5's BIG-IP Next for Kubernetes running on an NVIDIA BlueField DPU against three open source load balancers: HAProxy, Envoy, and a third "other open-source solution" that, oddly, never gets named. The pitch is that F5's intelligent AI load balancing crushes the open source field on token throughput, time to first token, and CPU usage in an AI inference cluster.

I want to be fair up front about two things, because they matter.

First, I have no reason to think the numbers are faked. I think the data was measured and reported honestly. That is not where these reports usually go wrong.

Second, the underlying ideas are real. Offloading network work to a DPU genuinely frees up host CPU. Routing requests around busy GPUs is genuinely smarter than routing blindly. Both of those are good engineering. If the report had simply said "our purpose-built, GPU-aware product on dedicated offload hardware beats a general-purpose proxy that was told to ignore GPU load," it would be true. It would also be boring and obvious, which is exactly why the report does not say that.

The problem is not the data. The problem is the experiment.

What the test was actually built to measure

Here is the core of the setup. Before each run, the testers manually loaded 50% of the GPUs with background traffic. That load did not pass through any of the load balancers. It was just dumped onto half the cluster to create a lopsided, half-congested pool.

Then they sent real traffic through each load balancer and measured the result.

F5's product is designed to watch GPU load and steer traffic toward the idle accelerators. The open source proxies were configured with plain round-robin, which sprays requests evenly without caring whether a GPU is already buried.

So the test does exactly one thing: it checks whether a load balancer can detect and avoid pre-loaded GPUs. F5 had its load awareness switched on. Everyone else ran with no load awareness at all. The result was decided the moment they chose the configurations.

This is the part that should bother you, because round-robin is a strawman. No competent engineer runs static round-robin into a pool where half the backends are known to be slammed. HAProxy, Envoy, and the rest all support dynamic, load-aware algorithms and health-based routing that exist precisely for uneven backends. None of that was turned on. The open source tools were handed the dumbest possible config and then measured at the one task that config is guaranteed to fail.

Would smarter configs have closed the gap completely? I honestly do not know, and I am not going to pretend otherwise. Connection-aware routing is not the same thing as GPU-load-aware routing, and F5's product was built specifically for this scenario. The fair comparison, with every tool tuned by someone who actually wanted it to do well, was never run. That is the whole point. The report does not answer the interesting question. It answers a rigged one and prints the rigged answer in 40-point font.
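To make the baseline problem concrete, here is a toy sketch of what static round-robin does in a half-loaded pool compared with any picker that can see backend load. Everything in it is an assumption for illustration: the pool size, the preload depth, the request count, and the function names are mine, not the report's. It also gives the load-aware picker perfect visibility into each backend's queue, which a connection-counting proxy would not get for load that bypassed it. It is not a model of the Tolly test, only of why the chosen baseline cannot win.

```python
def simulate(pick, n_backends=8, preload=20, n_requests=80):
    """Return the mean number of jobs queued ahead of each routed request.

    Toy model: queues never drain, and half the pool starts with `preload`
    background jobs that did not pass through the balancer.
    All sizes here are hypothetical, chosen only for illustration.
    """
    queues = [preload if i < n_backends // 2 else 0 for i in range(n_backends)]
    waits = []
    for r in range(n_requests):
        i = pick(r, queues)
        waits.append(queues[i])  # jobs already ahead of this request
        queues[i] += 1
    return sum(waits) / len(waits)


def round_robin(r, queues):
    """Static rotation: ignores how busy each backend already is."""
    return r % len(queues)


def least_loaded(r, queues):
    """Load-aware: assumes the balancer can observe every queue depth."""
    return min(range(len(queues)), key=queues.__getitem__)


if __name__ == "__main__":
    print("round-robin mean wait: ", simulate(round_robin))
    print("least-loaded mean wait:", simulate(least_loaded))
```

With these made-up defaults, round-robin averages 14.5 jobs ahead of each request and the load-aware picker averages 9.5, because round-robin keeps sending half of its traffic to backends that started out buried. The size of the gap means nothing outside this toy. The direction of it was fixed by the configuration, which is the point.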

The CPU number is a hardware fact dressed as a software win

The other headline is that F5 used about 2 CPU cores while HAProxy used about 12, out of 16 available. Roughly 80% less CPU.

That gap is real, and it is also not surprising, because F5 was running offloaded to a DPU with its own dedicated ARM cores and HAProxy was running on the host. Of course the host uses less CPU when you move the work to a separate chip. That is what offload hardware is for. The honest version of this claim is "dedicated offload silicon uses fewer host cores than host software," which is true of basically any offload, for any workload, forever.

Attributing that to software efficiency is like winning a footrace because you brought a motorcycle, then writing up your superior running form.
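The arithmetic behind the headline is worth doing by hand. The sketch below uses only the approximate figures quoted above (about 2 cores versus about 12, out of 16) and a hypothetical helper name. Note what it cannot include: the DPU's own ARM cores, which this comparison does not count, so the figure describes host CPU only.

```python
def host_cpu_reduction(baseline_cores, offloaded_cores):
    """Fractional drop in host cores used, relative to the baseline."""
    return (baseline_cores - offloaded_cores) / baseline_cores


HOST_CORES = 16      # cores available on the host, per the figures above
HAPROXY_CORES = 12   # approximate host cores used by HAProxy
F5_HOST_CORES = 2    # approximate host cores used with DPU offload

reduction = host_cpu_reduction(HAPROXY_CORES, F5_HOST_CORES)
haproxy_share = HAPROXY_CORES / HOST_CORES
f5_share = F5_HOST_CORES / HOST_CORES

if __name__ == "__main__":
    print(f"host CPU reduction: {reduction:.0%}")
    print(f"HAProxy share of host: {haproxy_share:.1%}")
    print(f"F5 share of host: {f5_share:.1%}")
```

Ten fewer cores out of twelve is a drop of about 83%, consistent with the rounded "roughly 80%" claim. The work did not disappear. It moved to silicon that the comparison leaves out.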

How to read any report like this

Strip away the specifics and you are looking at an experiment. So judge it the way you would judge any experiment.

Follow the money. Who paid for it, and does the result flatter them? F5 commissioned this. That alone is not damning, since plenty of good research is funded, but it raises the bar for everything else.
Count the changed variables. A clean experiment changes one thing. This one changed the hardware platform, the routing algorithm, and the software maturity all at once, then credited the winner's software. When several variables move together, you cannot attribute the outcome to any single one.
Check the baseline. Was the thing being beaten given a fair shot, or set up to lose? Round-robin into a half-loaded pool is a baseline chosen to fail.
Watch for cherry-picking. The "up to 40 / 61 / 34%" headline quietly picks the worst competitor for each separate metric. Against Envoy, the throughput edge was 21% and the latency edge 17%. And the most jaw-dropping figures, 114% and 406%, came from smaller models that were only ever tested against HAProxy, "due to time constraints."
Try to reproduce it. Could you? An unnamed third competitor, early-access software, and no published configs mean no, you cannot. Real results survive other people repeating them.
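The "count the changed variables" check is mechanical enough to sketch in code. The field names and value labels below are my own paraphrase of the setup described in this post, not terms from the report, and which side counts as early-access is my reading rather than something stated outright. The helper names are hypothetical.

```python
def changed_variables(run_a, run_b):
    """List every setting that differs between two test configurations."""
    keys = run_a.keys() | run_b.keys()
    return sorted(k for k in keys if run_a.get(k) != run_b.get(k))


def sole_cause(run_a, run_b):
    """Return the one changed variable, or None if attribution is unsafe."""
    diff = changed_variables(run_a, run_b)
    return diff[0] if len(diff) == 1 else None


# Illustrative labels only; paraphrased from this post, not from the report.
vendor_run = {
    "platform": "DPU offload",
    "routing": "GPU-load-aware",
    "maturity": "early access",  # assumption about which side this applies to
}
baseline_run = {
    "platform": "host CPU",
    "routing": "round-robin",
    "maturity": "general release",
}

if __name__ == "__main__":
    print(changed_variables(vendor_run, baseline_run))
    print(sole_cause(vendor_run, baseline_run))
```

Three settings differ, so `sole_cause` returns `None`: the outcome cannot be pinned on any one of them. A comparison that changed only the routing algorithm on identical hardware would return `"routing"`, and that is the experiment worth reading.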

Tolly, to its credit, says the quiet part in the fine print. Its own terms note that tests "may have been tailored to reflect performance under ideal conditions." That sentence is doing a lot of work.

The part that actually disappoints me

None of this is illegal, and Tolly is a long-standing testing house that documented its setup well enough that a careful reader can take it apart. I would rather have this report, with its methodology on the page, than a glossy claim with no test behind it at all.

What gets me is that F5 did not need to do this. They are an enormous company with real engineering and a product that, by their own description, has a genuine architectural idea behind it. They could have published a transparent, reproducible, tuned-on-both-sides comparison and let it stand on its merits. Instead they paid for a setup engineered to make the entire open source proxy ecosystem look primitive at a task the test was built to make it fail.

That is the move I find hard to respect. Not competing. Competing is great. It is using scale and a budget to manufacture a lopsided story about volunteer-built and community-built software that cannot commission a rebuttal report of its own.

So read the report. Seriously, it is public at tolly.com/publications/226104. Read the methodology, not the bottom line. Then ask who paid, what changed, and whether you could ever run it yourself. If the answers do not hold up, the big number on the front page does not either.

Whether F5's product is actually better is a real and open question. This report just is not the thing that answers it.

Photo by Myles Bloomfield on Unsplash
