TL;DR: We turned on vLLM's prefix cache for our agent workloads at Nexus Labs and watched TTFT p50 drop from 480ms to 110ms on one tenant and barely move on another (510ms to 505ms). The split wasn't about traffic volume. It was about how each team templated their system prompts.
The setup
Our fine-tuning team serves 14 enterprise agents through a shared inference cluster. Four H100 nodes, vLLM 0.6.x, Qwen2.5-32B as the workhorse model. Traffic is bursty. One customer's nightly workflow can hit 8k requests in twenty minutes while another trickles through 30 calls an hour.
Before turning on prefix caching, TTFT across the cluster sat at 410ms p50, 1.2s p95. Cost wasn't the urgent problem. Latency was, because agents loop. A 400ms TTFT on a 12-step plan turns into 4.8 seconds of dead time before the user sees anything.
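The loop arithmetic is simple enough to write down. A minimal sketch, assuming the steps run strictly one after another and that TTFT is the same on every step (`loop_dead_time_ms` is a name made up for this post):

```python
def loop_dead_time_ms(steps: int, ttft_ms: int) -> int:
    """Total time spent waiting on first tokens across a sequential agent loop."""
    return steps * ttft_ms


# 12-step plan at a 400ms TTFT
print(loop_dead_time_ms(12, 400))  # 4800
```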
What the cache actually does
vLLM's prefix cache keeps KV blocks for tokens it has already processed. If a new request shares a prefix with something in the cache, those blocks get reused instead of recomputed. The unit is a block (16 tokens by default), so reuse happens in whole blocks: a shared prefix is effectively rounded down to the nearest block boundary.
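To make the block rounding concrete, here is a toy model of the matching rule. It is a sketch of the idea, not vLLM's code: `reusable_tokens` is a hypothetical helper, and the token IDs are plain integers standing in for real tokenizer output.

```python
from typing import Sequence


def reusable_tokens(cached: Sequence[int], incoming: Sequence[int],
                    block_size: int = 16) -> int:
    """Count tokens of `incoming` that could reuse cached KV blocks.

    Only whole blocks are reusable, so the shared prefix length is
    rounded down to a multiple of block_size.
    """
    shared = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        shared += 1
    return (shared // block_size) * block_size


cached = list(range(100))
incoming = list(range(50)) + [999] * 50  # first 50 tokens match, the rest differ
print(reusable_tokens(cached, incoming))  # 48
```

Fifty shared tokens buy three full blocks. The two tokens past the last boundary get recomputed.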
If your system prompt is 1,024 tokens and identical across requests, you skip prefill for 1,024 tokens. At Qwen2.5-32B prefill speeds, that's roughly 90 to 110ms saved per call on our hardware.
Where it worked
Tenant A's agent uses a fixed system prompt assembled at deploy time. Same 1,847 tokens for every request, byte-for-byte. After we flipped enable_prefix_caching=True:
- TTFT p50: 480ms → 110ms
- TTFT p95: 1.4s → 280ms
- GPU prefill compute dropped by 38%
Their hit rate ran around 94% steady-state. The 6% misses were cold starts after pod restarts.
Where it didn't
Tenant B's agent rebuilds its system prompt every call. They inject the current timestamp, a session UUID, and a hash of recent tool outputs into the first 200 tokens. Looked stable on paper. In practice, every request had a unique prefix starting at token 47.
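vLLM caches at block granularity, and a block is only reusable if every token before it matches too. One differing token invalidates the block it sits in and every block after it. With the divergence at token 47, only the first two 16-token blocks (32 tokens) were ever reusable. Tenant B's hit rate: 0.3%.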
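One way to picture why a single early token poisons everything downstream is to chain each block's hash to its parent. The sketch below is a simplified model under that assumption, not vLLM's actual hashing code. `block_hashes` and `matching_blocks` are illustrative names, and the 200-token prompts are synthetic.

```python
import hashlib
from typing import List, Sequence


def block_hashes(tokens: Sequence[int], block_size: int = 16) -> List[str]:
    """Hash each full block together with its parent's hash."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(parent)
    return hashes


def matching_blocks(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading block hashes the two prompts share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


prompt_a = list(range(200))
prompt_b = list(prompt_a)
prompt_b[46] = -1  # the 47th token differs, everything else is identical

print(matching_blocks(block_hashes(prompt_a), block_hashes(prompt_b)))  # 2
```

Tokens 48 through 199 are identical in both prompts, yet none of their blocks match, because each hash carries the damage forward from the block that contains the changed token.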
We didn't catch this in staging because our staging traffic replays canned prompts. The diff between tenants only showed up under real traffic.
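A check like the following, run over a sample of real production prompts, would have flagged the problem before rollout. It is a sketch with assumptions: `first_divergence` is a hypothetical helper, and the inputs are token ID lists that you would get from your own tokenizer.

```python
from typing import Optional, Sequence


def first_divergence(prompts: Sequence[Sequence[int]]) -> Optional[int]:
    """Index of the first position where any prompt disagrees with the first one.

    Returns None if fewer than two prompts are given, or if all prompts
    agree over the length of the shortest one.
    """
    if len(prompts) < 2:
        return None
    shortest = min(len(p) for p in prompts)
    reference = prompts[0]
    for i in range(shortest):
        if any(p[i] != reference[i] for p in prompts[1:]):
            return i
    return None


base = list(range(300))
req_1 = list(base)
req_2 = list(base)
req_2[46] = -1   # e.g. a timestamp token
req_3 = list(base)
req_3[46] = -2
req_3[100] = -3  # a later volatile field

print(first_divergence([req_1, req_2, req_3]))  # 46
```

If the reported index is small relative to the system prompt length, the cache has almost nothing to reuse for that tenant.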
The fix for Tenant B
I talked their team into pushing the volatile fields to the end of the prompt. Took two hours of refactoring on their side. After:
- TTFT p50: 510ms → 145ms
- Hit rate: 0.3% → 87%
Then they asked why nobody mentioned this in the vLLM docs. The docs do mention it. Nobody reads docs when defaults already look fine on the neighboring tenant.
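The refactor boils down to ordering. Here is a minimal sketch of the shape, not Tenant B's actual template: `RequestContext`, `build_prompt`, and the field labels are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Per-request fields that change on every call (hypothetical names)."""
    timestamp: str
    session_id: str
    tool_output_hash: str


def build_prompt(static_instructions: str, ctx: RequestContext) -> str:
    """Put the stable instructions first and the volatile fields last.

    Everything before the volatile tail is byte-identical across requests,
    so its KV blocks stay reusable.
    """
    volatile_tail = (
        f"Current time: {ctx.timestamp}\n"
        f"Session: {ctx.session_id}\n"
        f"Recent tool output hash: {ctx.tool_output_hash}\n"
    )
    return static_instructions + "\n\n" + volatile_tail


STATIC = "You are a workflow agent. Follow the tool policy exactly."
p1 = build_prompt(STATIC, RequestContext("2024-01-01T00:00:00Z", "sess-1", "abc123"))
p2 = build_prompt(STATIC, RequestContext("2024-01-01T00:00:05Z", "sess-2", "def456"))

print(p1.startswith(STATIC) and p2.startswith(STATIC))  # True
```

The same rule applies to anything else that varies per call, such as retrieved documents or user names: stable content up front, volatile content as late as the prompt format allows.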
Config
# vllm serve flags we landed on
--model Qwen/Qwen2.5-32B-Instruct
--enable-prefix-caching
--block-size 16
--gpu-memory-utilization 0.92
--max-num-seqs 256
--swap-space 16
--preemption-mode recompute
--preemption-mode recompute matters under memory pressure. We tried swap and watched the cache thrash when bursts hit. Recompute throws cache blocks away cleanly instead of evicting them to CPU and back.
Comparison
| Workload | Prompt structure | Hit rate | TTFT p50 before | TTFT p50 after |
|---|---|---|---|---|
| Tenant A (fixed) | Static 1,847-token prefix | 94% | 480ms | 110ms |
| Tenant B (before fix) | Volatile fields at token 47 | 0.3% | 510ms | 505ms |
| Tenant B (after fix) | Volatile fields moved to tail | 87% | 510ms | 145ms |
| Internal eval pipeline | Per-eval unique prompts | 4% | 390ms | 380ms |
The eval pipeline row is honest. Prefix caching does nothing for workloads where every prompt is genuinely unique. We left it on anyway because the overhead is negligible.
For routing across providers when we burst beyond self-hosted capacity, we run a small gateway in front (Bifrost is what we landed on, but the principle works with any of them). The local cache only helps for traffic that lands back on our own nodes, not the failover path.
Trade-offs and limitations
The cache costs GPU memory. We reserved roughly 14% of HBM for cached blocks at our max-num-seqs setting. That's memory we can't use for batch concurrency. Worth it for us because TTFT mattered more than throughput. Not worth it if you're optimizing for tokens-per-second on offline batch.
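For a rough sense of what a memory budget buys in cached tokens, the usual back-of-envelope formula is two tensors (key and value) per layer per KV head. The sketch below rests on several assumptions. The model shape (64 layers, 8 KV heads, head dimension 128, 2-byte dtype) is what we believe the published Qwen2.5-32B config says, so check it against your own config.json. The 10 GiB budget is a made-up round number, not our measured reservation. Tensor parallelism and allocator overhead are ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = key + value)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_in_budget(budget_bytes: int, bytes_per_token: int,
                     block_size: int = 16) -> int:
    """Tokens that fit in the budget, counted in whole blocks."""
    blocks = budget_bytes // (bytes_per_token * block_size)
    return blocks * block_size


# Assumed model shape, see the caveats above.
per_token = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
print(per_token)  # 262144

# Hypothetical 10 GiB budget for cached blocks.
print(tokens_in_budget(10 * 2**30, per_token))  # 40960
```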
Cache invalidation is binary at block boundaries. A one-token change at position 0 kills the whole prefix. No fuzzy matching. Semantic-caching products exist for that, but they're a different beast. They cache responses, not KV state, and the failure modes differ.
The cache is per-node. We have four nodes behind a round-robin LB, so the same prompt hits a cold cache 75% of the time on first contact. We considered sticky routing by prompt hash. Decided the complexity wasn't worth a 200ms improvement on first-contact latency. Maybe later.
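For reference, the sticky routing we shelved would look something like this. It is a sketch of the idea, not something we run: `pick_node` and the 2,048-character prefix window are arbitrary choices for illustration.

```python
import hashlib


def pick_node(prompt: str, num_nodes: int, prefix_chars: int = 2048) -> int:
    """Map a prompt to a node index using a hash of its leading characters.

    Requests that share the same first `prefix_chars` characters land on
    the same node, so they hit the same warm cache.
    """
    digest = hashlib.sha256(prompt[:prefix_chars].encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


system_prompt = "x" * 3000  # stand-in for a long static system prompt
node_a = pick_node(system_prompt + " user question one", 4)
node_b = pick_node(system_prompt + " user question two", 4)
print(node_a == node_b)  # True
```

Two things make this less tidy in practice. Plain modulo reshuffles most prompts whenever the node count changes, so a real deployment would likely want consistent hashing. And a tenant with one hot prompt pins all its traffic to a single node, which is exactly the load imbalance round-robin exists to avoid.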
The model is the easy part. Knowing where your tokens go is the hard part.