We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.

RAG has become the default answer for giving LLMs access to private knowledge. And for good reason — it works. But after running it in production we kept hitting the same wall. Not retrieval accuracy. The operational tax.

Re-embedding on data changes. Chunking drift. Retrieval misses on edge cases. Pipeline failures at 2am. The vector database that needs babysitting.

So we ran an experiment.

The Hypothesis
What if instead of chunking, embedding, and retrieving — we just loaded the full document into the LLM context, cached the KV state persistently, and reused it across every query?

No retrieval step. No embedding pipeline. No vector database. Just the model with full document context, warm and ready.

How It Works
The core idea is simple. When an LLM processes a prompt it generates a key-value attention cache — the internal representation of everything it has read. Normally this cache is transient. It lives in VRAM during the request and disappears after.
We persist it.
The initialization prompt — your document — gets processed once. The resulting KV cache gets stored externally and indexed to that document. Every subsequent query retrieves that cached state and appends the user query. The model never recomputes the document. Ever.

The math:
KV_init = LLM.prefill(document)
KV_store[document_id] = KV_init

# On every query:
KV_full = KV_store[document_id] + LLM.prefill(query)
output = LLM.decode(KV_full)

What We Found

Answer quality improved.
No retrieval misses are possible when the full document is in context. The model has read everything. It doesn't guess which chunks are relevant — it knows the whole document. For complex multi-part questions that span different sections this is a significant improvement over chunked retrieval.

Updates became trivial.

Document changes? Re-run the prefill, store the new KV cache. Minutes not hours. No re-embedding pipeline. No re-indexing. No retrieval regression testing. Just regenerate and deploy.

Operational complexity dropped.

No embedding model to maintain. No vector database to monitor. No chunking strategy to tune. No retrieval quality metrics to track. The surface area for things to break quietly got dramatically smaller.
Latency on warm cache is effectively instant.

When the KV state is already loaded the query just appends and generates. No retrieval hop, no context injection latency.
The Honest Tradeoffs

Context window is the ceiling.

Current limit is around 120k tokens — roughly 200-300 pages. Works well for focused documents. For large corpora you need a routing layer to select the right cache per query. You've pushed the retrieval problem up one level — instead of retrieving chunks you're selecting a cache. Simpler problem but not zero.

Cold cache restore adds latency.

The first query after a cache restore pays a latency cost. For strict SLA requirements this matters. Warm cache is instant. Cold restore depends on your infrastructure.

Initial prefill costs more than embedding.

Running a full forward pass on a large document costs more compute than embedding it. The economics work when query volume is high enough to amortize that cost. Low query, high update frequency — RAG still wins.

Where This Wins

This approach is clearly better when:

You have a focused, structured document — legal contract, compliance policy, product manual, technical spec
Query volume is high relative to update frequency
Full context comprehension matters more than breadth
You want to eliminate pipeline maintenance entirely
Privacy matters — no document chunks sent to embedding APIs

Where RAG Still Wins

Very large document collections where context limits apply
Highly dynamic data that changes multiple times per day
When you genuinely don't know which document is relevant at query time
Low query volume where prefill cost doesn't amortize

What We're Building

We've been running this in production at InferX as part of our Sovereign Endpoints™ infrastructure. The persistent KV cache layer sits on top of our GPU snapshotting architecture — which is what makes the cold cache restore fast enough to be practical.
We're now opening a limited beta for teams who want to test this on real workloads. Particularly interested in legal, compliance, finance, and developer tooling use cases.
If you're running RAG in production and want to run a head-to-head comparison — we'd love to work with you.

推荐订阅源

DEV Community