惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

WordPress大学
WordPress大学
阮一峰的网络日志
阮一峰的网络日志
J
Java Code Geeks
宝玉的分享
宝玉的分享
C
CXSECURITY Database RSS Feed - CXSecurity.com
P
Privacy International News Feed
The Register - Security
The Register - Security
T
Threat Research - Cisco Blogs
Recent Commits to openclaw:main
Recent Commits to openclaw:main
PCI Perspectives
PCI Perspectives
Hugging Face - Blog
Hugging Face - Blog
T
Tailwind CSS Blog
酷 壳 – CoolShell
酷 壳 – CoolShell
N
News | PayPal Newsroom
Google Online Security Blog
Google Online Security Blog
aimingoo的专栏
aimingoo的专栏
F
Full Disclosure
P
Palo Alto Networks Blog
A
About on SuperTechFans
Microsoft Azure Blog
Microsoft Azure Blog
F
Fortinet All Blogs
爱范儿
爱范儿
Recorded Future
Recorded Future
月光博客
月光博客
T
True Tiger Recordings
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
T
Tenable Blog
L
Lohrmann on Cybersecurity
博客园 - 聂微东
cs.CV updates on arXiv.org
cs.CV updates on arXiv.org
大猫的无限游戏
大猫的无限游戏
S
Security @ Cisco Blogs
CTFtime.org: upcoming CTF events
CTFtime.org: upcoming CTF events
L
LINUX DO - 热门话题
Hacker News: Ask HN
Hacker News: Ask HN
C
Check Point Blog
H
Hackread – Cybersecurity News, Data Breaches, AI and More
L
LangChain Blog
The Cloudflare Blog
Malwarebytes
Malwarebytes
Threat Intelligence Blog | Flashpoint
Threat Intelligence Blog | Flashpoint
I
InfoQ
N
Netflix TechBlog - Medium
Recent Announcements
Recent Announcements
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
IntelliJ IDEA : IntelliJ IDEA – the Leading IDE for Professional Development in Java and Kotlin | The JetBrains Blog
SecWiki News
SecWiki News
云风的 BLOG
云风的 BLOG
T
ThreatConnect
博客园 - 叶小钗
B
Blog

DEV Community

Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate I reproduced a Claude Code RCE. The bug pattern is everywhere. Jenkins CI/CD Pipeline for a Dockerized Node.js Application: Manual Trigger vs Automatic Trigger Using GitHub Webhooks How to Stream Live Forex Rates to Google Sheets API: A Complete Guide Small Models Will Beat Giant Models (And Most People Haven’t Realized Why Yet) How I Built 5 Linux Automation Scripts on AWS EC2 I built TokenPatch to measure AI coding cost per applied patch I built a Chrome extension to stop squinting at the web Producer audit clean, six tests red Conversa — A Multi-Agent AI Platform Powered by Gemma 4 Build a Real Agent in 15 Minutes with Gemini's New Managed Agents API What I Actually Build: AI Systems That Ship, Not Demos That Impress The Box Ticked While You Read This: LinkedIn, AI Training, and the Switch You Did Not Flip Investasi Masa Depan: Mengintip Fasilitas Laboratorium Komputer Kelas Dunia di Yogyakarta I Cancelled My $20 Claude Cowork Plan After a Week With OpenWork Stop Reviewing Every Line of AI Code - Build the Trust Stack Instead How To Build an Image Cropper in Browser (Simple Steps) I built a macOS disk cleaner for developers and just launched it would love feedback Membangun Kompetensi dan Relasi: Mengapa Ekosistem Kampus Itu Penting I Built an AI That Decides Which AI to Talk To — Running 24/7 From My Living Room Codex Team Usage SOP How to Actually Become a Programmer: The Hard Part Nobody Wants to Explain Building a Production-Style Multi-Tool AI Agent with Python, Flask, React & Gemini AI The Caretaker Sandbox: An Offline-First Visual Playground & Template Engine powered by Gemma 4 # Building Instagram OSINT Projects with HikerAPI Your AI can read. Gemma 4 can see The Battle of the Senior Dev: Why AI Gives You Wings But Only If You're Ready to Pilot HiDream Raw Output Failed Tried Dev-2604 VRAM Math Killed It Won with a Prompt Enhancer Instead I Finally Finished a Project I Abandoned — And GitHub Copilot Helped Me Ship It SafeSMS: On-Device Threat Detection with Gemma 4 E4B, no internet required I Built OpenKap — A Loom Alternative for Small Teams Who Just Want to Ship Gemma 4 is Here: The Dawn of Local Multimodal Reasoning Offline-First Flutter: How We Built a CRM That Manages 100K+ Leads With No Internet Memory for Agents: When Vectors Meet Graphs, Bugs Drop 4 The Rise of Production-Grade AI Infrastructure I ran my idea-validation product through its own validator. The verdict was PIVOT. We Built an Agent Commerce API. Google I/O 2026 Changed Our 3-Month Roadmap in 24 Hours. "My Partner's Memory Was Full. I Didn't Know — Until We Tried to Talk." I’m a Front End Web Developer Learning Machine Learning From Scratch Laravel Waiting Request I Built a Chrome Extension to Track How Long You Actually Spend on Each Tab Why Google Can't See Your React Breadcrumbs (And the 4-Line Fix) AI Travel Assistant Powered by Gemma 4; With Streaming, Image Input, and Visual Recommendation Cards Microsoft tried to kill the printer driver. Healthcare said no. The Blueprint Beneath the Blueprint: Designing Data Model and Choosing Its Database REST APIs vs Webhooks in Telecom Billing - Which One Actually Makes Sense? Accounting Made Simple: AI-Powered Financial Insights of Japanese Companies with Gemma 4 The append-only AST trick that makes Flutter AI chat actually smooth Designing the Future of Payments — Why XML Still Matters in the Age of APIs From Legacy to Live — Reviving XMLPayments with GitHub Copilot Two Weeks Into Learning Solana XMLPayments — The Hidden Backbone of Modern Financial Orchestration AI Agents in Practice — Read from the beginning Reviving My Gemma Agentic Framework: From Prototype to Polished Repo Smart Contracts Demand Better Infrastructure: Building on contract.dev Self-Hosted LLM Tool Calling: Forge and the Build-vs-Buy Decision ORA-00072 오류 원인과 해결 방법 완벽 가이드 OpenWA for CTOs: Self-Hosted WhatsApp Gateway Trade-Offs NotebookLM Automation With notebooklm-py: Useful, But Classify Data First Docker v29.5.x Operator Upgrade Checklist Coding-Agent Instruction Design: The CLAUDE.md File That Prevents Rework When I Finally Realized My Runtime Was Holding Me Back GnokeOps: Host Your Own AI House Party The Death of Static Rate Limiters: Why Your Java Virtual Threads Need BBR-Style Adaptive Concurrency AI Agents in Practice — Part 2: What Makes Something an Agent Stop scattering LLM SDK/API calls across your codebase. Here is the 2-file rule that fixed mine Beyond Prompts: Structuring AI Workflows for Real Frontend Engineering From an Abandoned Hackathon Project to an AI Study Workspace 🚀 Terraform with AI: Build AWS Infra (Cursor + MCP) What If AI Didn’t Need the Internet? 750,000 Chips, 140 Trillion Tokens: The Math Behind DeepSeek's Permanent Price Cut You're Renting Someone Else's Compute — And It's Costing You More Than You Think CSS :has() Selector: The Layout Trick I Wish I Knew 5 Years Ago Five Clusters. Five Lessons. One Production System. Synaptic: A Local-First AI Dev Companion That Remembers How You Think Revolutionizing Edge MedTech: Building a Sovereign Sleep Apnea Companion ("XiHan Snore Coach") with Gemma 4 HDD Eksternal Tiba-Tiba Tidak Bisa Diakses di Windows? Ini Tiga Lapis Fix-nya DMARC p=none vs p=quarantine vs p=reject: what to use and when DSA Application in Real Life: How Git Diff Works: LCS Intuition, Myers Algorithm, and Real Code Changes I solo-built a reputation layer for AI agents on NEAR — and here's what I learned I built an AI faceless video generator in 2 months — here's the stack Diffusion Language Models: How NVIDIA Nemotron-Labs Diffusion Shatters the Autoregressive Speed Ceiling llm-nano-vm v0.8.0 — deterministic FSM runtime for LLM pipelines, now with output validation and per-step timeouts From the Renaissance to the Quantum Dawn: AI, Computation, and the Next Paradigm Shift How I Built a Review Site with 800+ Articles Using AI I Built a Smart Kitchen AI with Gemma 4 That Turns Fridge Photos Into Recipes Why your vulnerability dashboard is lying to you (and how to fix it) From Abandoned Prototype to Smart AI System: Reviving Trafiq AI with GitHub Copilot Why Country/State/City Pickers Are Weirdly Hard Node.js 22 LTS — EOL Date, Support Timeline, and What Comes Next The 7-Layer Memory Architecture Behind Modern AI Agents I Imagined Hermes Agent Running an Entire Smart City — And It Changed How I See AI One backend, four products: why we bet on platform-per-brand AI's tech debt is invisible — even to AI. I solved it at the architecture layer. Why ROAS 300% Can Still Mean Losses — Gross Margin in 5 Ecommerce Verticals You Don’t Need to Try Every AI Tool to Keep Up NovelPilot: A Novel Writing Agent Powered by Gemma 4 BoxAgnts is an Out-Of-The-Box Secure AI Agent ToolBox in a WASM SandBox Gemma 4 deep dive: why a 1.5 GB model scores 37.5% on competition mathematics, how the MoE routing actually works, and which model fits your hardware. Full breakdown inside. BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090
We Replaced Our RAG Pipeline With Persistent KV Cache. Here's What We Found.
Prashanth Ma · 2026-05-23 · via DEV Community

RAG has become the default answer for giving LLMs access to private knowledge. And for good reason — it works. But after running it in production we kept hitting the same wall. Not retrieval accuracy. The operational tax.

Re-embedding on data changes. Chunking drift. Retrieval misses on edge cases. Pipeline failures at 2am. The vector database that needs babysitting.

So we ran an experiment.

The Hypothesis
What if instead of chunking, embedding, and retrieving — we just loaded the full document into the LLM context, cached the KV state persistently, and reused it across every query?

No retrieval step. No embedding pipeline. No vector database. Just the model with full document context, warm and ready.

How It Works
The core idea is simple. When an LLM processes a prompt it generates a key-value attention cache — the internal representation of everything it has read. Normally this cache is transient. It lives in VRAM during the request and disappears after.
We persist it.
The initialization prompt — your document — gets processed once. The resulting KV cache gets stored externally and indexed to that document. Every subsequent query retrieves that cached state and appends the user query. The model never recomputes the document. Ever.

The math:
KV_init = LLM.prefill(document)
KV_store[document_id] = KV_init

# On every query:
KV_full = KV_store[document_id] + LLM.prefill(query)
output = LLM.decode(KV_full)

What We Found

Answer quality improved.
No retrieval misses are possible when the full document is in context. The model has read everything. It doesn't guess which chunks are relevant — it knows the whole document. For complex multi-part questions that span different sections this is a significant improvement over chunked retrieval.

Updates became trivial.

Document changes? Re-run the prefill, store the new KV cache. Minutes not hours. No re-embedding pipeline. No re-indexing. No retrieval regression testing. Just regenerate and deploy.

Operational complexity dropped.

No embedding model to maintain. No vector database to monitor. No chunking strategy to tune. No retrieval quality metrics to track. The surface area for things to break quietly got dramatically smaller.
Latency on warm cache is effectively instant.

When the KV state is already loaded the query just appends and generates. No retrieval hop, no context injection latency.
The Honest Tradeoffs

Context window is the ceiling.

Current limit is around 120k tokens — roughly 200-300 pages. Works well for focused documents. For large corpora you need a routing layer to select the right cache per query. You've pushed the retrieval problem up one level — instead of retrieving chunks you're selecting a cache. Simpler problem but not zero.

Cold cache restore adds latency.

The first query after a cache restore pays a latency cost. For strict SLA requirements this matters. Warm cache is instant. Cold restore depends on your infrastructure.

Initial prefill costs more than embedding.

Running a full forward pass on a large document costs more compute than embedding it. The economics work when query volume is high enough to amortize that cost. Low query, high update frequency — RAG still wins.

Where This Wins

This approach is clearly better when:

You have a focused, structured document — legal contract, compliance policy, product manual, technical spec
Query volume is high relative to update frequency
Full context comprehension matters more than breadth
You want to eliminate pipeline maintenance entirely
Privacy matters — no document chunks sent to embedding APIs

Where RAG Still Wins

Very large document collections where context limits apply
Highly dynamic data that changes multiple times per day
When you genuinely don't know which document is relevant at query time
Low query volume where prefill cost doesn't amortize

What We're Building

We've been running this in production at InferX as part of our Sovereign Endpoints™ infrastructure. The persistent KV cache layer sits on top of our GPU snapshotting architecture — which is what makes the cold cache restore fast enough to be practical.
We're now opening a limited beta for teams who want to test this on real workloads. Particularly interested in legal, compliance, finance, and developer tooling use cases.
If you're running RAG in production and want to run a head-to-head comparison — we'd love to work with you.