Why production RAG systems give confident, wrong answers at scale
Monica White · 2026-05-19 · via The New Stack | DevOps, Open Source, and Cloud Native News

In production RAG systems, the biggest bottleneck usually isn’t the LLM. It’s retrieval.

Most teams start with a simple pattern: Encode the query, retrieve a handful of documents from a vector database, pass them to the model, and generate an answer. With small, well-organized datasets, this feels almost magical. The right document is usually among the top results. Context is clean. The system appears fast, accurate, and reliable.

But this success is an illusion.

As data grows, a few hundred documents become millions with messy metadata, duplicate versions, access controls, and ambiguous language, all under real latency constraints. The probability that the right document appears in the top results drops sharply. Retrieval quality degrades quietly, long before anyone notices.

The system still produces answers. But now the model is working with incomplete or irrelevant context. It compensates by filling in the gaps. Responses remain fluent and confident, but increasingly wrong. What looked like intelligence starts to feel unreliable.

At that point, teams often blame embeddings, prompts, or model size. But the failure happens earlier. RAG systems rarely break because the model is weak. They break because retrieval architectures designed for tidy demos collapse under production scale.

The problem is not intelligence. It is recall.

The retrieval gap

Imagine a company building an internal knowledge assistant for ten thousand employees. The system must search ten million documents: financial memos, technical specs, project plans, and meeting notes. Responses must arrive within 2 seconds. Financial answers must be correct.

An engineer asks:

“What was the final decision on the Helios project budget in Q4, ignoring drafts?”

The system retrieves ten documents. None contains the approved budget memo. Several contain early discussions. The language model produces a confident but incorrect answer.

Nothing is broken. The model behaved exactly as designed. It summarized the context it received. The failure isn’t “LLM hallucination.” The right evidence never made it into the context.

[Figure: workflow diagram showing how a query can cause an LLM to output a wrong answer]

This isn’t an edge case. It’s what happens when retrieval systems built for small datasets meet production scale.

Why retrieval fails at scale

Large corpora behave differently from small ones. Relevant documents are buried deeper in ranking distributions. Metadata matters more. Exact terminology matters more. Permissions and filtering become essential. Latency budgets become strict.

Retrieving only a handful of candidates becomes statistically unreliable. The best document might be ranked 300th by semantic similarity but first by exact keyword match. Or filtered out by metadata. Or overshadowed by drafts.

Once retrieval misses the target, the rest of the pipeline cannot recover. No prompt can fix missing context. No larger model can infer information that was never retrieved.

The architecture that actually scales

Production RAG is not just a smarter prompt or a bigger model. It is a different retrieval architecture.

Instead of fetching a few candidates and hoping one is correct, scalable systems cast a wide net, apply filtering during retrieval, and progressively refine results through multiple ranking stages. Retrieval becomes a unified serving pipeline rather than a chain of disconnected services.

[Figure: workflow diagram showing how scalable systems cast a wider net, receive higher-quality context, and generate a more grounded response]

The system retrieves many candidates quickly, filters early, ranks cheaply, reranks selectively, and sends only the best evidence to the model.

Deep retrieval and progressive ranking

Scalable retrieval works like a funnel. First, gather a large candidate pool using fast approximate methods. Then score all candidates using cheap signals such as lexical relevance and embedding similarity. Finally, apply expensive neural rerankers only to the most promising subset.

This structure controls both quality and cost. Expensive computation is focused where it matters most.

[Figure: workflow diagram showing how scalable retrieval applies expensive neural rerankers to the most promising subset]

Wide recall at the top. Precision at the bottom. Without this funnel, systems face a tradeoff between accuracy and latency. With it, they achieve both.
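The funnel can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: `cheap_score`, `expensive_score`, the stage sizes, and the toy documents are all hypothetical stand-ins for a real approximate retriever, a lexical or embedding scorer, and a neural reranker.

```python
# Toy funnel: wide candidate pool -> cheap scoring -> expensive rerank on a small head.
# Every scorer and document below is a hypothetical stand-in for illustration.

calls = {"expensive": 0}  # counts how often the costly scorer runs


def tokens(text):
    return set(text.lower().split())


def cheap_score(query, doc):
    """Cheap signal: token overlap (stand-in for lexical or embedding similarity)."""
    return len(tokens(query) & tokens(doc))


def expensive_score(query, doc):
    """Costly signal: stand-in for a neural reranker that is aware of document status."""
    calls["expensive"] += 1
    words = tokens(doc)
    bonus = 2 if "approved" in words else 0
    penalty = 2 if "draft" in words else 0
    return cheap_score(query, doc) + bonus - penalty


def funnel(query, docs, n_candidates, n_rerank, n_final):
    # Stage 1: gather a wide pool (any doc sharing a token with the query).
    candidates = [d for d in docs if tokens(query) & tokens(d)][:n_candidates]
    # Stage 2: score every candidate with the cheap signal, keep the best few.
    head = sorted(candidates, key=lambda d: cheap_score(query, d), reverse=True)[:n_rerank]
    # Stage 3: run the expensive scorer only on that subset.
    reranked = sorted(head, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:n_final]


docs = [
    "Helios project budget draft discussion Q3",
    "Helios project budget approved final decision Q4",
    "Cafeteria menu for Q4",
    "Helios project kickoff meeting notes",
    "Atlas project budget approved Q4",
    "Office parking policy",
]
query = "Helios project budget final decision Q4"

top = funnel(query, docs, n_candidates=100, n_rerank=3, n_final=2)
print(top)
print("expensive scorer calls:", calls["expensive"])
```

In this toy run the costly scorer is invoked only for the three documents that survive the cheap stage, regardless of how many candidates entered the top of the funnel.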

The four scaling cliffs

Seen this way, the failure modes become clear:

Cliff #1: Candidate generation is too shallow. The correct document never enters the ranking pipeline.

Cliff #2: Retrieval is fragmented across multiple services. Each network call adds latency and introduces data inconsistency. Scores from different systems are not directly comparable.

Cliff #3: Expensive reranking is applied too broadly. Neural models run on hundreds or thousands of candidates, inflating cost and response time.

Cliff #4: Prompt engineering is used as a substitute for retrieval quality. When context is wrong, output remains wrong.

These are not model problems. They are serving architecture problems.

Building RAG that actually scales

Production RAG requires a different retrieval architecture.

At small scale, retrieval can behave like a loose pipeline of disconnected components. At production scale, that approach breaks down. Retrieval must operate as a unified serving system that maximizes recall, controls latency, and progressively refines results.

The following four principles define what scalable retrieval systems get right.

Principle #1: Treat retrieval as a serving system

The first shift is conceptual: retrieval is not a workflow; it is a serving problem.

Stop thinking in terms of disconnected steps:

embedding service → vector database → filter script → reranker → LLM

Start thinking in terms of a unified system:

retrieval engine (hybrid search + filtering + ranking) → LLM

In production, these components cannot operate in isolation. Hybrid search, metadata filtering, and ranking must execute together, on the same data, within a single query path.

Vector similarity alone is not enough. Real queries depend on semantic understanding, exact keyword matching, structured filters such as time, entity, and permissions, as well as learned ranking signals. These signals need to interact directly, not be stitched together across multiple services.

Systems like Vespa are designed around this idea, executing hybrid retrieval and multi-stage ranking inside a single serving layer. This avoids synchronization issues and eliminates unnecessary network hops.
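Why filtering has to happen inside the query path can be shown without any particular engine's API. The sketch below is a plain-Python illustration with hypothetical names (`Doc`, `post_filter`, `filter_during_retrieval`) and invented scores: it compares a fragmented path that takes the top-K first and filters afterward with a unified path that filters while retrieving.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    score: float   # toy relevance score for a single query (invented values)
    status: str    # "draft" or "approved"


def post_filter(docs, k, status):
    """Fragmented path: take the top-k by score, then filter in a separate step."""
    top_k = sorted(docs, key=lambda d: d.score, reverse=True)[:k]
    return [d for d in top_k if d.status == status]


def filter_during_retrieval(docs, k, status):
    """Unified path: restrict to eligible docs first, then take the top-k among them."""
    eligible = [d for d in docs if d.status == status]
    return sorted(eligible, key=lambda d: d.score, reverse=True)[:k]


# Ten drafts outscore the single approved memo on raw similarity.
corpus = [Doc(f"draft-{i}", 0.9 - i * 0.01, "draft") for i in range(10)]
corpus.append(Doc("approved-memo", 0.5, "approved"))

late = post_filter(corpus, k=5, status="approved")
early = filter_during_retrieval(corpus, k=5, status="approved")

print([d.doc_id for d in late])
print([d.doc_id for d in early])
```

With these toy numbers the post-filtered result is empty because drafts occupy every top-K slot, while filtering during retrieval still surfaces the approved memo.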

The specific platform matters less than the architecture. What matters is that retrieval is integrated rather than fragmented across services, low-latency rather than spread across multiple execution paths, and progressively selective, moving from broad recall to precise ranking.

Once retrieval is treated as a system, the next question becomes: how do you ensure it actually finds the right information?

Principle #2: Hybrid retrieval + large candidate sets

Maximize recall in the candidate generation stage by combining hybrid retrieval with a sufficiently large top-K.

Semantic search captures conceptual similarity, while keyword search captures exact matches. Real-world queries depend on both. For example, a financial approval memo may not be semantically close to “budget decision,” yet it will contain exact project names, dates, and approval language. Pure semantic retrieval can miss it, while pure keyword search can miss related context. Hybrid retrieval combines both signals, significantly increasing coverage.

But the retrieval method alone is not enough. Candidate generation is fundamentally a recall problem, not a precision problem. If the relevant document never enters the candidate set, no downstream ranking model can recover it. That is why top-K should be set intentionally large, especially as corpus size and query ambiguity grow. Hybrid retrieval expands coverage across semantic and lexical signals, while larger candidate sets increase the probability that the correct document survives into later ranking stages. At this stage, recall matters more than precision.

High recall creates the conditions for effective ranking. Without it, the system is operating on an incomplete shortlist. 
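One way to combine the two signals is reciprocal rank fusion (RRF). The article does not say which fusion method to use, so RRF is an assumption here, as are the constant `k=60` (a commonly used default) and the toy rankings. The sketch places the approved memo at rank 300 in the semantic list and rank 1 in the keyword list, then checks where it lands after fusion.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document scores 1 / (k + rank) in every list where it appears.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings for one query.
semantic = [f"d{i}" for i in range(1, 300)] + ["approved_memo"]  # memo at rank 300
lexical = ["approved_memo", "d5", "d9"]                           # memo at rank 1

fused = reciprocal_rank_fusion([semantic, lexical])

print("in semantic top 10:", "approved_memo" in semantic[:10])
print("in fused top 10:", "approved_memo" in fused[:10])
print(fused[:3])
```

The memo only reaches the fused top results because the semantic list was deep enough to include it at all, which is the point of pairing hybrid retrieval with a large top-K. Rank-based fusion also avoids having to compare raw scores from different scorers.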

With a strong candidate set in place, the next challenge becomes efficiency.

Principle #3: Multi-stage ranking matters

Neural rerankers are powerful, but too expensive to run across large candidate sets.

The solution is a multi-stage ranking pipeline. Early stages use fast, lightweight methods to eliminate obvious mismatches, while later stages apply more expensive models, such as cross-encoders or LLM-based rerankers, only to a smaller and higher-quality subset.

This structure balances relevance, latency, and cost. Early filtering reduces unnecessary computation, while expensive ranking models focus only on plausible candidates.

Without staging, systems face a difficult tradeoff: either run expensive models across everything and sacrifice latency, or restrict the candidate set too early and sacrifice recall. Multi-stage ranking removes that tradeoff, allowing systems to maintain large candidate pools while remaining efficient.
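A rough latency calculation makes the tradeoff concrete. Only the 2-second budget comes from the scenario described earlier. The per-document costs are hypothetical round numbers, and the model assumes documents are scored sequentially with no batching or parallelism.

```python
# Back-of-the-envelope rerank latency, in microseconds to keep the arithmetic exact.
# Per-document costs are assumptions for illustration, not measurements.

BUDGET_US = 2_000_000       # 2-second response budget from the scenario
CHEAP_US_PER_DOC = 10       # assumed cost of a lightweight first-stage scorer
RERANK_US_PER_DOC = 5_000   # assumed cost of one cross-encoder call


def brute_force_us(n_candidates):
    """Run the expensive reranker on every candidate."""
    return n_candidates * RERANK_US_PER_DOC


def staged_us(n_candidates, n_rerank):
    """Score everything cheaply, then rerank only the best n_rerank."""
    return n_candidates * CHEAP_US_PER_DOC + n_rerank * RERANK_US_PER_DOC


n_candidates = 1_000
brute = brute_force_us(n_candidates)
staged = staged_us(n_candidates, n_rerank=50)

print(f"brute force: {brute / 1e6:.2f} s, within budget: {brute <= BUDGET_US}")
print(f"staged:      {staged / 1e6:.2f} s, within budget: {staged <= BUDGET_US}")
```

Under these assumptions, reranking all 1,000 candidates takes 5 seconds and blows the budget, while the staged pipeline keeps the same candidate pool and finishes in about a quarter of a second.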

At this point, we have the mechanics. The final principle explains why all of this matters.

Principle #4: Retrieval quality determines system quality

Language models do not verify facts. They synthesize responses from the evidence they are given.

If the retrieved context is precise, answers are precise. If the context is noisy, answers become uncertain. If the context is wrong, the answers are wrong.

This makes retrieval the dominant factor in system performance. Importantly, retrieval quality is not the result of a single decision, but of the entire pipeline: how candidates are generated, how many documents are retrieved, how ranking is staged, and how much irrelevant context ultimately reaches the model. Focusing on any one component in isolation misses the point because these decisions interact.

That is why retrieval should be evaluated as a system. Measure recall during candidate generation, track how recall changes across ranking stages, inspect how much irrelevant context reaches the prompt, and understand where latency is introduced throughout the pipeline.
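Stage-by-stage recall tracking can be as simple as the sketch below. It assumes a labeled query with a known set of relevant document IDs. The stage names, document IDs, and cutoffs are invented for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Hypothetical output of each stage for one labeled query: (ranked doc IDs, cutoff k).
relevant = {"a", "b", "c"}
stages = {
    "candidate generation": (["a", "x", "b", "y", "c", "z"], 6),
    "cheap ranking": (["a", "x", "b", "y"], 4),
    "neural rerank": (["b", "a"], 2),
}

report = {name: recall_at_k(ids, relevant, k) for name, (ids, k) in stages.items()}
for name, value in report.items():
    print(f"{name}: recall@k = {value:.2f}")
```

In this toy run, recall drops at the cheap-ranking cutoff rather than in the reranker, which shows where the pipeline loses the third relevant document. Averaging the same report over a set of labeled queries gives a per-stage view of the whole system.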

When recall is low anywhere in the system, nothing downstream can recover. Improving prompts without improving retrieval is cosmetic optimization. Improving retrieval changes outcomes.

RAG does not fail because language models are limited. It fails because retrieval pipelines are underspecified.

Small systems can tolerate shallow retrieval, fragmented components, and brute-force reranking. Large systems require deep candidate generation, unified serving, and staged computation. Scale forces discipline.

Production systems succeed when retrieval is treated as an end-to-end serving problem built around deep candidate generation, hybrid search, early filtering, progressive ranking, and precise context selection.

When the right evidence reaches the model, correct answers follow naturally.

The gap between a working demo and a reliable production system is not mysterious. It is architectural.
