Building a multi-agent document-search copilot — Part 1: muddy results, and one strategy per query
Rodrigo Diego · 2026-06-24 · via DEV Community


The first version ranked documents badly — and worse, it ranked them badly in a way that looked fine on the architecture diagram. Those are the bugs that get under my skin: every box is green, every arrow points the right way, and the answer is still wrong.

We were building a chat copilot over a regulated document store — the kind where a user types "show me my effective SOPs about equipment cleaning" and expects the right handful of documents back, ranked, with an excerpt and a reason. The v1 design did the obvious thing: run two retrieval lanes in parallel — a structured metadata lane and a semantic content lane — union the hits, rerank the union, render. Clean diagram. Muddy results. We'd open the demo, the pipeline would light up green end to end, and the list that came back was mush: the metadata rows polluted the semantic rank, the relevance scores stopped meaning anything, and there was no clean ordering left to show the user. The architecture was elegant. The experience was not.

This is a two-part story of how that became v2: one strategy per query, never mixed; a router that's a single structured-output call; and a Hybrid path that peeks at the data before it decides how to retrieve. It's an architecture post, so I'll keep it anchored in the specific decisions that actually changed the outcome, not a generic "how to build RAG" walkthrough. Part 1 (this post) is the problem and the first two reframes. Part 2 is the hard case, Hybrid, and the permission model.

🗺️ The series at a glance

This is Part 1 of 2.

Part 1 (this post) — the problem and the first two reframes:

  • 🌫️ v1 fused two parallel lanes into one rerank → muddy. Metadata rows have no text; a reranker scores text; fusing them corrupts the one number the UI depends on.
  • 📞 The router is one Bedrock structured-output call, not three sequential hops. Route + rewritten query + strategy + filters come back together, with a deterministic fallback. The catch: the merged task got too hard for a small model, so it runs on a bigger one.
  • 🎯 v2 picks exactly one shape per query: MetadataOnly, ContentOnly, or Hybrid. The lanes are never unioned or cross-scored.

Part 2 — the hard case and the safety model:

  • ⚖️ Hybrid is adaptive. It peeks at the size of the filtered universe once (threshold 1000) and routes on selectivity: small set → scope the semantic query to those ids (filter-first, authoritative); big set → rank unscoped and drop off-filter hits locally (rank-then-filter, with a recall ceiling that is harmless in practice).
  • 🔒 The view-permission gate runs after rerank, and that's safe — because tenant isolation is enforced earlier, at retrieval, from the token.
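
The adaptive Hybrid bullet above can be sketched roughly like this. It is a minimal illustration, not the project's code: `filtered_ids` and `semantic_search` are hypothetical callables, and the `<=` comparison is an assumption. Only the threshold of 1000 comes from the post, and Part 2 covers the real path.

```python
from typing import Callable, Optional, Sequence

SELECTIVITY_THRESHOLD = 1000  # from the post; everything else here is hypothetical


def hybrid_retrieve(
    query: str,
    filtered_ids: Callable[[], Sequence[str]],
    semantic_search: Callable[[str, Optional[Sequence[str]]], list],
) -> tuple:
    """Pick filter-first or rank-then-filter based on how selective the filter is."""
    ids = filtered_ids()  # the single "peek" at the filtered universe
    if len(ids) <= SELECTIVITY_THRESHOLD:
        # Small set: scope the semantic query to exactly those ids.
        return "filter-first", semantic_search(query, ids)
    # Big set: rank unscoped, then drop off-filter hits locally.
    allowed = set(ids)
    hits = semantic_search(query, None)
    return "rank-then-filter", [h for h in hits if h in allowed]
```

The returned label is only there to make the chosen branch visible; a real implementation would more likely log it.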

🧭 The flow, end to end

Here's the whole turn. A message comes in, the supervisor routes it, the search graph runs one retrieval shape, reranks, gates on permissions, and finalizes for the UI:

                 ┌─────────────────────────────────────┐
   user turn ──► │ PLANNER (one Bedrock call)           │
                 │ route + rewrite + strategy + filters │
                 └──────────────┬──────────────────────┘
                                │ deterministic fallback if it fails
                                ▼
         ┌─────────────────── retrieval graph ───────────────────┐
         │   classify_intent                                     │
         │        │  (NoMatch / NeedsClarification skip ahead)   │
         │        ▼                                              │
         │   retrieve        ── picks ONE: Metadata / Content /  │
         │                      Hybrid (adaptive on selectivity) │
         │        ▼                                              │
         │   rank_results    ── Cohere over content; metadata    │
         │                      passes through unscored          │
         │        ▼                                              │
         │   access_gate     ── fail-closed VIEW gate on the     │
         │                      content lane only                │
         │        ▼                                              │
         │   format_response ── floor, shape, SSE                │
         └───────────────────────────────────────────────────────┘

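The node order is the same in the graph definition, though the code uses its own node names (intent_detection for classify_intent, rerank for rank_results, permission_filter for access_gate, finalize_results for format_response) and opens with a query_rewriting node that the diagram does not draw: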

builder.add_edge(START, "query_rewriting")
builder.add_edge("query_rewriting", "intent_detection")
builder.add_conditional_edges(
    "intent_detection",
    _route_after_intent_detection,
    {"retrieve": "retrieve", "finalize_results": "finalize_results"},
)
builder.add_edge("retrieve", "rerank")
builder.add_edge("rerank", "permission_filter")
builder.add_edge("permission_filter", "finalize_results")
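
As a rough illustration of the two middle stages in the diagram (rank on content only, metadata passing through unscored, then a fail-closed view gate on the content lane), here is a minimal sketch. The `Candidate` shape, `score_fn`, and `can_view` are hypothetical stand-ins, not the project's API.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Candidate:
    doc_id: str
    kind: str                      # "content" or "metadata"
    text: str = ""
    score: Optional[float] = None  # None means "no relevance rank", not zero


def rank_results(query: str, candidates: list, score_fn: Callable[[str, str], float]) -> list:
    """Score and sort content candidates; metadata candidates pass through unscored."""
    if not any(c.kind == "content" for c in candidates):
        return list(candidates)  # service order preserved, scores stay None
    scored = [replace(c, score=score_fn(query, c.text)) for c in candidates]
    return sorted(scored, key=lambda c: c.score, reverse=True)


def access_gate(candidates: list, can_view: Callable[[str], bool]) -> list:
    """Fail closed: a content hit is kept only if the check explicitly returns True."""
    kept = []
    for c in candidates:
        if c.kind != "content":
            kept.append(c)  # the gate applies to the content lane only
            continue
        try:
            allowed = can_view(c.doc_id)
        except Exception:
            allowed = False  # an erroring permission check counts as "deny"
        if allowed:
            kept.append(c)
    return kept
```

Because a turn carries only one kind of candidate, `rank_results` never has to compare a scored row with an unscored one.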

Four decisions made this work. Two of them are in this post; the other two are Part 2. I'll take them in order.

1️⃣ The router is one call, not three

The v1 router ran three sequential Bedrock hops per turn: one to pick a route, one to rewrite the user's query into a clean retrieval string, one to classify intent and extract filters. Three round trips, in series, before any retrieval even started. Each one waits on the last — and the user just watches a spinner the whole time.

My pushback in review was simple: don't run sequential model calls when one call can return the whole decision. A router's job is to emit a structured plan. There's no reason route, rewrite, and intent need to be three separate inferences — they're three fields of one object. So we collapsed them. The supervisor now makes a single structured-output call that returns a typed plan, which gets adapted into the downstream routing contract:

result = await bedrock_client.invoke_with_structured_response_with_fallback_async(
    messages=[SystemMessage(content=system), HumanMessage(content=message)],
    response_structure=Plan,
    chat_model_chain=model_chain,
    max_tokens=1024,
    temperature=0.0,
)

One call, temperature=0.0, and everything the downstream graph needs comes back together: the route, the rewritten query, the strategy, the structured filters. The RouterDecision it adapts into is the contract the search graph consumes:

class RouterDecision(BaseModel):
    route: Literal["documents.search", "documents.doc_context", "general.help"]
    rewritten_query: str = Field(default="")
    strategy: SearchStrategy  # MetadataOnly | ContentOnly | Hybrid | NoMatch | NeedsClarification
    search_value: str = Field(default="")
    filters: list[DocumentFilter] = Field(default_factory=list)
    ...

The twist that bit back. Merging three easy classifications into one harder one means the model now has to do all of it in a single pass — and the small/cheap model we wanted started getting it wrong. So the router moved up to a stronger (and slower) model. The "3 calls → 1 call" math promised a big latency win; the model upgrade promptly taxed a chunk of it back. (Presenting a "latency win" that your own model bump immediately claws into is a humbling little moment — I recommend the experience to no one.) We shipped anyway, because correctness moved the right way and the architecture got dramatically simpler — and because of the second non-negotiable below.

💡 The reusable lesson: a call-merge latency win can be partly clawed back by the accuracy upgrade the merge forces. Budget for that. And always keep a deterministic fallback.
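
To make "budget for that" concrete, here is a back-of-the-envelope sketch. The latencies are invented purely for illustration; the post does not give measured numbers.

```python
def sequential_latency_ms(per_call_ms: float, calls: int) -> float:
    """Total wall-clock time when each call waits on the previous one."""
    return per_call_ms * calls


SMALL_MODEL_MS = 400.0  # hypothetical per-call latency of the small model
LARGE_MODEL_MS = 900.0  # hypothetical per-call latency of the stronger model

v1_ms = sequential_latency_ms(SMALL_MODEL_MS, 3)        # three hops, small model
promised_v2_ms = sequential_latency_ms(SMALL_MODEL_MS, 1)  # one hop, if the small model had coped
actual_v2_ms = sequential_latency_ms(LARGE_MODEL_MS, 1)    # one hop, stronger model

promised_saving_ms = v1_ms - promised_v2_ms
actual_saving_ms = v1_ms - actual_v2_ms
clawed_back_ms = promised_saving_ms - actual_saving_ms
```

With these made-up inputs the merge still wins, but more than half of the promised saving goes to the model upgrade. The shape of the calculation is the point, not the numbers.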

The deterministic fallback is the part I'd flag for anyone copying this. The structured call returns None on any failure (the model is down, the output won't parse, the plan isn't exactly one step), and the caller drops to a non-LLM router:

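# Compute the non-LLM route first so it is always available.
fallback = route_request(body, intent_resolver=resolve_doc_intent)
plan, router_usage = await supervise_turn(...)
routing = plan_to_router_decision(plan)
if routing is None:
    routing = fallback  # smart path failed: use the deterministic route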

The copilot never hangs on a flaky router call. If the smart path can't answer, a dumb-but-reliable path does. That's what lets you run the router on a heavier model without making it a single point of failure.
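
The same pattern, reduced to a self-contained sketch. `with_fallback` is a hypothetical helper, and the timeout is an assumption added here to cover the "never hangs" case; the post does not say how the real client bounds a slow call.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def with_fallback(
    smart: Callable[[], Awaitable[Optional[T]]],
    fallback: T,
    timeout_s: float = 5.0,
) -> T:
    """Return the smart path's result, or the precomputed fallback on any failure."""
    try:
        result = await asyncio.wait_for(smart(), timeout=timeout_s)
    except Exception:  # model down, parse error, or timeout
        return fallback
    # A None result means "no usable plan", which is also a failure.
    return fallback if result is None else result
```

Computing the fallback before the model call, as the snippet above does, means the failure branch has no work left to do and cannot itself fail.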

2️⃣ One strategy per query — the reframe that fixed the muddy results

This is the heart of the v2 change, and the part I argued hardest for — loudly, in more than one meeting.

The v1 retrieval ran two member-scoped lanes in parallel — an OpenSearch hybrid lane (vector + BM25) and a structured metadata lane — and fused them into one set before reranking. The intuition was "more recall is better, let the reranker sort it out." It doesn't work, and the reason is specific: a metadata hit and a content hit are not the same kind of object.

A cross-encoder reranker scores text. You hand it a query and a list of passages, it returns a relevance number per passage. A content chunk has text. A metadata row — "status = effective, author = me" — has no passage. When you stuff that row into the reranker by stringifying a title or a key-id, you get back a number that means nothing, on the same 0-1 scale as the real text scores. Nothing downstream can tell the calibrated score from the garbage one. The rank looks plausible and is quietly wrong.
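
A toy version of the problem, assuming a simple token-overlap scorer as a stand-in for a real cross-encoder (this is not how Cohere scores anything, and the metadata row and title are invented):

```python
def toy_relevance(query: str, passage: str) -> float:
    """Toy stand-in for a reranker: Jaccard overlap of lowercase tokens, in [0, 1]."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0


query = "equipment cleaning procedure"
content_chunk = "cleaning procedure for shared equipment between batches"
metadata_row = {"status": "effective", "author": "me", "title": "SOP-0042"}  # hypothetical row

# The v1 mistake: turn the row into a string so the reranker will accept it.
stringified_row = " ".join(str(v) for v in metadata_row.values())

content_score = toy_relevance(query, content_chunk)
metadata_score = toy_relevance(query, stringified_row)
```

Both results are plain floats on the same 0-1 scale. The row may satisfy the user's filter exactly, yet its score reflects only that "effective me SOP-0042" shares no words with the query, and nothing in the number itself marks it as a different kind of measurement.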

So v2 picks exactly one retrieval shape per query and runs only that. The strategy comes from the router as a literal:

SearchStrategy = Literal[
    "MetadataOnly", "ContentOnly", "Hybrid", "NoMatch", "NeedsClarification"
]

| Strategy | What it means | How it's retrieved | Has a relevance rank? |
| --- | --- | --- | --- |
| MetadataOnly | structured filters only ("my effective SOPs") | Documents service, filtered rows in service order | No: a satisfied filter isn't a relevance signal |
| ContentOnly | topic search ("policy on cleaning between batches") | OpenSearch hybrid over chunks, unscoped | Yes: Cohere rerank |
| Hybrid | topic and filters ("effective docs about cleaning") | adaptive (Part 2) | Yes: Cohere rerank |
| NoMatch | not a document query (greeting, capability question) | skips retrieval entirely | n/a |
| NeedsClarification | a document query too vague to retrieve ("show me stuff") | skips retrieval, asks a clarifying question | n/a |

The lanes are never unioned or cross-scored anymore — retrieve says so in plain terms:

# The lanes are never unioned or cross-scored: a content turn returns content
# candidates, a metadata turn returns metadata candidates.
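
A minimal sketch of what single-shape dispatch could look like. `fetch_metadata` and `search_content` are hypothetical callables standing in for the Documents service and the OpenSearch lane, and Hybrid is left as a stub because Part 2 covers it.

```python
from typing import Any, Callable, Literal, Sequence

Strategy = Literal["MetadataOnly", "ContentOnly", "Hybrid", "NoMatch", "NeedsClarification"]


def retrieve_one_shape(
    strategy: Strategy,
    query: str,
    filters: Sequence[Any],
    fetch_metadata: Callable[[Sequence[Any]], list],
    search_content: Callable[[str], list],
) -> dict:
    """Run exactly one lane and tag the result with the lane it came from."""
    if strategy == "MetadataOnly":
        return {"lane": "metadata", "candidates": fetch_metadata(filters)}
    if strategy == "ContentOnly":
        return {"lane": "content", "candidates": search_content(query)}
    if strategy == "Hybrid":
        raise NotImplementedError("adaptive path; covered in Part 2")
    # NoMatch and NeedsClarification are routed around retrieval entirely.
    raise ValueError(f"{strategy} skips retrieval and should never reach this node")
```

Tagging the result with its lane lets later stages decide whether a relevance score is expected at all, instead of inferring it from the data.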

There's a nice side effect in the graph: NoMatch and NeedsClarification short-circuit straight to finalize_results, skipping retrieval, rerank, and the permission gate. A greeting shouldn't make the UI flash "Searching... / No matches" and shouldn't cost three round trips.

def _route_after_intent_detection(state) -> str:
    return ("finalize_results"
            if state.get("intent_strategy") in ("NoMatch", "NeedsClarification")
            else "retrieve")

The honest cost. Going single-strategy meant deprecating the old global keyword lane entirely. The practical fallout: custom-field values are no longer reachable as a filter, because the content-keyword fold that used to (badly) cover them is gone. That's a real regression on a narrow feature, parked until a dedicated Documents endpoint for custom fields lands. It stings to ship a regression on purpose — but I'd take a clean rank with one honest, documented gap over a muddy rank that hides a dozen.

💡 If you take one thing from this post: don't fuse object types into a single rerank set just to maximize recall. Pick the retrieval shape that matches the query, and run only that. Recall you can't rank cleanly is recall you can't show.

🎬 To be continued

So far the story has a tidy shape. One router call instead of three. One retrieval strategy per query instead of a fused blend. Pick MetadataOnly, or ContentOnly, and run only that. Clean.

But I skipped the strategy that refuses to be tidy — the one where the user genuinely wants both at once. "Effective SOPs about equipment cleaning" is a topic and a filter, and you can't honor it by picking a single lane. Running the filter and the rank in parallel is fast, but you have to reconcile two sets. Running them in sequence is precise, but slow. There's no single right answer — which is exactly why it's the interesting part.

That's Hybrid, and in Part 2 a single cheap question turns that coin-flip into a data-driven decision. Then there's the permission gate that runs in a spot that makes security people flinch on first read — after the rank — and why it's actually safe. Finally, the one place all of this landed on my side of the stack: a null relevance score that the frontend has to treat as a first-class state, not a missing number.