Building a multi-agent document-search copilot — Part 1: muddy results, and one strategy per query

The first version ranked documents badly — and worse, it ranked them badly in a way that looked fine on the architecture diagram. Those are the bugs that get under my skin: every box is green, every arrow points the right way, and the answer is still wrong.

We were building a chat copilot over a regulated document store — the kind where a user types "show me my effective SOPs about equipment cleaning" and expects the right handful of documents back, ranked, with an excerpt and a reason. The v1 design did the obvious thing: run two retrieval lanes in parallel — a structured metadata lane and a semantic content lane — union the hits, rerank the union, render. Clean diagram. Muddy results. We'd open the demo, the pipeline would light up green end to end, and the list that came back was mush: the metadata rows polluted the semantic rank, the relevance scores stopped meaning anything, and there was no clean ordering left to show the user. The architecture was elegant. The experience was not.

This is a two-part story of how that became v2: one strategy per query, never mixed, a router that's a single structured-output call, and a Hybrid path that peeks at the data before it decides how to retrieve. It's an architecture post, so I'll keep it anchored in the specific decisions that actually moved — not a generic "how to build RAG" walkthrough. Part 1 (this post) is the problem and the first two reframes. Part 2 is the hard case — Hybrid — and the permission model.

🗺️ The series at a glance

This is Part 1 of 2.

Part 1 (this post) — the problem and the first two reframes:

🌫️ v1 fused two parallel lanes into one rerank → muddy. Metadata rows have no text; a reranker scores text; fusing them corrupts the one number the UI depends on.
📞 The router is one Bedrock structured-output call, not three sequential hops. Route + rewritten query + strategy + filters come back together, with a deterministic fallback. The catch: the merged task got too hard for a small model, so it runs on a bigger one.
🎯 v2 picks exactly one shape per query: MetadataOnly, ContentOnly, or Hybrid. The lanes are never unioned or cross-scored.

Part 2 — the hard case and the safety model:

⚖️ Hybrid is adaptive. It peeks the filtered-universe size once (threshold 1000) and routes on selectivity: small set → scope the semantic query to those ids (filter-first, authoritative); big set → rank unscoped and drop off-filter hits locally (rank-then-filter, harmless recall ceiling).
🔒 The view-permission gate runs after rerank, and that's safe — because tenant isolation is enforced earlier, at retrieval, from the token.

🧭 The flow, end to end

Here's the whole turn. A message comes in, the supervisor routes it, the search graph runs one retrieval shape, reranks, gates on permissions, and finalizes for the UI:

                 ┌──────────────────────────────────────┐
   user turn ──► │ SUPERVISOR (one Bedrock call)        │
                 │ route + rewrite + strategy + filters │
                 └──────────────┬───────────────────────┘
                                │ deterministic fallback if it fails
                                ▼
         ┌──────────────────── retrieval graph ────────────────────┐
         │   query_rewriting                                       │
         │        ▼                                                │
         │   intent_detection                                      │
         │        │  (NoMatch / NeedsClarification skip ahead)     │
         │        ▼                                                │
         │   retrieve          ── picks ONE: Metadata / Content /  │
         │                        Hybrid (adaptive on selectivity) │
         │        ▼                                                │
         │   rerank            ── Cohere over content; metadata    │
         │                        passes through unscored          │
         │        ▼                                                │
         │   permission_filter ── fail-closed VIEW gate on the     │
         │                        content lane only                │
         │        ▼                                                │
         │   finalize_results  ── floor, shape, SSE                │
         └─────────────────────────────────────────────────────────┘

The node order is exactly that, straight from the graph definition:

builder.add_edge(START, "query_rewriting")
builder.add_edge("query_rewriting", "intent_detection")
builder.add_conditional_edges(
    "intent_detection",
    _route_after_intent_detection,
    {"retrieve": "retrieve", "finalize_results": "finalize_results"},
)
builder.add_edge("retrieve", "rerank")
builder.add_edge("rerank", "permission_filter")
builder.add_edge("permission_filter", "finalize_results")

Four decisions made this work. Two of them are in this post; the other two are Part 2. I'll take them in order.

1️⃣ The router is one call, not three

The v1 router ran three sequential Bedrock hops per turn: one to pick a route, one to rewrite the user's query into a clean retrieval string, one to classify intent and extract filters. Three round trips, in series, before any retrieval even started. Each one waits on the last — and the user just watches a spinner the whole time.

My pushback in review was simple: don't run sequential model calls when one call can return the whole decision. A router's job is to emit a structured plan. There's no reason route, rewrite, and intent need to be three separate inferences — they're three fields of one object. So we collapsed them. The supervisor now makes a single structured-output call that returns a typed plan, which gets adapted into the downstream routing contract:

result = await bedrock_client.invoke_with_structured_response_with_fallback_async(
    messages=[SystemMessage(content=system), HumanMessage(content=message)],
    response_structure=Plan,
    chat_model_chain=model_chain,
    max_tokens=1024,
    temperature=0.0,
)

One call, temperature=0.0, and everything the downstream graph needs comes back together: the route, the rewritten query, the strategy, the structured filters. The RouterDecision it adapts into is the contract the search graph consumes:

class RouterDecision(BaseModel):
    route: Literal["documents.search", "documents.doc_context", "general.help"]
    rewritten_query: str = Field(default="")
    strategy: SearchStrategy  # MetadataOnly | ContentOnly | Hybrid | NoMatch | NeedsClarification
    search_value: str = Field(default="")
    filters: list[DocumentFilter] = Field(default_factory=list)
    ...

The twist that bit back. Merging three easy classifications into one harder one means the model now has to do all of it in a single pass, and the small, cheap model we wanted started getting it wrong. So the router moved up to a stronger (and slower) model. The "3 calls → 1 call" math promised a big latency win; the model upgrade promptly taxed a chunk of it back. (Presenting a "latency win" that your own model bump immediately claws into is a humbling little moment. I recommend the experience to no one.) We shipped anyway, because correctness moved the right way, the architecture got dramatically simpler, and the deterministic fallback described below keeps the heavier model from becoming a single point of failure.

💡 The reusable lesson: a call-merge latency win can be partly clawed back by the accuracy upgrade the merge forces. Budget for that. And always keep a deterministic fallback.
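The budgeting is plain arithmetic, so here is a back-of-envelope sketch. The millisecond figures are placeholders picked only to show the shape of the calculation, not measurements from this system, and both function names are made up for the example.

```python
def sequential_latency_ms(hop_latencies_ms: list[float]) -> float:
    """Hops in series: each waits on the last, so their latencies add."""
    return sum(hop_latencies_ms)


def merged_saving_ms(hop_latencies_ms: list[float], merged_call_ms: float) -> float:
    """Latency saved by replacing the serial hops with one merged call.

    A negative result means the merged call is slower than the hops it replaced.
    """
    return sequential_latency_ms(hop_latencies_ms) - merged_call_ms


# Placeholder figures, NOT measurements: three small-model hops vs. one call.
small_hops = [300.0, 300.0, 300.0]

# What the slide promises: one call on the same small model.
naive_saving = merged_saving_ms(small_hops, merged_call_ms=300.0)

# What ships: one call on a stronger, slower model.
actual_saving = merged_saving_ms(small_hops, merged_call_ms=700.0)
```

With these made-up inputs the promised saving is 600 ms and the shipped saving is 200 ms. The gap between the two is the number to put in the plan before the review, not after.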

The deterministic fallback is the part I'd flag for anyone copying this. The structured call returns None on any failure (the model is down, the output won't parse, the plan isn't exactly one step), and the caller drops to a non-LLM router:

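# Compute the non-LLM route up front so it is always available.
fallback = route_request(body, intent_resolver=resolve_doc_intent)
plan, router_usage = await supervise_turn(...)
routing = plan_to_router_decision(plan)
if routing is None:  # model down, unparseable output, or not exactly one step
    routing = fallback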

The copilot never hangs on a flaky router call. If the smart path can't answer, a dumb-but-reliable path does. That's what lets you run the router on a heavier model without making it a single point of failure.
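For anyone who wants the pattern without our codebase, here is a minimal self-contained sketch. Everything in it is a stand-in: `Step`, `Plan`, `Decision`, `plan_to_decision`, `deterministic_route`, and `route_turn` are hypothetical names, the plan shape (a list of steps) is assumed from the "exactly one step" rule, and the keyword check is far cruder than a real deterministic router would be.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional


@dataclass
class Step:
    route: str
    strategy: str
    rewritten_query: str = ""


@dataclass
class Plan:
    steps: list[Step] = field(default_factory=list)


@dataclass
class Decision:
    route: str
    strategy: str
    rewritten_query: str = ""
    source: str = "llm"  # which path produced this decision


def plan_to_decision(plan: Optional[Plan]) -> Optional[Decision]:
    """Adapt a plan into a decision; None unless the plan has exactly one step."""
    if plan is None or len(plan.steps) != 1:
        return None
    step = plan.steps[0]
    return Decision(step.route, step.strategy, step.rewritten_query)


def deterministic_route(message: str) -> Decision:
    """Hypothetical non-LLM router: a crude keyword check that always answers."""
    lowered = message.lower()
    if any(word in lowered for word in ("sop", "document", "policy")):
        return Decision("documents.search", "ContentOnly", message, source="fallback")
    return Decision("general.help", "NoMatch", "", source="fallback")


Planner = Callable[[str], Awaitable[Optional[Plan]]]


async def route_turn(message: str, planner: Planner) -> Decision:
    fallback = deterministic_route(message)  # computed before the model is consulted
    try:
        plan = await planner(message)
    except Exception:  # model down, timeout, unparseable output
        plan = None
    decision = plan_to_decision(plan)
    return decision if decision is not None else fallback
```

The design choice worth copying is the ordering: the fallback exists before the model call starts, so no failure mode of the smart path can leave the turn without a route.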

2️⃣ One strategy per query — the reframe that fixed the muddy results

This is the heart of the v2 change, and the part I argued hardest for — loudly, in more than one meeting.

The v1 retrieval ran two member-scoped lanes in parallel — an OpenSearch hybrid lane (vector + BM25) and a structured metadata lane — and fused them into one set before reranking. The intuition was "more recall is better, let the reranker sort it out." It doesn't work, and the reason is specific: a metadata hit and a content hit are not the same kind of object.

A cross-encoder reranker scores text. You hand it a query and a list of passages, it returns a relevance number per passage. A content chunk has text. A metadata row — "status = effective, author = me" — has no passage. When you stuff that row into the reranker by stringifying a title or a key-id, you get back a number that means nothing, on the same 0-1 scale as the real text scores. Nothing downstream can tell the calibrated score from the garbage one. The rank looks plausible and is quietly wrong.
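A toy sketch makes the failure easier to see. `toy_score` below is a token-overlap stand-in that I made up for the example; it is not a cross-encoder and says nothing about how Cohere's model actually scores. It only produces numbers in the 0-1 range so the fused list can be inspected.

```python
def toy_score(query: str, text: str) -> float:
    """Jaccard token overlap in [0, 1]. A stand-in, NOT a cross-encoder."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


query = "equipment cleaning"

# A real content chunk: it has a passage the scorer can read.
passage = "rinse equipment after each batch and record cleaning in the log"

# A metadata row has no passage, so a title gets stuffed in where one should be.
stringified_row = "equipment cleaning sop"

# v1-style fusion: score both kinds of object, then sort them as one list.
fused = sorted(
    [
        (toy_score(query, passage), passage),
        (toy_score(query, stringified_row), stringified_row),
    ],
    reverse=True,
)
```

After the sort, the stringified title sits above the real passage, and nothing in `fused` records which number came from which kind of object. That missing provenance is the part that carries over to a real reranker, whatever its actual scores look like.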

So v2 picks exactly one retrieval shape per query and runs only that. The strategy comes from the router as a literal:

SearchStrategy = Literal[
    "MetadataOnly", "ContentOnly", "Hybrid", "NoMatch", "NeedsClarification"
]

| Strategy | What it means | How it's retrieved | Has a relevance rank? |
| --- | --- | --- | --- |
| `MetadataOnly` | structured filters only ("my effective SOPs") | Documents service, filtered rows in service order | No: a satisfied filter isn't a relevance signal |
| `ContentOnly` | topic search ("policy on cleaning between batches") | OpenSearch hybrid over chunks, unscoped | Yes: Cohere rerank |
| `Hybrid` | topic and filters ("effective docs about cleaning") | adaptive (Part 2) | Yes: Cohere rerank |
| `NoMatch` | not a document query (greeting, capability question) | skips retrieval entirely | n/a |
| `NeedsClarification` | a document query too vague to retrieve ("show me stuff") | skips retrieval, asks a clarifying question | n/a |

The lanes are never unioned or cross-scored anymore — retrieve says so in plain terms:

# The lanes are never unioned or cross-scored: a content turn returns content
# candidates, a metadata turn returns metadata candidates.
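Here is a minimal sketch of what "one lane per turn" can look like as code. It is not our retrieve node: `Candidate`, `retrieve_one`, `rerank`, and the `score` callable are hypothetical names, and treating a mixed set as a hard error is my assumption about how strictly to enforce the rule.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional

Lane = Literal["metadata", "content"]


@dataclass
class Candidate:
    doc_id: str
    lane: Lane
    text: Optional[str] = None         # metadata rows carry no passage
    relevance: Optional[float] = None  # None means "not ranked", not "scored zero"


Retriever = Callable[[], list[Candidate]]
Scorer = Callable[[str, str], float]


def retrieve_one(strategy: str, lanes: dict[str, Retriever]) -> list[Candidate]:
    """Run exactly one lane for the turn; non-retrieval strategies return nothing."""
    if strategy in ("NoMatch", "NeedsClarification"):
        return []
    return lanes[strategy]()


def rerank(query: str, candidates: list[Candidate], score: Scorer) -> list[Candidate]:
    """Score content candidates; pass metadata through; refuse a mixed set."""
    kinds = {c.lane for c in candidates}
    if len(kinds) > 1:
        raise ValueError("metadata and content candidates must not share a rank set")
    if kinds != {"content"}:
        return candidates  # metadata (or empty): service order, left unscored
    for c in candidates:
        c.relevance = score(query, c.text or "")
    return sorted(candidates, key=lambda c: c.relevance, reverse=True)
```

Two choices here are deliberate. A metadata candidate keeps `relevance=None` rather than a fake number, so the UI can tell "not ranked" from "ranked low". And a mixed set raises instead of degrading quietly, because the v1 failure was precisely the quiet kind.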

There's a nice side effect in the graph: NoMatch and NeedsClarification short-circuit straight to finalize_results, skipping retrieval, rerank, and the permission gate. A greeting shouldn't make the UI flash "Searching... / No matches", and it shouldn't pay for three downstream stages it doesn't need.

def _route_after_intent_detection(state) -> str:
    return ("finalize_results"
            if state.get("intent_strategy") in ("NoMatch", "NeedsClarification")
            else "retrieve")

The honest cost. Going single-strategy meant deprecating the old global keyword lane entirely. The practical fallout: custom-field values are no longer reachable as a filter, because the content-keyword fold that used to (badly) cover them is gone. That's a real regression on a narrow feature, parked until a dedicated Documents endpoint for custom fields lands. It stings to ship a regression on purpose — but I'd take a clean rank with one honest, documented gap over a muddy rank that hides a dozen.

💡 If you take one thing from this post: don't fuse object types into a single rerank set just to maximize recall. Pick the retrieval shape that matches the query, and run only that. Recall you can't rank cleanly is recall you can't show.

🎬 To be continued

So far the story has a tidy shape. One router call instead of three. One retrieval strategy per query instead of a fused blend. Pick MetadataOnly, or ContentOnly, and run only that. Clean.

But I skipped the strategy that refuses to be tidy — the one where the user genuinely wants both at once. "Effective SOPs about equipment cleaning" is a topic and a filter, and you can't honor it by picking a single lane. Running the filter and the rank in parallel is fast, but you have to reconcile two sets. Running them in sequence is precise, but slow. There's no single right answer — which is exactly why it's the interesting part.

That's Hybrid, and in Part 2 a single cheap question turns that coin-flip into a data-driven decision. Then there's the permission gate that runs in a spot that makes security people flinch on first read — after the rank — and why it's actually safe. Finally, the one place all of this landed on my side of the stack: a null relevance score that the frontend has to treat as a first-class state, not a missing number.
