I added a reranker to my RAG pipeline — it broke everything, then I fixed it

In v2 I added hybrid retrieval (FAISS + BM25) to fix keyword blind spots. All 19 test questions passed. The next item on my list was a cross-encoder reranker for better precision.

The idea is standard: over-fetch candidates, rerank with a smarter model, keep the top-k. Every RAG tutorial recommends it. It took me 20 minutes to implement and immediately broke 2 of my 19 tests.

Here's what went wrong and the strategy I landed on.

What a cross-encoder does (and why it's better)

In v2, retrieval uses bi-encoders — the query and each chunk are embedded independently, then compared by cosine similarity. Fast, but the model never sees query and chunk together.

A cross-encoder is different. It takes the (query, chunk) pair as a single input and outputs a relevance score. It can attend to both simultaneously — word-level interactions, negation, paraphrasing. Much more accurate, but too slow for first-stage retrieval because you'd need to score every chunk in the index.

The standard two-stage pattern:

Stage 1: cheap retrieval (FAISS + BM25) → broad candidate set
Stage 2: cross-encoder reranks candidates → precise top-k → LLM
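
As a rough sketch of that two-stage shape, independent of any particular library: `two_stage_search`, `word_overlap`, and `phrase_match` below are hypothetical names, and the two scorers are toy stand-ins (not FAISS, BM25, or a real cross-encoder). The only point is that the cheap scorer runs over the whole index while the expensive one sees just the over-fetched candidates.

```python
from typing import Callable, List


def two_stage_search(
    query: str,
    chunks: List[str],
    first_stage: Callable[[str, str], float],  # cheap, runs over every chunk
    pair_scorer: Callable[[str, str], float],  # expensive, runs over candidates only
    top_k: int = 3,
    overfetch: int = 2,
) -> List[str]:
    # Stage 1: score the whole index cheaply, keep top_k * overfetch candidates.
    candidates = sorted(chunks, key=lambda c: first_stage(query, c), reverse=True)
    candidates = candidates[: top_k * overfetch]
    # Stage 2: rescore only those candidates with the pairwise scorer.
    reranked = sorted(candidates, key=lambda c: pair_scorer(query, c), reverse=True)
    return reranked[:top_k]


# Toy stand-ins, for illustration only.
def word_overlap(query: str, chunk: str) -> float:
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_match(query: str, chunk: str) -> float:
    return 1.0 if query.lower() in chunk.lower() else 0.0


chunks = [
    "the robot arm ships in march",
    "pricing for the robot arm starts at 900",
    "robot arm pricing is reviewed yearly",
    "our office is in tallinn",
]
result = two_stage_search("robot arm pricing", chunks, word_overlap, phrase_match, top_k=2)
print(result)
```

Here the first stage cannot tell the two pricing chunks apart (both share three words with the query), and the pairwise scorer breaks the tie in favour of the chunk containing the exact phrase.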

The implementation (the easy part)

New file — app/reranker.py:

from sentence_transformers import CrossEncoder

RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"

_reranker = None

def get_reranker():
    global _reranker
    if _reranker is None:
        _reranker = CrossEncoder(RERANKER_MODEL_NAME)
    return _reranker

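def rerank(query, retrievals, top_k):
    if not retrievals:
        return []
    model = get_reranker()
    # Score each (query, chunk) pair jointly; higher means more relevant.
    pairs = [[query, r.chunk.text] for r in retrievals]
    scores = model.predict(pairs)
    # Note: this overwrites the first-stage score on each retrieval in place.
    for r, score in zip(retrievals, scores):
        r.score = float(score)
    ranked = sorted(retrievals, key=lambda r: r.score, reverse=True)
    return ranked[:top_k]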

And in main.py, over-fetch then rerank:

# Before (v2): retrieve top_k directly
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)

# After (v3): over-fetch, then rerank
candidates = store.search(query_vec, top_k=req.top_k * 2, query_text=req.question)
retrieved = rerank(req.question, candidates, top_k=req.top_k)

No new dependency — cross-encoder/ms-marco-MiniLM-L-6-v2 works through sentence-transformers which was already installed. The model is ~80MB, runs on CPU.

I ran the eval. Two tests broke.

What broke

Question:  Who is the CEO of Zentara Robotics?
Expected:  ['Iris Kallas']
Got:       I couldn't find that in the document.

Question:  How many employees does Zentara have?
Expected:  ['287']
Got:       I couldn't find that in the document.

The exact same two questions that failed in v1 with pure FAISS. Hybrid retrieval fixed them. The reranker un-fixed them.

Why the cross-encoder hates tables

The CEO chunk looks like this:

Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287 | Founded: 2018 ...

Dense. Tabular. Eight facts crammed together.

The cross-encoder (ms-marco-MiniLM-L-6-v2) was trained on MS MARCO — a web search dataset where passages are natural language paragraphs. When it sees a fact-packed table row as a "passage" for the query "Who is the CEO?", it scores it low. It doesn't look like a good answer, even though it contains the answer.

Meanwhile, hybrid retrieval ranked this chunk #1 — BM25 matched "CEO" exactly and RRF boosted it. The cross-encoder then threw it away.

What I tried (and why it failed)

I went through 7 approaches before finding one that worked. Here's the progression:

#	Approach	Result
1	Pure CE rerank	CE buries table chunks
2	Bigger candidate pool (15)	More candidates = more competition
3	Score blending (0.7 CE + 0.3 RRF)	CE score is so negative it still dominates
4	Score blending (0.5 + 0.5)	Still not enough
5	RRF fusion of CE + first-stage rankings	K=60 makes all rank contributions ~equal, CE rank wins
6	Weighted RRF (2x first-stage)	Still too flat with K=60
7	Smaller pool (top_k * 2)	CE still pushes table chunks out
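
To make rows 5 and 6 concrete, here is the arithmetic for reciprocal rank fusion with k=60. The rank pairs are hypothetical (a table chunk ranked 1st by the first stage and 10th of 10 by the cross-encoder, compared against other made-up chunks), and `rrf` is an illustrative helper, not code from the repo. Only the 1/(k + rank) formula is standard.

```python
def rrf(ranks, weights=None, k=60):
    """Reciprocal rank fusion score for one item, given its 1-based rank in each list."""
    weights = weights or [1.0] * len(ranks)
    return sum(w / (k + r) for w, r in zip(weights, ranks))


# Hypothetical (first-stage rank, cross-encoder rank) pairs in a pool of 10.
table_chunk = (1, 10)  # first stage ranks it first, CE ranks it last
middling = (5, 5)      # mediocre in both lists
prose_chunk = (3, 1)   # first-stage #3, CE #1

# With k=60, rank 1 contributes only slightly more than rank 10.
best_vs_worst = (1 / 61) / (1 / 70)

plain_table = rrf(table_chunk)
plain_middling = rrf(middling)

# Row 6: double weight on the first-stage ranking.
weighted_table = rrf(table_chunk, weights=[2.0, 1.0])
weighted_prose = rrf(prose_chunk, weights=[2.0, 1.0])

print(f"rank 1 vs rank 10 contribution ratio: {best_vs_worst:.2f}")
print(f"plain RRF     table={plain_table:.5f}  middling={plain_middling:.5f}")
print(f"weighted RRF  table={weighted_table:.5f}  prose={weighted_prose:.5f}")
```

With these ranks, a chunk that both rankers place 5th edges out the first-stage winner, and doubling the first-stage weight still leaves the (3, 1) chunk ahead of the table chunk.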

The core issue: the cross-encoder's score for table chunks is so negative that no amount of score blending or rank fusion can compensate. It's not a "this chunk ranks slightly lower" problem — it's a "the model actively rejects this format" problem.
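
The blending failure in rows 3 and 4 comes down to scale. Assuming the cross-encoder emits raw, unbounded scores (which the strongly negative values suggest), they sit on a completely different scale from RRF scores, which with k=60 are small fractions. The numbers below are invented to show the mismatch, not measured from the eval, and `blend` is a hypothetical helper.

```python
def blend(ce_score, rrf_score, ce_weight):
    """Linear blend of a cross-encoder score and a first-stage RRF score."""
    return ce_weight * ce_score + (1.0 - ce_weight) * rrf_score


# Hypothetical values, chosen only to show the scale gap.
table_chunk = {"ce": -9.0, "rrf": 2 / 61}  # rank 1 in both first-stage lists
prose_chunk = {"ce": 2.0, "rrf": 2 / 70}   # rank 10 in both first-stage lists

for w in (0.7, 0.5):
    t = blend(table_chunk["ce"], table_chunk["rrf"], w)
    p = blend(prose_chunk["ce"], prose_chunk["rrf"], w)
    print(f"ce_weight={w}: table={t:.3f}  prose={p:.3f}")

# The largest RRF advantage the table chunk can have over the prose chunk here.
rrf_gap = table_chunk["rrf"] - prose_chunk["rrf"]
ce_gap = prose_chunk["ce"] - table_chunk["ce"]
# CE weight below which the table chunk would win the blend.
break_even = rrf_gap / (rrf_gap + ce_gap)
print(f"table chunk wins only if ce_weight < {break_even:.5f}")
```

Under these assumptions the cross-encoder weight would have to drop below a tenth of a percent before the first-stage ranking could win, which is no longer blending in any useful sense.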

What actually worked: guaranteed slots

The insight: the first-stage results are already good. Hybrid retrieval passed all 19 tests. The reranker should improve those results, not override them.

The strategy:

top_k = 3:  guaranteed slots = 2 (from first-stage)  +  1 CE pick
top_k = 5:  guaranteed slots = 4 (from first-stage)  +  1 CE pick

The top first-stage results are preserved. The cross-encoder only gets to fill the last slot from the remaining candidates. Here's the final implementation:

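def rerank(query, retrievals, top_k):
    # Nothing to rerank, or a non-positive top_k (which would make the
    # slices below misbehave).
    if not retrievals or top_k <= 0:
        return []
    # Pool already fits in top_k: keep first-stage order as-is.
    if top_k >= len(retrievals):
        return retrievals

    n_guaranteed = top_k - 1  # first-stage results kept untouched
    n_ce_slots = 1            # slots the cross-encoder may fill

    guaranteed = retrievals[:n_guaranteed]
    remaining = retrievals[n_guaranteed:]

    if remaining:
        model = get_reranker()
        # Score only the leftover candidates against the query.
        pairs = [[query, r.chunk.text] for r in remaining]
        scores = model.predict(pairs)
        for r, score in zip(remaining, scores):
            r.score = round(float(score), 4)
        remaining.sort(key=lambda r: r.score, reverse=True)

    return guaranteed + remaining[:n_ce_slots]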

The CEO chunk (first-stage #1) is always guaranteed. The employee chunk (first-stage rank ~3-4) falls inside the four guaranteed slots at top_k=5, so it is preserved too. The CE still adds value by selecting the most relevant candidate for the final slot.
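
To try the slot logic without downloading the model, here is a self-contained variant with the scorer injected. `Retrieval`, `rerank_with_slots`, and `fake_ce` are hypothetical names for this sketch; `fake_ce` is a deliberately crude stand-in that mimics the failure mode by punishing pipe-delimited rows, and says nothing about how the real model scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Retrieval:  # hypothetical stand-in for the app's retrieval object
    text: str
    score: float = 0.0


def rerank_with_slots(
    query: str,
    retrievals: List[Retrieval],
    top_k: int,
    score_fn: Callable[[str, str], float],
) -> List[Retrieval]:
    if top_k <= 0:
        return []
    if top_k >= len(retrievals):
        return list(retrievals)
    guaranteed = retrievals[: top_k - 1]  # first-stage order, untouched
    remaining = retrievals[top_k - 1:]    # only these are scored
    best = max(remaining, key=lambda r: score_fn(query, r.text))
    return guaranteed + [best]


# Fake scorer: heavily penalise table-like rows, otherwise prefer longer text.
def fake_ce(query: str, text: str) -> float:
    return -10.0 if "|" in text else float(len(text))


first_stage = [
    Retrieval("Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287"),
    Retrieval("Short prose."),
    Retrieval("A longer prose paragraph about the company."),
    Retrieval("Mid-length prose chunk."),
]
query = "Who is the CEO?"

# Pure rerank: the table row is scored last and falls out of the top 3.
pure = sorted(first_stage, key=lambda r: fake_ce(query, r.text), reverse=True)[:3]
# Guaranteed slots: the table row keeps its first-stage position.
picked = rerank_with_slots(query, first_stage, top_k=3, score_fn=fake_ce)

print([r.text[:20] for r in pure])
print([r.text[:20] for r in picked])
```

The scorer never sees the guaranteed chunks, so however badly it would have rated the table row, that row cannot be displaced.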

Result: 19/19 passing.

The pipeline now

PDF ─► extract text ─► chunk ─► embed (MiniLM-L6-v2)
                                        │
                                        ▼
question ─► FAISS + BM25 (2× top_k candidates, RRF fused)
         ─► cross-encoder reranks remaining candidates
         ─► guaranteed first-stage slots + 1 CE-picked slot
         ─► top_k chunks ─► LLM ─► answer + sources

Three retrieval signals now: vector search, keyword search, and a cross-encoder, arranged in two stages (hybrid first stage, then the reranker). Each catches something the others miss.

What I learned

Rerankers aren't drop-in improvements. Every RAG tutorial shows "add a cross-encoder, get better results." In practice, cross-encoders trained on natural language passages can actively hurt retrieval quality on structured or tabular content.
Your eval set is your safety net. Without the 19-question eval harness, I would've shipped this and had no idea I'd regressed on 2 questions. The eval caught it in seconds.
Guaranteed slots > score blending. I tried 7 variations of reranking, score blending, and rank fusion. None worked because the CE's score for table chunks was so negative that it dominated every combination. The fix wasn't mathematical; it was structural: protect what's already working, and let the CE improve the margins.
The retriever still matters most. v1 → v2 (adding BM25) was the biggest accuracy jump. v2 → v3 (adding the reranker) was a precision refinement that nearly caused regressions. Invest in your first-stage retrieval before reaching for rerankers.
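
On the eval-harness point, a harness of this kind can be very small. This is a sketch, not the repo's actual eval script: `EvalCase`, `run_eval`, and `fake_answer` are hypothetical names, and the pass rule (every expected string appears in the answer) is an assumption based on the Expected/Got output shown earlier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    question: str
    expected: List[str]  # substrings that must all appear in the answer


def run_eval(
    cases: List[EvalCase], answer_fn: Callable[[str], str]
) -> Tuple[int, List[Tuple[EvalCase, str]]]:
    """Return (number passed, list of (case, answer) for each failure)."""
    failures = []
    for case in cases:
        answer = answer_fn(case.question)
        if not all(s.lower() in answer.lower() for s in case.expected):
            failures.append((case, answer))
    return len(cases) - len(failures), failures


cases = [
    EvalCase("Who is the CEO of Zentara Robotics?", ["Iris Kallas"]),
    EvalCase("How many employees does Zentara have?", ["287"]),
]


# Stand-in for the real pipeline call; it answers one question and misses the other.
def fake_answer(question: str) -> str:
    if "CEO" in question:
        return "The CEO is Iris Kallas."
    return "I couldn't find that in the document."


passed, failures = run_eval(cases, fake_answer)
for case, answer in failures:
    print(f"FAIL {case.question!r}: expected {case.expected}, got {answer!r}")
print(f"{passed}/{len(cases)} passing")
```

Swapping `fake_answer` for a function that calls the real pipeline turns this into a regression check that can run after every retrieval change.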

What's next

Streaming responses
Conversation memory
Possibly a Streamlit UI

Try it yourself

v3 (reranker): github.com/santanu2908/chat-with-pdf-rag
v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag/tree/v2
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the sample PDF, and try "Who is the CEO?" — it still works, even with the reranker.

If you've hit similar issues with cross-encoders on structured content, I'd love to hear your approach.

I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.
