Why hybrid search is the boring default we keep recommending

A founder we work with had been debugging a confusing failure for two weeks. The internal AI assistant could not find content about "GPT-4o pricing" even though the exact phrase appeared in three different chunks in the vector database. Asking for "OpenAI model costs" returned the right results. Asking for "GPT-4o pricing" returned a discussion of GPT-3 from eighteen months ago.

This is the failure mode every vector-only RAG system has. The team was about to swap embedding models. The actual fix was three lines of configuration.

What vector search is actually good at

Dense vector search is excellent at semantic similarity. "How do I cancel my subscription" matches a document titled "Account termination procedure" because the embedding model has learned the two phrases are semantically close. Synonyms, paraphrases, and cross-language matches all work because the model has seen enough text to know they mean the same thing.

What dense vectors are not good at is exact-token matching for terms the embedding model has not seen often, or has seen in different contexts. "GPT-4o" is a recent product name that probably did not exist when the embedding model was trained. The model treats it as a sequence of subword tokens and does not have a strong concept of it as a unit. As a result, the cosine distance between the query "GPT-4o pricing" and a chunk about "GPT-3 cost" can easily be smaller than the distance between the same query and a chunk that contains the exact phrase "GPT-4o pricing", if that phrase is surrounded by very different context that dominates the chunk's embedding.

This is true for:

Product names, especially recent ones (Claude 4, Llama 3.2, GPT-4o)
Internal acronyms and codes (X100, PRJ-2024-Q3)
Function and identifier names in code documentation
Error codes and stack trace fragments
Rare technical terms in narrow domains

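For these queries, BM25 (the keyword scoring algorithm that Lucene and Elasticsearch use as their default relevance ranking) outperforms vector search by a wide margin, because a rare token that appears in both the query and the chunk contributes directly to the score.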
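To make the exact-token behaviour concrete, here is a minimal BM25 sketch in plain Python. The three documents, the whitespace tokenizer, and the k1 and b values are illustrative assumptions, not the setup from the story above; a production analyzer would tokenize differently.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a whitespace tokenizer that keeps "gpt-4o" as one token.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            # Non-negative IDF variant (the form Lucene uses).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "gpt-4o pricing per million tokens",
    "gpt-3 cost comparison from last year",
    "openai model costs overview",
]
scores = bm25_scores("gpt-4o pricing", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares tokens with the query, so it is the only one with a non-zero score. BM25 has no notion that "gpt-3 cost" is semantically nearby, which is exactly why it complements a dense retriever rather than replacing it.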

What hybrid search actually is

Hybrid search runs both retrievers and combines the results. The vector retriever returns the top-k by cosine similarity. The BM25 retriever returns the top-k by token-overlap score. A merge function (most commonly Reciprocal Rank Fusion) reranks the union into a single ordered list.

Reciprocal Rank Fusion is the boring choice for a reason. It is close to parameterless: it uses only ranks, so there are no weights to tune and no score normalization to fight with. The score for each document is the sum of 1 / (k + rank) across the retrievers it appears in, where rank starts at 1 and k is a smoothing constant (typically 60) that rarely needs tuning. Documents that rank well in both retrievers float to the top. Documents that rank well in only one still appear in the merged list, but lower.
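The formula fits in a few lines. This is a minimal sketch of RRF over ranked lists of document IDs; the function name, the document IDs, and the alphabetical tie-break are our own illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from the two retrievers for one query.
vector_hits = ["doc_gpt3", "doc_costs", "doc_gpt4o"]
bm25_hits = ["doc_gpt4o", "doc_pricing_faq"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Here doc_gpt4o is only third in the vector list but first in the BM25 list, and appearing in both is enough to put it at the top of the fused list. Raw scores from the two retrievers never enter the calculation.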

The teams that try to weight vector and BM25 explicitly (say, "0.7 vector + 0.3 keyword") usually spend a week tuning the weights and then realize RRF was within 2% of the tuned recall and required no tuning at all.
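For contrast, here is roughly what the weighted approach involves. This is a sketch under assumptions: min-max normalization is one common choice among several, and the scores below are made up to show the mechanics, not measured.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, bm25_scores, w_vector=0.7, w_bm25=0.3):
    """Combine normalized scores with explicit weights."""
    v, s = min_max(vector_scores), min_max(bm25_scores)
    combined = {
        doc_id: w_vector * v.get(doc_id, 0.0) + w_bm25 * s.get(doc_id, 0.0)
        for doc_id in set(v) | set(s)
    }
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Invented scores: cosine similarities are bunched together,
# BM25 scores are unbounded and spread out.
vector_scores = {"a": 0.82, "b": 0.80, "c": 0.79}
bm25_scores = {"c": 12.4, "d": 3.1}

vector_heavy = weighted_fusion(vector_scores, bm25_scores, 0.7, 0.3)
keyword_heavy = weighted_fusion(vector_scores, bm25_scores, 0.3, 0.7)
print(vector_heavy, keyword_heavy)
```

The top result flips between the two weightings, and the normalization step has its own effect: a 0.03 spread in cosine similarity gets stretched across the full 0 to 1 range. Both are decisions someone has to tune and defend, which is the work RRF avoids.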

What we ship by default

For every new RAG project Sapota starts, the default retrieval config is hybrid with RRF. Not vector-only. Not BM25-only. Both, fused.

The implementation is straightforward in the vector databases worth using:

Qdrant has supported hybrid search natively since version 1.10. You define a sparse vector field alongside the dense vector field, populate both at index time, and query both at search time.
Weaviate has built-in hybrid search with a single hybrid query parameter.
Elasticsearch with the dense_vector field type can do this if you do not mind running Elasticsearch.

For teams already on a vector database that does not support hybrid (older Pinecone, Chroma, FAISS), we add a separate BM25 index using a lightweight library (rank_bm25 in Python is fine for under a million documents) and merge in application code. The latency overhead is around 30%. Worth it.

When pure vector is enough

We do not always recommend hybrid. The cases where vector-only is genuinely the right call:

The corpus is purely natural-language prose with no rare identifiers (general knowledge content, customer service FAQs that do not mention product names, internal policy documents).
The query distribution is genuinely paraphrase-heavy (the user asks the same thing fifty different ways and expects the system to know they mean the same thing).
The corpus is small enough (under 10,000 documents) that BM25 noise outweighs its precision benefit.

These are real cases. They are also rarer than teams assume. Most production corpora have some product names, some technical terms, some rare-but-critical identifiers. Hybrid handles all of them without sacrificing the semantic strength of dense vectors.

What changed for the founder

The diagnosis took fifteen minutes once we saw the eval failures. The fix was three changes:

Added a sparse vector field to the existing Qdrant collection.
Re-indexed using both the existing embedding model and a BM25 sparse encoder (Qdrant supports Bm25 as a built-in sparse model).
Switched the application's search call from a single dense query to a hybrid query with RRF fusion.
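The third change is the one most people ask about. Below is a sketch of the request body for a hybrid query, as we understand Qdrant's Query API (POST /collections/{collection_name}/points/query, available from version 1.10). The vector names "dense" and "sparse", the limits, and the example vectors are assumptions for illustration; check the shape against the documentation for your Qdrant version before relying on it.

```python
import json


def hybrid_query_body(dense_vector, sparse_indices, sparse_values,
                      limit=10, prefetch_limit=20):
    """Build a hybrid query body: two prefetches fused with RRF.

    Assumes the collection has a named dense vector "dense" and a named
    sparse vector "sparse" (hypothetical names).
    """
    return {
        "prefetch": [
            # Candidates from the dense (embedding) retriever.
            {"query": list(dense_vector), "using": "dense",
             "limit": prefetch_limit},
            # Candidates from the sparse (BM25-style) retriever.
            {"query": {"indices": list(sparse_indices),
                       "values": list(sparse_values)},
             "using": "sparse", "limit": prefetch_limit},
        ],
        # Fuse the two candidate lists with Reciprocal Rank Fusion.
        "query": {"fusion": "rrf"},
        "limit": limit,
    }


# Placeholder vectors; real ones come from the embedding model
# and the sparse encoder.
body = hybrid_query_body([0.1, 0.2, 0.3], [17, 942], [1.0, 1.0])
payload = json.dumps(body)
print(payload)
```

The previous dense-only call is effectively the first prefetch on its own. The change at the application layer is adding the second prefetch and the fusion clause.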

Total ship time: half a day. Recall on the failing query class went from below 30% to above 90%. Recall on the queries that had been working stayed flat. Latency increased from 80ms to 110ms at the p50.
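Numbers like these come from an eval set tagged by query class. A minimal sketch of that measurement follows, assuming one expected chunk per query and a retrieve function that returns ranked IDs; the eval cases and the stand-in retriever are invented for illustration and are not the founder's data.

```python
from collections import defaultdict


def recall_at_k_by_class(eval_set, retrieve, k=10):
    """Fraction of queries per class whose expected chunk is in the top k."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in eval_set:
        results = retrieve(case["query"])[:k]
        totals[case["query_class"]] += 1
        if case["expected_id"] in results:
            hits[case["query_class"]] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Hypothetical eval cases.
eval_set = [
    {"query": "GPT-4o pricing", "query_class": "exact_term", "expected_id": "c1"},
    {"query": "X100 error", "query_class": "exact_term", "expected_id": "c2"},
    {"query": "OpenAI model costs", "query_class": "paraphrase", "expected_id": "c1"},
]

# Stand-in retriever that misses one exact-term query.
canned = {
    "GPT-4o pricing": ["c9", "c8"],
    "X100 error": ["c2", "c7"],
    "OpenAI model costs": ["c1", "c3"],
}
report = recall_at_k_by_class(eval_set, lambda q: canned[q], k=2)
print(report)
```

Splitting the metric by class is what makes the failure visible. An aggregate recall number can look healthy while one class of queries fails most of the time.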

The founder asked why this had not been the default. The answer is that most tutorials and getting-started guides for vector databases lead with pure dense vector search because it is the new and exciting capability. BM25 dates back to the 1990s and is not exciting. Combining them is not worth a blog post on the vendor's own site, so most teams never learn that the combination is what production actually wants.

A note on sparse encoders

Beyond classic BM25, there is a newer class of "learned sparse" encoders (SPLADE, BGE-M3 sparse mode) that produce sparse vectors via a transformer model. These outperform BM25 on most benchmarks and are worth considering for a v2 system.

For v1, plain BM25 is fine. The marginal improvement from a learned sparse encoder is smaller than the improvement from adding any sparse retrieval at all.

If your search is missing what should be obvious

If your AI assistant cannot find content that you can prove is in the index by searching for the exact phrase, the failure is almost always at the retrieval layer, not the model. Sapota offers a one-week retrieval audit that produces the eval set, identifies the failing query classes, and ships the hybrid configuration as a working PR.

Reach out via the AI engineering page and send three example queries that are failing in production. The diagnosis usually takes one call.
