Hybrid Search

aka BM25 + Dense, Lexical + Semantic Retrieval

category: retrieval · status: mature

Intent

Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.

Example scenario

A coding assistant searches its codebase for 'how do we authenticate with Stripe?' Pure semantic search misses files that mention 'stripe-api-key' verbatim; pure keyword search misses files that talk about 'payment processor authentication'. Hybrid search runs both at once: a keyword scorer catches the exact tokens, an embedding scorer catches the conceptual matches, and a fusion step blends the two ranked lists.

Context

Queries vary: some are keyword-matchable (exact identifiers, names), others are phrased semantically; purely dense or purely sparse retrieval misses one class or the other.

Problem

Dense retrieval misses exact matches; sparse retrieval misses paraphrases. Each alone leaves recall on the table.

Forces

Solution

Index the corpus twice: BM25 for sparse, dense embeddings for semantic. At query time, retrieve top-k from each, fuse with Reciprocal Rank Fusion or weighted aggregation. Pass the fused top-N forward (typically into a reranker). Do not weight raw scores directly; use rank-based fusion (RRF) or score-normalised aggregation, since BM25 and dense scores live on incompatible scales.
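
A minimal sketch of the fusion step in plain Python, assuming each retriever has already produced a ranked list of document IDs (best first); the function name reciprocal_rank_fusion and the sample file names are illustrative, not taken from any particular library.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; score = sum of 1/(k + rank) per list."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # highest fused score first

sparse_hits = ["stripe_client.py", "auth.md", "billing.py"]           # BM25 top-k
dense_hits = ["auth.md", "payments_overview.md", "stripe_client.py"]  # embedding top-k
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
# auth.md and stripe_client.py, found by both retrievers, rise to the top

Because fusion happens on ranks, the incompatible BM25 and cosine score scales never need to be reconciled.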

Diagram

flowchart LR
  Q[Query] --> S[Sparse retrieval BM25]
  Q --> D[Dense retrieval embeddings]
  S --> F[Fusion / Reciprocal Rank Fusion]
  D --> F
  F --> R[Top-k results]

Hybrid Search runs sparse and dense retrievers in parallel and fuses their rankings before downstream stages.

Variants

Reciprocal Rank Fusion (RRF)

Each retriever produces a ranked list; a document's final score is the sum of 1/(k + rank) across the lists. Tuned via the constant k (≈60 in published work); see the worked example below.

Distinguishing factor: rank-based fusion, no score normalisation

When to use: Default when retrievers produce non-comparable scores (BM25 floats vs cosine similarities).
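
A quick worked example with k = 60: a document ranked 3rd by BM25 and 7th by the dense retriever scores 1/(60+3) + 1/(60+7) ≈ 0.0159 + 0.0149 ≈ 0.0308, while a document found by only one retriever, even at rank 1, scores 1/61 ≈ 0.0164. Documents surfaced by both lists tend to win.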

Weighted score fusion

Each retriever returns scores normalised to [0,1]; final score is a tunable convex combination (e.g. 0.4 * sparse + 0.6 * dense).

Distinguishing factor: score-based fusion, requires normalisation

When to use: When retrievers produce comparable scores or you have an eval set to tune weights against.
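
A minimal sketch of weighted score fusion, assuming raw per-document scores are available from each retriever; min_max_normalise and the 0.4/0.6 split are illustrative, and in practice the weights are tuned on an eval set.

def min_max_normalise(scores):
    """Rescale a {doc_id: raw_score} mapping to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # all-equal scores would otherwise divide by zero
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_fusion(sparse_scores, dense_scores, w_sparse=0.4, w_dense=0.6):
    """Convex combination of normalised scores; missing docs contribute 0."""
    sparse = min_max_normalise(sparse_scores)
    dense = min_max_normalise(dense_scores)
    fused = {doc: w_sparse * sparse.get(doc, 0.0) + w_dense * dense.get(doc, 0.0)
             for doc in set(sparse) | set(dense)}
    return sorted(fused, key=fused.get, reverse=True)

# b.md wins: bottom of the sparse list but top of the dense list
print(weighted_fusion({"a.md": 12.3, "b.md": 4.1}, {"b.md": 0.82, "c.md": 0.77}))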

Router-based hybrid

A classifier inspects the query and routes to sparse OR dense retrieval (not both). 'How do I X' goes dense; 'order #4471' goes sparse.

Distinguishing factor: one retriever per query

When to use: Latency budget cannot afford two retrievers and the query mix is bimodal.
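
A minimal sketch of the routing decision, assuming a cheap heuristic stands in for the classifier; the regex and retriever labels are illustrative, and a real system would more likely train a small classifier on labelled queries.

import re

# Exact-match signals: order-style IDs, quoted strings, code-ish tokens
LEXICAL_SIGNALS = re.compile(r'#\d+|"[^"]+"|\w+[-_.]\w+')

def route(query: str) -> str:
    """Pick exactly one retriever for this query."""
    return "sparse" if LEXICAL_SIGNALS.search(query) else "dense"

print(route("order #4471"))                          # sparse
print(route("how do we authenticate with Stripe?"))  # dense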

Applicability

Use when

Do not use when

Constraints

The retrieval set is the fusion of sparse and dense top-k; neither alone is the input to downstream stages.

Consequences

Benefits

Liabilities

Known Uses

Related Patterns

References