Hybrid Search

aka BM25 + Dense, Lexical + Semantic Retrieval

category: retrieval · status: mature

Intent

Combine sparse lexical retrieval (BM25) with dense vector retrieval and fuse the results.

Example scenario

A coding assistant searches its codebase for 'how do we authenticate with Stripe?' Pure semantic search misses files that mention 'stripe-api-key' verbatim; pure keyword search misses files that talk about 'payment processor authentication'. Hybrid search runs both at once: a keyword scorer catches the exact tokens, an embedding scorer catches the conceptual matches, and a fusion step blends the two ranked lists.

Context

Queries vary: some are keyword-matchable (exact identifiers, names), others are phrased semantically; purely dense or purely sparse retrieval misses one class or the other.

Problem

Dense retrieval misses exact matches; sparse retrieval misses paraphrases. Each alone leaves recall on the table.

Forces

Solution

Index the corpus twice: BM25 for sparse, dense embeddings for semantic. At query time, retrieve top-k from each, fuse with Reciprocal Rank Fusion or weighted aggregation. Pass the fused top-N forward (typically into a reranker). Do not weight raw scores directly; use rank-based fusion (RRF) or score-normalised aggregation, since BM25 and dense scores live on incompatible scales.
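
A minimal sketch of the fusion step in plain Python, assuming each retriever has already produced a ranked list of document IDs (best first); the function name reciprocal_rank_fusion and the sample file names are illustrative, not taken from any particular library.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; score = sum of 1/(k + rank) per list."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)  # highest fused score first

sparse_hits = ["stripe_client.py", "auth.md", "billing.py"]           # BM25 top-k
dense_hits = ["auth.md", "payments_overview.md", "stripe_client.py"]  # embedding top-k
print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
# auth.md and stripe_client.py, found by both retrievers, rise to the top

Because fusion happens on ranks, the incompatible BM25 and cosine score scales never need to be reconciled.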

Diagram

flowchart LR
  Q[Query] --> S[Sparse retrieval BM25]
  Q --> D[Dense retrieval embeddings]
  S --> F[Fusion / Reciprocal Rank Fusion]
  D --> F
  F --> R[Top-k results]

Hybrid Search runs sparse and dense retrievers in parallel and fuses their rankings before downstream stages.

Variants

Reciprocal Rank Fusion (RRF)

Each retriever produces a ranked list; a document's final score is the sum of 1/(k + rank) across the lists. Tuned via the constant k (≈60 in published work); see the worked example below.

Distinguishing factor: rank-based fusion, no score normalisation

When to use: Default when retrievers produce non-comparable scores (BM25 floats vs cosine similarities).
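
A quick worked example with k = 60: a document ranked 3rd by BM25 and 7th by the dense retriever scores 1/(60+3) + 1/(60+7) ≈ 0.0159 + 0.0149 ≈ 0.0308, while a document found by only one retriever, even at rank 1, scores 1/61 ≈ 0.0164. Documents surfaced by both lists tend to win.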

Weighted score fusion

Each retriever returns scores normalised to [0,1]; final score is a tunable convex combination (e.g. 0.4 * sparse + 0.6 * dense).

Distinguishing factor: score-based fusion, requires normalisation

When to use: When retrievers produce comparable scores or you have an eval set to tune weights against.
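
A minimal sketch of weighted score fusion, assuming raw per-document scores are available from each retriever; min_max_normalise and the 0.4/0.6 split are illustrative, and in practice the weights are tuned on an eval set.

def min_max_normalise(scores):
    """Rescale a {doc_id: raw_score} mapping to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # all-equal scores would otherwise divide by zero
    return {doc: (s - lo) / span for doc, s in scores.items()}

def weighted_fusion(sparse_scores, dense_scores, w_sparse=0.4, w_dense=0.6):
    """Convex combination of normalised scores; missing docs contribute 0."""
    sparse = min_max_normalise(sparse_scores)
    dense = min_max_normalise(dense_scores)
    fused = {doc: w_sparse * sparse.get(doc, 0.0) + w_dense * dense.get(doc, 0.0)
             for doc in set(sparse) | set(dense)}
    return sorted(fused, key=fused.get, reverse=True)

# b.md wins: bottom of the sparse list but top of the dense list
print(weighted_fusion({"a.md": 12.3, "b.md": 4.1}, {"b.md": 0.82, "c.md": 0.77}))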

Router-based hybrid

A classifier inspects the query and routes to sparse OR dense retrieval (not both). 'How do I X' goes dense; 'order #4471' goes sparse.

Distinguishing factor: one retriever per query

When to use: Latency budget cannot afford two retrievers and the query mix is bimodal.
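
A minimal sketch of the routing decision, assuming a cheap heuristic stands in for the classifier; the regex and retriever labels are illustrative, and a real system would more likely train a small classifier on labelled queries.

import re

# Exact-match signals: order-style IDs, quoted strings, code-ish tokens
LEXICAL_SIGNALS = re.compile(r'#\d+|"[^"]+"|\w+[-_.]\w+')

def route(query: str) -> str:
    """Pick exactly one retriever for this query."""
    return "sparse" if LEXICAL_SIGNALS.search(query) else "dense"

print(route("order #4471"))                          # sparse
print(route("how do we authenticate with Stripe?"))  # dense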

Applicability

Use when

Do not use when

Constraints

The retrieval set is the fusion of sparse and dense top-k; neither alone is the input to downstream stages.

Consequences

Benefits

Liabilities

Known Uses

Related Patterns

References