Hybrid Search: BM25 Meets Dense Vectors

Dense vector search, the thing this whole series has been building on, has a stupid failure mode: it can't find an exact word. Search a vector index for error code SQLSTATE 23505 and it will cheerfully return passages about database errors in general, ranked by how vibes-similar they feel — while the one chunk that literally contains 23505 sits in eighth place, or off the list entirely. The model that's so good at understanding meaning is oddly bad at matching a string.

The fix isn't a better embedding. It's admitting that the thirty-year-old keyword search you skipped past is still better at some things than the neural network, and running both.

Stop treating lexical search as obsolete

BM25, the ranking function under Elasticsearch, Lucene, and basically every search bar built before 2020, scores a document by how often the query's terms appear in it, damped so the tenth occurrence counts less than the first, weighted so rare terms count more than common ones, and adjusted for length so a long document doesn't win just by containing more words. That's it. No semantics, no embeddings, no GPU. It does not know that "car" and "automobile" are related.
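To make that description concrete, here is a minimal sketch of textbook BM25 over pre-tokenized documents. It is an illustration, not what Lucene or Elasticsearch actually run: the function name bm25_scores, the whitespace tokenization, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query (textbook BM25 sketch)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue  # no shared string, no credit
            # Rare terms weigh more (this IDF variant is always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates, and long documents are discounted.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical three-document corpus, tokenized on whitespace.
docs = [
    "duplicate key violates unique constraint sqlstate 23505".split(),
    "common database errors and how to read them".split(),
    "car rental policy".split(),
]
exact = bm25_scores("sqlstate 23505".split(), docs)
synonym = bm25_scores("automobile".split(), docs)
```

Only the first document gets a nonzero score for the error-code query, and the "automobile" query scores zero everywhere, including against the document that says "car". That is the literal-match behavior in miniature.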

And it is hard to beat at the things dense retrieval fumbles: exact identifiers, error codes, product SKUs, function names, proper nouns, legal citations, the rare technical term that appears nowhere in the embedding model's training distribution. If the user's word and the document's word are the same string (and the tokenizer treats it the same way on both sides), BM25 finds it. No approximation, no "close in vector space." Match or no match.

The two methods fail in opposite directions, and that's the whole reason to combine them:

Dense retrieval generalizes. "How do I cancel my plan?" finds a passage titled "Terminating your subscription" with zero shared keywords. It also drifts — returns topically-adjacent fluff, misses exact strings.
Lexical retrieval is literal. It nails the exact string and whiffs the moment the user paraphrases or the document uses a synonym.

Run them together and each one's strength covers the other's blind spot. That's hybrid search.

Don't try to add the scores together

The obvious idea — normalize both scores to 0–1 and average them — is a trap, and it's worth understanding why so you don't reinvent it.

BM25 scores are unbounded and corpus-dependent; a great match might score 18, or 4, depending on term rarity. Cosine similarities live in a tight band, often clustered somewhere like 0.6 to 0.9 (the exact range depends on the embedding model), where everything looks similar. The two distributions have nothing to do with each other. Min-max normalize them and the result swings wildly with whatever the top and bottom scores happened to be in this query's result set. You're averaging two numbers that aren't measuring the same thing in the same units. Sometimes it works; you won't know when it doesn't.
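A small sketch of the instability, using made-up BM25 scores. The helper min_max and every number here are hypothetical. The point is only that the same raw score normalizes to very different values depending on what else happened to be in the result set.

```python
def min_max(scores):
    """Rescale a list of scores to the 0-1 range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Documents A and B score 18 and 12 in both result sets.
# Only the third, bottom-of-list result differs.
with_weak_tail = min_max([18.0, 12.0, 2.0])    # straggler scores 2
with_close_tail = min_max([18.0, 12.0, 11.0])  # straggler scores 11

b_with_weak_tail = with_weak_tail[1]    # (12 - 2) / (18 - 2) = 0.625
b_with_close_tail = with_close_tail[1]  # (12 - 11) / (18 - 11), about 0.143
```

Document B matched the query exactly as well in both cases, yet its normalized score moved from 0.625 to roughly 0.143 because of an unrelated document at the bottom of the list. Average that with a cosine similarity and B's final position depends on the straggler.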

Fuse on rank, not on score

The trick that's held up since 2009 is Reciprocal Rank Fusion, and its appeal is that it throws the scores away entirely and keeps only the ordering. Each list votes for a document based on where it ranked — first place is worth more than tenth — and the votes add up across lists.

[Figure: BM25 and dense search each produce a ranked list, fused by Reciprocal Rank Fusion. Caption: Two retrievers that fail in opposite directions, merged on rank, not score.]

The formula is one line. A document's fused score is the sum, over every list it appears in, of 1 / (k + rank), where rank is its position in that list and k is a smoothing constant — 60 is the value from the original paper and the one everyone still uses.

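def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids; return doc ids sorted best-first."""
    scores = {}
    for ranking in ranked_lists:          # each is an ordered list of doc ids
        for rank, doc_id in enumerate(ranking, start=1):   # ranks start at 1
            # Each list votes 1 / (k + rank); votes add up across lists.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)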

Why this beats score-averaging: it's scale-free. A document doesn't need to win on raw score; it needs to rank high in multiple lists. A chunk that lands top-5 in both lexical and dense search rockets to the top — that's the strong signal, agreement across two methods that fail differently. A chunk that's first in one list and absent from the other still scores respectably. The k term keeps the top ranks from utterly dominating, so a clear #1 doesn't bury every consensus pick below it. No normalization, no per-query tuning, no distributions to wrangle.
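Here is the arithmetic on a toy pair of lists, as a sketch. The doc ids are invented for illustration, and rrf_scores is the same fusion logic as in the post, returning the raw scores so they can be inspected.

```python
def rrf_scores(ranked_lists, k=60):
    """Return {doc_id: fused RRF score} for a list of ranked doc-id lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return scores

# Hypothetical results: "consensus" is only third in each list,
# but it is the one document both retrievers agree on.
lexical = ["exact_match", "lex_only", "consensus"]
dense = ["paraphrase", "dense_only", "consensus"]

scores = rrf_scores([lexical, dense])
fused = sorted(scores, key=scores.get, reverse=True)

# consensus:   1/63 + 1/63, about 0.0317
# exact_match: 1/61,        about 0.0164
best = fused[0]  # "consensus"
```

Third place in both lists beats first place in one, which is the agreement signal doing its job. The single-list winners are not discarded: each keeps its 1/61 and sits right behind the consensus pick.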

If you do want a thumb on the scale (weight dense higher for conceptual corpora, lexical higher for code or identifiers), use a weighted RRF, multiplying each list's contribution by a coefficient. But reach for plain RRF first. Most of the time the unweighted version is already most of the win, and the weighted version should earn its place with a measured gain on your own queries.
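A weighted variant might look like the sketch below. The function name weighted_rrf, the weights argument, and the example doc ids are assumptions for illustration. The only change from plain RRF is that each list's vote is multiplied by that list's coefficient.

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """RRF where each list's contribution is scaled by a per-list weight."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for a code-search query.
lexical = ["fn_parse_config", "readme", "changelog"]
dense = ["config_overview", "fn_parse_config", "readme"]

plain = weighted_rrf([lexical, dense], weights=[1.0, 1.0])
code_heavy = weighted_rrf([lexical, dense], weights=[2.0, 1.0])
```

With equal weights the order is fn_parse_config, readme, config_overview, changelog. Doubling the lexical weight leaves the top two alone and lifts the lexical-only changelog above the dense-only config_overview. Weighting mostly reshuffles the single-list results, which is a reasonable thing to check against real queries before committing to it.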

Build it without standing up a second database

You don't necessarily need Elasticsearch alongside your vector store. By late 2025 the lines have blurred:

Postgres runs both sides in one box: pgvector for dense, and for lexical either the built-in full-text search (whose ts_rank scoring is not BM25) or a BM25 extension such as ParadeDB's. You fuse in SQL.
Qdrant, Weaviate, Milvus ship native hybrid search: you hand them the query, they run sparse and dense internally and return a fused list.
OpenSearch / Elasticsearch went the other way, adding vector fields to the lexical engine they already were.

Whichever you pick, the architecture is the same: two retrievers, RRF in the middle, one merged list out. Retrieve maybe the top 20–50 from each method before fusing — you want enough depth that a document buried in one list still gets a chance to be rescued by the other.
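The shape of that pipeline, sketched with stand-in retrievers. Everything named here is hypothetical: hybrid_search, the (query, limit) retriever signature, and the two fake retrievers, which in a real system would wrap whatever lexical and vector queries your store exposes.

```python
from typing import Callable, List

# A retriever takes (query, limit) and returns doc ids, best first.
Retriever = Callable[[str, int], List[str]]

def hybrid_search(query: str, lexical: Retriever, dense: Retriever,
                  depth: int = 50, top_n: int = 10, k: int = 60) -> List[str]:
    """Pull `depth` results from each retriever, fuse with RRF, keep `top_n`."""
    scores = {}
    for retrieve in (lexical, dense):
        for rank, doc_id in enumerate(retrieve(query, depth), start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Stand-ins with canned results, so the sketch runs without a database.
def fake_lexical(query: str, limit: int) -> List[str]:
    return ["doc_23505", "doc_keys", "doc_constraints"][:limit]

def fake_dense(query: str, limit: int) -> List[str]:
    return ["doc_db_errors", "doc_constraints", "doc_23505"][:limit]

results = hybrid_search("SQLSTATE 23505", fake_lexical, fake_dense, top_n=3)
```

Here doc_23505 is first for lexical but only third for dense, and the fusion still puts it on top because it shows up in both lists. The depth parameter is the knob the paragraph above describes: fetch too few candidates per side and a document buried in one list never gets the second vote.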

When it's worth it, and when it isn't

Hybrid search is close to a default for general corpora, and the gain is largest exactly where pure vector search hurts most: documents thick with names, codes, jargon, and acronyms — which describes most real enterprise content. Anthropic's contextual-retrieval work folded BM25 into its pipeline for precisely this reason and reported it as a meaningful chunk of the improvement.

It's not free. You maintain two indexes, fuse on every query, and add a little latency. For a small corpus of clean prose where users ask in full, natural sentences, dense-only might already be enough, and the second index is complexity you don't need yet. And hybrid search still only retrieves — it gets the right chunks into the candidate set. Ordering those candidates so the best one lands first is a different job, and it belongs to the reranker. That's the next post.

The mindset shift is the takeaway: lexical search isn't the old thing you replaced with embeddings. It's the other half of a pair. The systems that retrieve well in 2025 didn't pick the neural method over the keyword method — they stopped pretending it was a choice.