
Thread Transfer

Hybrid search in 2025: Combining semantic + BM25 for production

Semantic search misses exact matches. BM25 misses intent. Hybrid search gets you both. Implementation guide inside.

Jorgo Bardho

Founder, Thread Transfer

March 22, 2025 · 11 min read
Tags: hybrid search, BM25, semantic search, production RAG
[Figure: Hybrid search architecture]

Semantic search misses exact matches. BM25 misses intent. Hybrid search combines both and delivers 30-40% better retrieval accuracy than either alone. Teams shipping production RAG in 2025 don't pick one or the other—they fuse results from both and let re-ranking decide. This guide walks you through architecture, fusion methods, and tuning strategies that work at scale.

The limits of semantic-only and keyword-only search

Semantic search (vector embeddings) captures intent and meaning. User asks "How do I reset my password?" and it retrieves docs about account recovery, even if they don't use the word "password." But semantic search fails on:

  • Exact matches (product names, error codes, API endpoints like /api/v2/users)
  • Rare terms that don't embed well ("OAuth2" vs "authentication")
  • Queries with specific constraints ("Python 3.11 bug" should match "3.11" exactly, not "3.10" or "3.12")

BM25 (keyword search) nails exact matches. User searches "error 429" and BM25 finds every doc containing "429." But BM25 fails on:

  • Synonym mismatches ("API limits" won't match docs that say "rate limiting")
  • Intent-based queries ("Why can't I log in?" returns nothing unless docs use those exact words)
  • Typos and variations ("authetication" vs "authentication")

Hybrid search runs both retrieval methods in parallel, fuses results, and re-ranks. You get the precision of BM25 plus the recall of semantic search.

Hybrid architecture: Dual-index retrieval + fusion

The canonical hybrid search stack looks like this:

  1. Dual indexing: Chunk your docs once. Index chunks in both a vector DB (Pinecone, Weaviate, Qdrant) and a keyword search engine (Elasticsearch, OpenSearch, or in-memory BM25 via Tantivy/Lucene).
  2. Parallel retrieval: At query time, embed the query and retrieve top-k from vector DB (e.g., k=20). Simultaneously, run BM25 on the keyword index and retrieve top-k (e.g., k=20).
  3. Result fusion: Merge the two result sets using RRF (Reciprocal Rank Fusion) or a weighted linear combination. This produces a single deduplicated ranked list of up to 2×k candidates (40 in this example).
  4. Re-ranking: Pass the fused list through a cross-encoder (Cohere Rerank, BGE Reranker, or custom model) to compute query-document relevance scores. Return top-5 or top-10.

Most vector DBs now support hybrid search natively (Weaviate, Qdrant, Pinecone). If yours doesn't, run BM25 separately and fuse in your application layer.

Fusion methods: RRF vs weighted sum

Reciprocal Rank Fusion (RRF): For each result, compute 1 / (rank + k) where k=60 is standard. Sum scores from semantic and BM25 lists. Rank by total score. RRF is parameter-free, works out of the box, and performs well across diverse query types. Start here.
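RRF is simple enough to implement in a few lines. Here's a minimal sketch (the function name and list-of-IDs inputs are illustrative, not from any particular library):

```python
def rrf_fuse(semantic_ids, bm25_ids, k=60):
    """Reciprocal Rank Fusion: each doc scores 1/(rank + k) per list; sum across lists."""
    scores = {}
    for results in (semantic_ids, bm25_ids):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document appearing in both lists accumulates two reciprocal-rank contributions, so it outranks a document that appears in only one list at a similar position.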

Weighted linear combination: Assign weights to semantic vs BM25 scores (e.g., 0.7 semantic + 0.3 BM25). Requires normalization, since cosine similarities fall in a bounded range while BM25 scores are unbounded. Use min-max or z-score normalization, and tune the weights on a test set. More flexible than RRF but requires tuning.
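The normalization step is where most weighted-fusion bugs hide, so here is a sketch with min-max normalization applied before mixing (dict-of-scores inputs and function names are illustrative):

```python
def minmax(scores):
    """Min-max normalize a {doc_id: raw_score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {d: (s - lo) / span for d, s in scores.items()}

def weighted_fuse(semantic, bm25, w_sem=0.7, w_bm25=0.3):
    """Linear combination of normalized semantic and BM25 scores."""
    sem_n, bm_n = minmax(semantic), minmax(bm25)
    docs = set(sem_n) | set(bm_n)
    fused = {d: w_sem * sem_n.get(d, 0.0) + w_bm25 * bm_n.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

Note that a document missing from one list simply contributes 0 from that side; some teams instead impute the list's minimum score, which is a tuning choice of its own.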

Which to use: Start with RRF. If your test set shows semantic queries vastly outnumber exact-match queries (or vice versa), tune weights with linear combination. Otherwise RRF is simpler and performs within 2-3% of optimal weights.

Implementation: From vector-only to hybrid in 4 steps

Step 1: Add BM25 index. If you're using Weaviate or Qdrant, enable hybrid search in config. Otherwise, spin up Elasticsearch or build an in-memory BM25 index with Tantivy. Index the same chunks you already embedded.
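If you're rolling your own keyword side, the Okapi BM25 scoring it relies on is short enough to sketch in plain Python. Production systems should use Tantivy or Elasticsearch, but this shows what's being computed (k1=1.5 and b=0.75 are common defaults; the class and whitespace tokenizer are illustrative):

```python
import math
from collections import Counter

class BM25:
    """Minimal in-memory Okapi BM25, for illustration only."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [doc.lower().split() for doc in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.df = Counter(t for d in self.docs for t in set(d))  # document frequency

    def score(self, query, idx):
        doc = self.docs[idx]
        tf = Counter(doc)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            f = tf[term]
            # Term frequency saturates via k1; b controls length normalization
            s += idf * f * (self.k1 + 1) / (f + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl))
        return s

    def search(self, query, k=10):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```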

Step 2: Implement parallel retrieval. At query time, fire off semantic and BM25 searches in parallel. Retrieve top-20 from each. Wait for both to return (usually 50-150ms each).
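The parallel fan-out can be as simple as a two-worker thread pool; total latency is then roughly the slower of the two searches rather than their sum. The retriever bodies below are placeholders for your actual vector DB and BM25 calls:

```python
from concurrent.futures import ThreadPoolExecutor

def vector_search(query, k=20):
    # Placeholder: embed `query` and hit your vector DB here.
    return [f"chunk_{i}" for i in range(k)]

def bm25_search(query, k=20):
    # Placeholder: run BM25 against your keyword index here.
    return [f"chunk_{i}" for i in range(5, 5 + k)]

def retrieve(query, k=20):
    """Fire both searches concurrently and wait for both result sets."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        sem_future = pool.submit(vector_search, query, k)
        bm_future = pool.submit(bm25_search, query, k)
        return sem_future.result(), bm_future.result()
```

If your stack is already async, `asyncio.gather` over two awaitable clients achieves the same overlap.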

Step 3: Fuse with RRF. Deduplicate results by chunk ID. Apply RRF scoring. Sort by fused score. Take top-20 candidates.

Step 4: Re-rank. Pass top-20 to Cohere Rerank API or run a local cross-encoder. Return top-5. Measure retrieval precision on your test set. Expect 25-35% improvement over semantic-only.

Tuning guide: When to favor semantic vs keyword

Hybrid search isn't one-size-fits-all. Tune fusion weights based on query distribution:

  • 80%+ queries are intent-based (how-to, why, what-is): Weight semantic higher (0.7 semantic, 0.3 BM25). Or stick with RRF.
  • 30%+ queries contain exact terms (error codes, product names, version numbers): Raise the BM25 weight (e.g., 0.5 semantic, 0.5 BM25, or higher for BM25) or tune per-query with a classifier.
  • Query length matters: Short queries (1-3 words) benefit more from BM25. Long queries (8+ words) benefit more from semantic. Consider dynamic weighting based on query length.
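Dynamic weighting by query length can start as a simple heuristic before you invest in a trained classifier. The thresholds and weights below are illustrative placeholders to be tuned on your own query logs:

```python
def fusion_weights(query):
    """Heuristic (semantic_weight, bm25_weight) based on query length.
    Thresholds are illustrative; calibrate on your query distribution."""
    n = len(query.split())
    if n <= 3:
        return 0.4, 0.6   # short, likely exact-match query: favor BM25
    if n >= 8:
        return 0.8, 0.2   # long, intent-heavy query: favor semantic
    return 0.6, 0.4       # middle ground
```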

Measure precision@5 and recall@10 on your test set for different weight configurations. Pick the config that maximizes F1 score.

Production considerations: Cost, latency, and index drift

Latency: Parallel retrieval adds minimal latency (both searches run concurrently). Re-ranking adds 100-200ms for Cohere API or 20-50ms for local cross-encoder. Total retrieval time: 200-400ms, well within acceptable bounds for most use cases.

Cost: You pay for two indexes (vector + keyword). Vector storage is ~$0.10-0.50 per 1M chunks/month. Elasticsearch/OpenSearch is similar. Re-ranking via Cohere costs ~$1 per 1k searches. Budget $50-200/month for a 100k chunk corpus with moderate query volume.

Index drift: When you update docs, update both indexes. Most teams batch updates daily or hourly. Track version hashes to ensure vector and keyword indexes stay in sync. Out-of-sync indexes cause fusion to return stale results.
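One way to track those version hashes is to store a content hash alongside each chunk in both indexes and diff the metadata periodically. A minimal sketch (function names and the dict-shaped metadata are assumptions about your setup):

```python
import hashlib

def chunk_hash(chunk_text):
    """Stable content hash stored with each chunk in both indexes."""
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()

def out_of_sync(vector_meta, keyword_meta):
    """Compare {chunk_id: hash} metadata from each index; return drifted chunk IDs."""
    all_ids = set(vector_meta) | set(keyword_meta)
    return sorted(
        cid for cid in all_ids
        if vector_meta.get(cid) != keyword_meta.get(cid)
    )
```

Chunks flagged here get re-indexed on both sides in the next batch update.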

Case study: 38% accuracy improvement with hybrid + re-ranking

A SaaS team with 50k docs (Confluence + Zendesk) was running semantic-only RAG. Precision@5 was 52%. They shipped hybrid search + Cohere re-ranking:

  • Added Elasticsearch for BM25 (same 50k chunks)
  • Implemented RRF fusion (no weight tuning)
  • Added Cohere Rerank for top-20 → top-5
  • Re-measured: Precision@5 jumped to 72% (38% relative improvement)

Total implementation time: 2 weeks. Ongoing cost: $80/month (Elasticsearch + Cohere). ROI: Support ticket deflection increased from 42% to 61%, saving ~$15k/month in support costs.

Bottom line: Hybrid search is table stakes for production RAG in 2025. Semantic-only and BM25-only both leave accuracy on the table. Combine them, fuse with RRF, and re-rank. Expect 30-40% better retrieval with 2 weeks of engineering work.