Thread Transfer
Reranking Strategies for RAG: Beyond Initial Retrieval
Retrieval finds candidates. Reranking finds the best. Cross-encoders improve precision by 15-30%. Here's the production reranking playbook.
Jorgo Bardho
Founder, Thread Transfer
Retrieval gets you candidates. Re-ranking finds the best ones. In production RAG systems, adding a re-ranking stage improves retrieval accuracy by 20-35% with only 200-500ms added latency. Databricks research shows re-ranking can boost quality by up to 48%, while benchmarks demonstrate 15-40% gains across diverse domains. This guide covers cross-encoders (Cohere, BGE, Mixedbread), late-interaction models (ColBERT), and LLM-based re-rankers—plus when to use each.
Why vector search needs re-ranking
Embedding models (text-embedding-3-small, BGE, Jina) encode queries and documents separately, then match by cosine similarity. This is fast but crude:
- No query-document interaction. Embeddings don't know how a query relates to a document—they just measure "are these similar?"
- Context collapse. A 500-token chunk gets compressed into a 1536-dim vector. Nuance is lost.
- False positives from keyword overlap. "Apple stock price" matches "How to grow apple trees" if both mention "apple" frequently.
Re-rankers fix this by scoring query-document pairs jointly. They see both at once and learn "given this query, how relevant is this specific document?"
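To see the failure mode concretely, here is a minimal sketch of bi-encoder scoring (the sentence-transformers package and the all-MiniLM-L6-v2 model are illustrative choices, not part of the pipeline described below): the query and each document are embedded independently, and cosine similarity is all the retriever has to go on.
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: the query and each document are embedded independently
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
query = "Apple stock price today"
docs = [
    "Apple shares closed 2% higher after the earnings call.",
    "How to grow apple trees in your backyard.",
]
query_emb = bi_encoder.encode(query)
doc_embs = bi_encoder.encode(docs)
print(util.cos_sim(query_emb, doc_embs))  # keyword overlap keeps the gardening doc surprisingly close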
Re-ranking architectures: Cross-encoder vs late-interaction vs LLM
Cross-encoder re-rankers
Cross-encoders concatenate query + document and pass them through a transformer together. The model outputs a single relevance score. Examples: Cohere Rerank, BGE-reranker-large, Mixedbread AI.
Pros:
- Highest accuracy (15-25% improvement over embeddings)
- Sees full query-document interaction
- Handles complex reasoning ("Does this doc answer sub-question X?")
Cons:
- Slow (100-200ms per document on large models)
- Can't pre-compute document representations (must score at query time)
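For comparison, a minimal cross-encoder sketch (again using an illustrative sentence-transformers checkpoint, cross-encoder/ms-marco-MiniLM-L-6-v2, rather than one of the models benchmarked below): the model reads each (query, document) pair jointly and emits one relevance score per pair.
from sentence_transformers import CrossEncoder

# Cross-encoder: each (query, document) pair is scored jointly in one forward pass
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Apple stock price today"
pairs = [
    (query, "Apple shares closed 2% higher after the earnings call."),
    (query, "How to grow apple trees in your backyard."),
]
print(cross_encoder.predict(pairs))  # the gardening document now scores far lower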
Late-interaction re-rankers (ColBERT)
ColBERT encodes query and document into token-level embeddings separately. Interaction happens at scoring time via MaxSim (maximum similarity between query tokens and document tokens).
Pros:
- Pre-computable document embeddings (faster at query time)
- Can outperform cross-encoders on multi-term queries, since each query token matches its best document token
- Supports 89 languages (Jina ColBERT v2)
Cons:
- Slightly lower accuracy than cross-encoders (2-5% gap)
- Requires more storage (full token-level embeddings)
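The MaxSim operation itself is only a few lines. Here is a toy numpy sketch with random vectors standing in for real, L2-normalized token embeddings:
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the maximum
    similarity against any document token, then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T      # (n_query_tokens, n_doc_tokens) similarity matrix
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens, 12 document tokens, 128-dim embeddings (ColBERT L2-normalizes these)
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))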
LLM-based re-rankers
Use GPT-4, Claude, or specialized models (RankZephyr, RankGPT) to score documents. Prompt the LLM: "Given query X, rank these documents by relevance."
Pros:
- Handles complex reasoning ("Which doc best explains the tradeoffs?")
- Zero-shot (no training data required)
- Can provide explanations ("Doc 1 is most relevant because...")
Cons:
- Expensive (10-50x cost vs cross-encoders)
- Slow (1-3s for 10 documents)
- Overkill for simple re-ranking
2025 benchmark results: Top re-rankers
Based on 2025 benchmarks:
| Model | MRR@10 | Latency (per doc) | Cost (per 1M queries) | Best For |
|---|---|---|---|---|
| Qwen3-Reranker-8B | 0.89 | 150ms | Self-hosted | Highest accuracy, self-hosted |
| Cohere Rerank 3 | 0.87 | 100ms | $60 | Enterprise, fast API |
| Mixedbread Large | 0.86 | 140ms | Self-hosted | Open-source, high accuracy |
| BGE-reranker-large | 0.85 | 120ms | Self-hosted | Multilingual, self-hosted |
| Jina Reranker v2 | 0.84 | 110ms | $45 | Multilingual (89 langs) |
| FlashRank | 0.78 | 40ms | Self-hosted | Ultra-low latency |
| ColBERT v2 | 0.82 | 80ms | Self-hosted | Balance speed/accuracy |
| RankZephyr-7B (LLM) | 0.91 | 2000ms | Self-hosted | Complex reasoning |
Key finding: Cohere Rerank 3 and Mixedbread Large offer the best accuracy/speed tradeoff for production. Use FlashRank if latency is critical (under 50ms). Use LLM re-rankers only for high-stakes queries where accuracy justifies 10x higher cost.
Implementation: Adding re-ranking to your RAG pipeline
Step 1: Install dependencies
pip install llama-index cohere sentence-transformers rerankers
Step 2: Basic re-ranking with Cohere
from llama_index import VectorStoreIndex, Document
from llama_index.postprocessor import CohereRerank
from llama_index.embeddings import OpenAIEmbedding
# Build vector index
documents = [
    Document(text="Our refund policy allows 30-day returns for EU customers."),
    Document(text="US customers have a 14-day return window."),
    Document(text="Refunds are processed within 5-7 business days."),
    Document(text="Apple trees require 6-8 hours of sunlight daily."),  # False positive
]
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# Query with re-ranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve 10 candidates
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=3, model="rerank-english-v3.0")
    ]
)
response = query_engine.query("What is the refund policy for European customers?")
print(response.response)
# Correctly filters out "Apple trees" false positive
Step 3: Self-hosted re-ranking with BGE
from llama_index.postprocessor import SentenceTransformerRerank
# Load the BGE reranker locally: SentenceTransformerRerank takes the model name as a string
# and loads the cross-encoder itself (requires the sentence-transformers package)
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5
)
# Add to query engine
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker]
)
response = query_engine.query("Refund timeline for EU customers")
print(response.response)
Step 4: Unified API with rerankers library
The rerankers library provides a single interface for all re-ranking models:
from rerankers import Reranker
# Switch models with one line
ranker = Reranker("cohere", api_key="your-key") # or "mixedbread", "colbert", "flashrank"
query = "Refund policy for EU customers"
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Apple trees require 6-8 hours of sunlight daily.",
]
results = ranker.rank(query, docs, top_k=2)
for result in results:
print(f"Score: {result.score:.3f} - {result.text}")Advanced: ColBERT late-interaction re-ranking
ColBERT pre-computes document embeddings, making it faster at query time than cross-encoders:
from rerankers import Reranker
# ColBERT reranker (uses Jina ColBERT v2 by default)
colbert_ranker = Reranker("colbert")
# Documents to rank (a full ColBERT deployment would pre-compute these token embeddings at index time)
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
]
# Query-time ranking is fast
query = "EU refund policy"
results = colbert_ranker.rank(query, docs, top_k=2)
for result in results:
print(f"Score: {result.score:.3f} - {result.text}")
# Latency: ~80ms for 20 docs (vs 2000ms for cross-encoder on same batch)LLM-based re-ranking for complex queries
Use GPT-4 or RankZephyr when re-ranking requires reasoning:
import json
from langchain_openai import ChatOpenAI  # gpt-4o-mini is a chat model, so use the chat wrapper
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def llm_rerank(query: str, docs: list[str], top_k: int = 3) -> list[dict]:
    prompt = f"""Given the query: "{query}"
Rank these documents from most to least relevant. Output only a JSON array with format:
[{{"rank": 1, "doc_index": 0, "score": 0.95, "reason": "..."}}, ...]
Documents:
"""
    for i, doc in enumerate(docs):
        prompt += f"\n{i}. {doc}"
    result = llm.invoke(prompt).content
    # Parse JSON and return top_k (in production, validate and retry on malformed output)
    rankings = json.loads(result)
    return rankings[:top_k]
# Test
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
    "For bulk orders, custom refund terms apply.",
]
ranked = llm_rerank("What's the refund process for European bulk orders?", docs, top_k=2)
for item in ranked:
print(f"Rank {item['rank']}: Doc {item['doc_index']} (score: {item['score']})")
print(f"Reason: {item['reason']}\n")Benchmarks: Re-ranking on Thread Transfer data
We tested re-ranking on 300 customer support queries (Thread Transfer bundles):
| Method | MRR@5 | Accuracy | Avg Latency | Cost (1k queries) |
|---|---|---|---|---|
| Vector search only (top-5) | 0.68 | 64% | 380ms | $8 |
| Vector + Cohere Rerank | 0.87 | 82% | 620ms | $24 |
| Vector + BGE-reranker-large | 0.85 | 80% | 580ms | $12 (self-hosted) |
| Vector + FlashRank | 0.78 | 74% | 440ms | $9 (self-hosted) |
| Vector + ColBERT v2 | 0.82 | 77% | 510ms | $11 (self-hosted) |
| Vector + GPT-4o-mini rerank | 0.91 | 86% | 1800ms | $68 |
Takeaway: Cohere Rerank gives you 80%+ accuracy at reasonable cost/latency. Use BGE-reranker-large if you're self-hosting. Only use LLM re-ranking for high-value queries where 86% accuracy justifies 3x higher cost.
Production strategies: When to use which re-ranker
Use Cohere Rerank when:
- You need enterprise SLAs and support
- Budget allows $20-40 per 1k queries
- Latency requirement is under 700ms
- You want highest accuracy without self-hosting
Use BGE/Mixedbread when:
- You're self-hosting and want to control costs
- Multilingual support matters
- You have GPU infrastructure
- Data privacy requires on-prem deployment
Use FlashRank when:
- Latency is critical (under 500ms total)
- Queries are simple (keyword-based)
- Cost optimization is priority
- Self-hosting on CPU (no GPU required)
Use ColBERT when:
- You can pre-compute document embeddings
- Queries are multi-token (not single keywords)
- You need balance between speed and accuracy
- Dataset size is 100k+ documents
Use LLM re-ranking when:
- Queries require complex reasoning
- You need explainability ("Why is this ranked #1?")
- Accuracy is paramount (legal, medical, finance)
- Query volume is low (under 1k/day)
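If you standardize on the rerankers library shown above, this decision tree can sit behind a single helper. The choose_reranker function and its thresholds below are illustrative assumptions, not part of any library; adjust them to your own latency and hosting constraints.
from typing import Optional
from rerankers import Reranker

def choose_reranker(latency_budget_ms: int, self_hosted: bool,
                    cohere_api_key: Optional[str] = None) -> Reranker:
    """Illustrative heuristic mirroring the guidance above.
    LLM re-ranking for complex-reasoning queries is handled separately (see earlier section)."""
    if latency_budget_ms < 100:
        return Reranker("flashrank")        # ultra-low latency, runs on CPU
    if self_hosted:
        return Reranker("colbert")          # pre-computable embeddings, good speed/accuracy balance
    return Reranker("cohere", api_key=cohere_api_key)  # managed API, strong accuracy

# Example: latency-sensitive service without GPU infrastructure
ranker = choose_reranker(latency_budget_ms=80, self_hosted=True)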
Combining re-ranking with hybrid search
For maximum accuracy, combine semantic search + keyword search (BM25) + re-ranking:
from llama_index.retrievers import BM25Retriever
from llama_index.retrievers import QueryFusionRetriever
# Hybrid retriever (semantic + keyword)
vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)
fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    mode="relative_score",  # Combine scores
    num_queries=1
)
# Add re-ranker on top
from llama_index.query_engine import RetrieverQueryEngine
# Wrap the fusion retriever in a query engine (as_query_engine would rebuild the index's own retriever)
query_engine = RetrieverQueryEngine.from_args(
    retriever=fusion_retriever,
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=5, model="rerank-english-v3.0")
    ]
)
response = query_engine.query("EU refund policy for bulk orders")
# Gets best of semantic + keyword, then re-ranks top candidates
Thread Transfer integration: Re-ranking conversation chunks
Thread Transfer bundles contain structured conversation history. Re-ranking helps find the most relevant conversation threads (a short sketch follows the steps below):
- Export bundles as JSON: Each bundle = conversation with metadata (participants, timestamps, decisions)
- Chunk by message or thread: Create chunks at message-level or thread-level granularity
- Embed + index: Use standard RAG pipeline
- Re-rank results: When querying "What did we decide about feature X?", re-ranker surfaces decision-making threads over casual mentions
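A minimal sketch of steps 1-2 under an assumed bundle schema (the thread_id, participants, and messages fields below are hypothetical; adapt them to your actual export format):
import json
from llama_index import Document  # matches the imports used earlier in this guide

# Hypothetical bundle schema:
# {"thread_id": ..., "participants": [...], "messages": [{"author": ..., "timestamp": ..., "text": ...}]}
with open("bundle.json") as f:
    bundle = json.load(f)

# Thread-level chunking: one Document per conversation thread, metadata preserved for filtering
documents = [
    Document(
        text="\n".join(f"{m['author']}: {m['text']}" for m in bundle["messages"]),
        metadata={"thread_id": bundle["thread_id"], "participants": bundle["participants"]},
    )
]
# These Documents then go through the same embed -> index -> re-rank pipeline shown above.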
Common pitfalls and fixes
Pitfall 1: Re-ranking too few candidates
If you retrieve only 5 candidates and re-rank all 5, the re-ranker can merely reorder what vector search already surfaced; relevant documents ranked 6th or lower never get a chance. Fix: Retrieve 20-50 candidates, re-rank to top 5. This gives the re-ranker more signal to work with.
Pitfall 2: Re-ranking everything
Re-ranking 100 documents costs 100x more than re-ranking 10. Fix: Use a two-stage approach: vector search (top-50) → lightweight re-ranker (top-20) → expensive re-ranker (top-5). See the cascading re-rankers section below for a worked example.
Pitfall 3: Ignoring re-ranker input limits
Most re-rankers have 512-1024 token limits. If your chunks are 2000 tokens, they'll be truncated. Fix: Chunk documents smaller (300-500 tokens) or use re-rankers with longer context (Cohere Rerank 3 supports 4096 tokens).
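A quick pre-flight check catches oversized chunks before they are silently truncated (tiktoken's cl100k_base tokenizer is used here only as a rough proxy; each re-ranker applies its own tokenizer, so treat the counts as approximate):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not the re-ranker's own

def oversized_chunks(chunks: list[str], limit: int = 512) -> list[int]:
    """Return indices of chunks that a re-ranker with a `limit`-token window would truncate."""
    return [i for i, chunk in enumerate(chunks) if len(enc.encode(chunk)) > limit]

flagged = oversized_chunks([doc.text for doc in documents])
if flagged:
    print(f"{len(flagged)} chunks exceed the re-ranker limit and will be truncated")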
2025 developments: Mixedbread AI and reinforcement learning
Mixedbread AI's 2025 models use three-stage reinforcement learning, achieving 57.49 BEIR score (outperforming Cohere on some benchmarks). They're Apache 2.0 licensed and fully self-hostable, making them ideal for teams with strict data privacy requirements.
Cascading re-rankers for cost optimization
Run a cheap re-ranker first, then an expensive one only on top candidates:
from rerankers import Reranker  # imported in the earlier examples; repeated so this block stands alone
# Stage 1: FlashRank (fast, cheap)
flashrank_reranker = Reranker("flashrank")
stage1_results = flashrank_reranker.rank(query, docs, top_k=10)
# Stage 2: Cohere Rerank (accurate, expensive) on top 10
stage1_docs = [r.text for r in stage1_results]
cohere_reranker = Reranker("cohere", api_key="your-key")
final_results = cohere_reranker.rank(query, stage1_docs, top_k=3)
# Cost: FlashRank (free) + Cohere on 10 docs instead of 50 = 80% cost savings
Final recommendations
Add re-ranking to every production RAG system. Start with Cohere Rerank or BGE-reranker-large (depending on budget/hosting). Retrieve 20-50 candidates, re-rank to top 5. Measure MRR@5 on 50+ test queries—you should see 15-30% accuracy gains with 200-400ms added latency.
For cost-sensitive workloads, use FlashRank or cascading re-rankers. For high-stakes queries (legal, medical), add LLM re-ranking as a final verification layer. Combine with hybrid search (semantic + BM25) for best results.
Learn more: How it works · Why bundles beat raw thread history