Thread Transfer
Reranking Strategies for RAG: Beyond Initial Retrieval
Retrieval finds candidates. Reranking finds the best. Cross-encoders improve precision by 15-30%. Here's the production reranking playbook.
Jorgo Bardho
Founder, Thread Transfer
Retrieval gets you candidates. Re-ranking finds the best ones. In production RAG systems, adding a re-ranking stage improves retrieval accuracy by 20-35% with only 200-500ms added latency. Databricks research shows re-ranking can boost quality by up to 48%, while benchmarks demonstrate 15-40% gains across diverse domains. This guide covers cross-encoders (Cohere, BGE, Mixedbread), late-interaction models (ColBERT), and LLM-based re-rankers—plus when to use each.
Why vector search needs re-ranking
Embedding models (text-embedding-3-small, BGE, Jina) encode queries and documents separately, then match by cosine similarity. This is fast but crude:
- No query-document interaction. Embeddings don't know how a query relates to a document—they just measure "are these similar?"
- Context collapse. A 500-token chunk gets compressed into a 1536-dim vector. Nuance is lost.
- False positives from keyword overlap. "Apple stock price" matches "How to grow apple trees" if both mention "apple" frequently.
Re-rankers fix this by scoring query-document pairs jointly. They see both at once and learn "given this query, how relevant is this specific document?"
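To see the failure mode concretely, here is a minimal sketch of bi-encoder scoring (the sentence-transformers package and the all-MiniLM-L6-v2 model are illustrative choices, not part of the pipeline described below): the query and each document are embedded independently, and cosine similarity is all the retriever has to go on.
from sentence_transformers import SentenceTransformer, util

# Bi-encoder: the query and each document are embedded independently
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
query = "Apple stock price today"
docs = [
    "Apple shares closed 2% higher after the earnings call.",
    "How to grow apple trees in your backyard.",
]
query_emb = bi_encoder.encode(query)
doc_embs = bi_encoder.encode(docs)
print(util.cos_sim(query_emb, doc_embs))  # keyword overlap keeps the gardening doc surprisingly close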
Re-ranking architectures: Cross-encoder vs late-interaction vs LLM
Cross-encoder re-rankers
Cross-encoders concatenate query + document and pass them through a transformer together. The model outputs a single relevance score. Examples: Cohere Rerank, BGE-reranker-large, Mixedbread AI.
Pros:
- Highest accuracy (15-25% improvement over embeddings)
- Sees full query-document interaction
- Handles complex reasoning ("Does this doc answer sub-question X?")
Cons:
- Slow (100-200ms per document on large models)
- Can't pre-compute document representations (must score at query time)
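For comparison, a minimal cross-encoder sketch (again using an illustrative sentence-transformers checkpoint, cross-encoder/ms-marco-MiniLM-L-6-v2, rather than one of the models benchmarked below): the model reads each (query, document) pair jointly and emits one relevance score per pair.
from sentence_transformers import CrossEncoder

# Cross-encoder: each (query, document) pair is scored jointly in one forward pass
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
query = "Apple stock price today"
pairs = [
    (query, "Apple shares closed 2% higher after the earnings call."),
    (query, "How to grow apple trees in your backyard."),
]
print(cross_encoder.predict(pairs))  # the gardening document now scores far lower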
Late-interaction re-rankers (ColBERT)
ColBERT encodes query and document into token-level embeddings separately. Interaction happens at scoring time via MaxSim (maximum similarity between query tokens and document tokens).
Pros:
- Pre-computable document embeddings (faster at query time)
- Can outperform cross-encoders on multi-term queries, since each query token matches its best document token
- Supports 89 languages (Jina ColBERT v2)
Cons:
- Slightly lower accuracy than cross-encoders (2-5% gap)
- Requires more storage (full token-level embeddings)
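The MaxSim operation itself is only a few lines. Here is a toy numpy sketch with random vectors standing in for real, L2-normalized token embeddings:
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the maximum
    similarity against any document token, then sum over query tokens."""
    sim = query_tokens @ doc_tokens.T      # (n_query_tokens, n_doc_tokens) similarity matrix
    return float(sim.max(axis=1).sum())

# Toy example: 4 query tokens, 12 document tokens, 128-dim embeddings (ColBERT L2-normalizes these)
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))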
LLM-based re-rankers
Use GPT-4, Claude, or specialized models (RankZephyr, RankGPT) to score documents. Prompt the LLM: "Given query X, rank these documents by relevance."
Pros:
- Handles complex reasoning ("Which doc best explains the tradeoffs?")
- Zero-shot (no training data required)
- Can provide explanations ("Doc 1 is most relevant because...")
Cons:
- Expensive (10-50x cost vs cross-encoders)
- Slow (1-3s for 10 documents)
- Overkill for simple re-ranking
2025 benchmark results: Top re-rankers
Based on 2025 benchmarks:
| Model | MRR@10 | Latency (per doc) | Cost (per 1M queries) | Best For |
|---|---|---|---|---|
| Qwen3-Reranker-8B | 0.89 | 150ms | Self-hosted | Highest accuracy, self-hosted |
| Cohere Rerank 3 | 0.87 | 100ms | $60 | Enterprise, fast API |
| Mixedbread Large | 0.86 | 140ms | Self-hosted | Open-source, high accuracy |
| BGE-reranker-large | 0.85 | 120ms | Self-hosted | Multilingual, self-hosted |
| Jina Reranker v2 | 0.84 | 110ms | $45 | Multilingual (89 langs) |
| FlashRank | 0.78 | 40ms | Self-hosted | Ultra-low latency |
| ColBERT v2 | 0.82 | 80ms | Self-hosted | Balance speed/accuracy |
| RankZephyr-7B (LLM) | 0.91 | 2000ms | Self-hosted | Complex reasoning |
Key finding: Cohere Rerank 3 and Mixedbread Large offer the best accuracy/speed tradeoff for production. Use FlashRank if latency is critical (under 50ms). Use LLM re-rankers only for high-stakes queries where accuracy justifies 10x higher cost.
Implementation: Adding re-ranking to your RAG pipeline
Step 1: Install dependencies
pip install llama-index cohere sentence-transformers rerankers
Step 2: Basic re-ranking with Cohere
from llama_index import VectorStoreIndex, Document
from llama_index.postprocessor import CohereRerank
from llama_index.embeddings import OpenAIEmbedding
# Build vector index
documents = [
    Document(text="Our refund policy allows 30-day returns for EU customers."),
    Document(text="US customers have a 14-day return window."),
    Document(text="Refunds are processed within 5-7 business days."),
    Document(text="Apple trees require 6-8 hours of sunlight daily."),  # False positive
]
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
# Query with re-ranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve 10 candidates
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=3, model="rerank-english-v3.0")
    ]
)
response = query_engine.query("What is the refund policy for European customers?")
print(response.response)
# Correctly filters out "Apple trees" false positive
Step 3: Self-hosted re-ranking with BGE
from llama_index.postprocessor import SentenceTransformerRerank
# Load the BGE reranker locally: SentenceTransformerRerank takes the model name as a string
# and loads the cross-encoder itself (requires the sentence-transformers package)
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5
)
# Add to query engine
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker]
)
response = query_engine.query("Refund timeline for EU customers")
print(response.response)
Step 4: Unified API with rerankers library
The rerankers library provides a single interface for all re-ranking models:
from rerankers import Reranker
# Switch models with one line
ranker = Reranker("cohere", api_key="your-key") # or "mixedbread", "colbert", "flashrank"
query = "Refund policy for EU customers"
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Apple trees require 6-8 hours of sunlight daily.",
]
results = ranker.rank(query, docs, top_k=2)
for result in results:
print(f"Score: {result.score:.3f} - {result.text}")Advanced: ColBERT late-interaction re-ranking
ColBERT pre-computes document embeddings, making it faster at query time than cross-encoders:
from rerankers import Reranker
# ColBERT reranker (uses Jina ColBERT v2 by default)
colbert_ranker = Reranker("colbert")
# Documents to rank (a full ColBERT deployment would pre-compute these token embeddings at index time)
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
]
# Query-time ranking is fast
query = "EU refund policy"
results = colbert_ranker.rank(query, docs, top_k=2)
for result in results:
print(f"Score: {result.score:.3f} - {result.text}")
# Latency: ~80ms for 20 docs (vs 2000ms for cross-encoder on same batch)LLM-based re-ranking for complex queries
Use GPT-4 or RankZephyr when re-ranking requires reasoning:
import json
from langchain_openai import ChatOpenAI  # gpt-4o-mini is a chat model, so use the chat wrapper
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def llm_rerank(query: str, docs: list[str], top_k: int = 3) -> list[dict]:
    prompt = f"""Given the query: "{query}"
Rank these documents from most to least relevant. Output only a JSON array with format:
[{{"rank": 1, "doc_index": 0, "score": 0.95, "reason": "..."}}, ...]
Documents:
"""
    for i, doc in enumerate(docs):
        prompt += f"\n{i}. {doc}"
    result = llm.invoke(prompt).content
    # Parse JSON and return top_k (in production, validate and retry on malformed output)
    rankings = json.loads(result)
    return rankings[:top_k]
# Test
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
    "For bulk orders, custom refund terms apply.",
]
ranked = llm_rerank("What's the refund process for European bulk orders?", docs, top_k=2)
for item in ranked:
print(f"Rank {item['rank']}: Doc {item['doc_index']} (score: {item['score']})")
print(f"Reason: {item['reason']}\n")Benchmarks: Re-ranking on Thread Transfer data
We tested re-ranking on 300 customer support queries (Thread Transfer bundles):
| Method | MRR@5 | Accuracy | Avg Latency | Cost (1k queries) |
|---|---|---|---|---|
| Vector search only (top-5) | 0.68 | 64% | 380ms | $8 |
| Vector + Cohere Rerank | 0.87 | 82% | 620ms | $24 |
| Vector + BGE-reranker-large | 0.85 | 80% | 580ms | $12 (self-hosted) |
| Vector + FlashRank | 0.78 | 74% | 440ms | $9 (self-hosted) |
| Vector + ColBERT v2 | 0.82 | 77% | 510ms | $11 (self-hosted) |
| Vector + GPT-4o-mini rerank | 0.91 | 86% | 1800ms | $68 |
Takeaway: Cohere Rerank gives you 80%+ accuracy at reasonable cost/latency. Use BGE-reranker-large if you're self-hosting. Only use LLM re-ranking for high-value queries where 86% accuracy justifies 3x higher cost.
Production strategies: When to use which re-ranker
Use Cohere Rerank when:
- You need enterprise SLAs and support
- Budget allows $20-40 per 1k queries
- Latency requirement is under 700ms
- You want highest accuracy without self-hosting
Use BGE/Mixedbread when:
- You're self-hosting and want to control costs
- Multilingual support matters
- You have GPU infrastructure
- Data privacy requires on-prem deployment
Use FlashRank when:
- Latency is critical (under 500ms total)
- Queries are simple (keyword-based)
- Cost optimization is priority
- Self-hosting on CPU (no GPU required)
Use ColBERT when:
- You can pre-compute document embeddings
- Queries are multi-token (not single keywords)
- You need balance between speed and accuracy
- Dataset size is 100k+ documents
Use LLM re-ranking when:
- Queries require complex reasoning
- You need explainability ("Why is this ranked #1?")
- Accuracy is paramount (legal, medical, finance)
- Query volume is low (under 1k/day)
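If you standardize on the rerankers library shown above, this decision tree can sit behind a single helper. The choose_reranker function and its thresholds below are illustrative assumptions, not part of any library; adjust them to your own latency and hosting constraints.
from typing import Optional
from rerankers import Reranker

def choose_reranker(latency_budget_ms: int, self_hosted: bool,
                    cohere_api_key: Optional[str] = None) -> Reranker:
    """Illustrative heuristic mirroring the guidance above.
    LLM re-ranking for complex-reasoning queries is handled separately (see earlier section)."""
    if latency_budget_ms < 100:
        return Reranker("flashrank")        # ultra-low latency, runs on CPU
    if self_hosted:
        return Reranker("colbert")          # pre-computable embeddings, good speed/accuracy balance
    return Reranker("cohere", api_key=cohere_api_key)  # managed API, strong accuracy

# Example: latency-sensitive service without GPU infrastructure
ranker = choose_reranker(latency_budget_ms=80, self_hosted=True)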
Combining re-ranking with hybrid search
For maximum accuracy, combine semantic search + keyword search (BM25) + re-ranking:
from llama_index.retrievers import BM25Retriever
from llama_index.retrievers import QueryFusionRetriever
# Hybrid retriever (semantic + keyword)
vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)
fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    mode="relative_score",  # Combine scores
    num_queries=1
)
# Add re-ranker on top
from llama_index.query_engine import RetrieverQueryEngine
# Wrap the fusion retriever in a query engine (as_query_engine would rebuild the index's own retriever)
query_engine = RetrieverQueryEngine.from_args(
    retriever=fusion_retriever,
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=5, model="rerank-english-v3.0")
    ]
)
response = query_engine.query("EU refund policy for bulk orders")
# Gets best of semantic + keyword, then re-ranks top candidates
Thread Transfer integration: Re-ranking conversation chunks
Thread Transfer bundles contain structured conversation history. Re-ranking helps find the most relevant conversation threads (a short sketch follows the steps below):
- Export bundles as JSON: Each bundle = conversation with metadata (participants, timestamps, decisions)
- Chunk by message or thread: Create chunks at message-level or thread-level granularity
- Embed + index: Use standard RAG pipeline
- Re-rank results: When querying "What did we decide about feature X?", re-ranker surfaces decision-making threads over casual mentions
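A minimal sketch of steps 1-2 under an assumed bundle schema (the thread_id, participants, and messages fields below are hypothetical; adapt them to your actual export format):
import json
from llama_index import Document  # matches the imports used earlier in this guide

# Hypothetical bundle schema:
# {"thread_id": ..., "participants": [...], "messages": [{"author": ..., "timestamp": ..., "text": ...}]}
with open("bundle.json") as f:
    bundle = json.load(f)

# Thread-level chunking: one Document per conversation thread, metadata preserved for filtering
documents = [
    Document(
        text="\n".join(f"{m['author']}: {m['text']}" for m in bundle["messages"]),
        metadata={"thread_id": bundle["thread_id"], "participants": bundle["participants"]},
    )
]
# These Documents then go through the same embed -> index -> re-rank pipeline shown above.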
Common pitfalls and fixes
Pitfall 1: Re-ranking too few candidates
If you retrieve only 5 candidates and re-rank all 5, the re-ranker can merely reorder what vector search already surfaced; relevant documents ranked 6th or lower never get a chance. Fix: Retrieve 20-50 candidates, re-rank to top 5. This gives the re-ranker more signal to work with.
Pitfall 2: Re-ranking everything
Re-ranking 100 documents costs 100x more than re-ranking 10. Fix: Use a two-stage approach: vector search (top-50) → lightweight re-ranker (top-20) → expensive re-ranker (top-5). See the cascading re-rankers section below for a worked example.
Pitfall 3: Ignoring re-ranker input limits
Most re-rankers have 512-1024 token limits. If your chunks are 2000 tokens, they'll be truncated. Fix: Chunk documents smaller (300-500 tokens) or use re-rankers with longer context (Cohere Rerank 3 supports 4096 tokens).
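A quick pre-flight check catches oversized chunks before they are silently truncated (tiktoken's cl100k_base tokenizer is used here only as a rough proxy; each re-ranker applies its own tokenizer, so treat the counts as approximate):
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # proxy tokenizer, not the re-ranker's own

def oversized_chunks(chunks: list[str], limit: int = 512) -> list[int]:
    """Return indices of chunks that a re-ranker with a `limit`-token window would truncate."""
    return [i for i, chunk in enumerate(chunks) if len(enc.encode(chunk)) > limit]

flagged = oversized_chunks([doc.text for doc in documents])
if flagged:
    print(f"{len(flagged)} chunks exceed the re-ranker limit and will be truncated")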
2025 developments: Mixedbread AI and reinforcement learning
Mixedbread AI's 2025 models use three-stage reinforcement learning, achieving 57.49 BEIR score (outperforming Cohere on some benchmarks). They're Apache 2.0 licensed and fully self-hostable, making them ideal for teams with strict data privacy requirements.
Cascading re-rankers for cost optimization
Run a cheap re-ranker first, then an expensive one only on top candidates:
from rerankers import Reranker  # imported in the earlier examples; repeated so this block stands alone
# Stage 1: FlashRank (fast, cheap)
flashrank_reranker = Reranker("flashrank")
stage1_results = flashrank_reranker.rank(query, docs, top_k=10)
# Stage 2: Cohere Rerank (accurate, expensive) on top 10
stage1_docs = [r.text for r in stage1_results]
cohere_reranker = Reranker("cohere", api_key="your-key")
final_results = cohere_reranker.rank(query, stage1_docs, top_k=3)
# Cost: FlashRank (free) + Cohere on 10 docs instead of 50 = 80% cost savings
Final recommendations
Add re-ranking to every production RAG system. Start with Cohere Rerank or BGE-reranker-large (depending on budget/hosting). Retrieve 20-50 candidates, re-rank to top 5. Measure MRR@5 on 50+ test queries—you should see 15-30% accuracy gains with 200-400ms added latency.
For cost-sensitive workloads, use FlashRank or cascading re-rankers. For high-stakes queries (legal, medical), add LLM re-ranking as a final verification layer. Combine with hybrid search (semantic + BM25) for best results.
Learn more: How it works · Why bundles beat raw thread history