
Thread Transfer

Reranking Strategies for RAG: Beyond Initial Retrieval

Retrieval finds candidates. Reranking finds the best. Cross-encoders improve precision by 15-30%. Here's the production reranking playbook.

Jorgo Bardho

Founder, Thread Transfer

July 28, 2025 · 15 min read
RAG · reranking · cross-encoder · ColBERT · retrieval
[Figure: Re-ranking pipeline diagram]

Retrieval gets you candidates. Re-ranking finds the best ones. In production RAG systems, adding a re-ranking stage improves retrieval accuracy by 20-35% with only 200-500ms added latency. Databricks research shows re-ranking can boost quality by up to 48%, while benchmarks demonstrate 15-40% gains across diverse domains. This guide covers cross-encoders (Cohere, BGE, Mixedbread), late-interaction models (ColBERT), and LLM-based re-rankers—plus when to use each.

Why vector search needs re-ranking

Embedding models (text-embedding-3-small, BGE, Jina) encode queries and documents separately, then match by cosine similarity. This is fast but crude:

  • No query-document interaction. Embeddings don't know how a query relates to a document—they just measure "are these similar?"
  • Context collapse. A 500-token chunk gets compressed into a 1536-dim vector. Nuance is lost.
  • False positives from keyword overlap. "Apple stock price" matches "How to grow apple trees" if both mention "apple" frequently.

Re-rankers fix this by scoring query-document pairs jointly. They see both at once and learn "given this query, how relevant is this specific document?"
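The difference is easy to see in code. Here is a minimal sketch using sentence-transformers; the model names and the query/document pair are illustrative, not a recommendation:

from sentence_transformers import CrossEncoder, SentenceTransformer, util

query = "Apple stock price"
doc = "How to grow apple trees in your backyard"

# Bi-encoder: query and document are embedded independently, then compared.
# Shared surface terms ("apple") tend to pull the cosine similarity up.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb, d_emb = bi_encoder.encode([query, doc])
print("cosine similarity:", util.cos_sim(q_emb, d_emb).item())

# Cross-encoder: the pair is scored jointly in a single forward pass, so the
# model can judge whether a gardening doc actually answers a finance query.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("relevance score:", cross_encoder.predict([(query, doc)])[0])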

Re-ranking architectures: Cross-encoder vs late-interaction vs LLM

Cross-encoder re-rankers

Cross-encoders concatenate query + document and pass them through a transformer together. The model outputs a single relevance score. Examples: Cohere Rerank, BGE-reranker-large, Mixedbread AI.

Pros:

  • Highest accuracy (15-25% improvement over embeddings)
  • Sees full query-document interaction
  • Handles complex reasoning ("Does this doc answer sub-question X?")

Cons:

  • Slow (100-200ms per document on large models)
  • Can't pre-compute document representations (must score at query time)

Late-interaction re-rankers (ColBERT)

ColBERT encodes query and document into token-level embeddings separately. Interaction happens at scoring time via MaxSim (maximum similarity between query tokens and document tokens).

Pros:

  • Pre-computable document embeddings (faster at query time)
  • Better than cross-encoders on multi-vector queries
  • Supports 89 languages (Jina ColBERT v2)

Cons:

  • Slightly lower accuracy than cross-encoders (2-5% gap)
  • Requires more storage (full token-level embeddings)
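MaxSim itself is only a few lines: for each query token, take its maximum similarity over all document tokens, then sum across query tokens. A minimal numpy sketch, using random stand-in vectors in place of real ColBERT token embeddings:

import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.

    query_tokens: (num_query_tokens, dim) L2-normalized token embeddings
    doc_tokens:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token
    sim = query_tokens @ doc_tokens.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum
    return float(sim.max(axis=1).sum())

# Toy example with random stand-in embeddings; in a real system the document
# token embeddings are produced by ColBERT and pre-computed offline.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))
q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(40, 128))
d /= np.linalg.norm(d, axis=1, keepdims=True)
print(maxsim_score(q, d))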

LLM-based re-rankers

Use GPT-4, Claude, or specialized models (RankZephyr, RankGPT) to score documents. Prompt the LLM: "Given query X, rank these documents by relevance."

Pros:

  • Handles complex reasoning ("Which doc best explains the tradeoffs?")
  • Zero-shot (no training data required)
  • Can provide explanations ("Doc 1 is most relevant because...")

Cons:

  • Expensive (10-50x cost vs cross-encoders)
  • Slow (1-3s for 10 documents)
  • Overkill for simple re-ranking

2025 benchmark results: Top re-rankers

Based on 2025 benchmarks:

Model | MRR@10 | Latency (per doc) | Cost (per 1M queries) | Best for
Qwen3-Reranker-8B | 0.89 | 150ms | Self-hosted | Highest accuracy, self-hosted
Cohere Rerank 3 | 0.87 | 100ms | $60 | Enterprise, fast API
Mixedbread Large | 0.86 | 140ms | Self-hosted | Open-source, high accuracy
BGE-reranker-large | 0.85 | 120ms | Self-hosted | Multilingual, self-hosted
Jina Reranker v2 | 0.84 | 110ms | $45 | Multilingual (89 langs)
FlashRank | 0.78 | 40ms | Self-hosted | Ultra-low latency
ColBERT v2 | 0.82 | 80ms | Self-hosted | Balance of speed/accuracy
RankZephyr-7B (LLM) | 0.91 | 2000ms | Self-hosted | Complex reasoning

Key finding: Cohere Rerank 3 and Mixedbread Large offer the best accuracy/speed tradeoff for production. Use FlashRank if latency is critical (under 50ms). Use LLM re-rankers only for high-stakes queries where accuracy justifies 10x higher cost.

Implementation: Adding re-ranking to your RAG pipeline

Step 1: Install dependencies

pip install llama-index cohere sentence-transformers rerankers

Step 2: Basic re-ranking with Cohere

from llama_index import VectorStoreIndex, Document
from llama_index.postprocessor import CohereRerank
from llama_index.embeddings import OpenAIEmbedding

# Build vector index
documents = [
    Document(text="Our refund policy allows 30-day returns for EU customers."),
    Document(text="US customers have a 14-day return window."),
    Document(text="Refunds are processed within 5-7 business days."),
    Document(text="Apple trees require 6-8 hours of sunlight daily."),  # False positive
]

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Query with re-ranking
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve 10 candidates
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=3, model="rerank-english-v3.0")
    ]
)

response = query_engine.query("What is the refund policy for European customers?")
print(response.response)
# Correctly filters out "Apple trees" false positive

Step 3: Self-hosted re-ranking with BGE

from llama_index.postprocessor import SentenceTransformerRerank

# Load the BGE cross-encoder locally; SentenceTransformerRerank takes a model
# name and downloads/runs it on your hardware (no API key required)
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-large",
    top_n=5
)

# Add to query engine
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[reranker]
)

response = query_engine.query("Refund timeline for EU customers")
print(response.response)

Step 4: Unified API with rerankers library

The rerankers library provides a single interface for all re-ranking models:

from rerankers import Reranker

# Switch models with one line
ranker = Reranker("cohere", api_key="your-key")  # or "mixedbread", "colbert", "flashrank"

query = "Refund policy for EU customers"
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Apple trees require 6-8 hours of sunlight daily.",
]

results = ranker.rank(query, docs, top_k=2)
for result in results:
    print(f"Score: {result.score:.3f} - {result.text}")

Advanced: ColBERT late-interaction re-ranking

ColBERT pre-computes document embeddings, making it faster at query time than cross-encoders:

from rerankers import Reranker

# ColBERT reranker (uses Jina ColBERT v2 by default)
colbert_ranker = Reranker("colbert")

# Index documents once (pre-compute embeddings)
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
]

# Query-time ranking is fast
query = "EU refund policy"
results = colbert_ranker.rank(query, docs, top_k=2)

for result in results:
    print(f"Score: {result.score:.3f} - {result.text}")

# Latency: ~80ms for 20 docs (vs 2000ms for cross-encoder on same batch)

LLM-based re-ranking for complex queries

Use GPT-4 or RankZephyr when re-ranking requires reasoning:

import json

# Requires: pip install langchain-openai
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def llm_rerank(query: str, docs: list[str], top_k: int = 3) -> list[dict]:
    prompt = f"""Given the query: "{query}"

Rank these documents from most to least relevant. Output only a JSON array with format:
[{{"rank": 1, "doc_index": 0, "score": 0.95, "reason": "..."}}, ...]

Documents:
"""
    for i, doc in enumerate(docs):
        prompt += f"\n{i}. {doc}"

    # Chat models return a message object; the JSON array is in .content
    result = llm.invoke(prompt).content
    rankings = json.loads(result)
    return rankings[:top_k]

# Test
docs = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
    "Refunds are processed within 5-7 business days.",
    "For bulk orders, custom refund terms apply.",
]

ranked = llm_rerank("What's the refund process for European bulk orders?", docs, top_k=2)
for item in ranked:
    print(f"Rank {item['rank']}: Doc {item['doc_index']} (score: {item['score']})")
    print(f"Reason: {item['reason']}\n")

Benchmarks: Re-ranking on Thread Transfer data

We tested re-ranking on 300 customer support queries (Thread Transfer bundles):

Method | MRR@5 | Accuracy | Avg latency | Cost (1k queries)
Vector search only (top-5) | 0.68 | 64% | 380ms | $8
Vector + Cohere Rerank | 0.87 | 82% | 620ms | $24
Vector + BGE-reranker-large | 0.85 | 80% | 580ms | $12 (self-hosted)
Vector + FlashRank | 0.78 | 74% | 440ms | $9 (self-hosted)
Vector + ColBERT v2 | 0.82 | 77% | 510ms | $11 (self-hosted)
Vector + GPT-4o-mini rerank | 0.91 | 86% | 1800ms | $68

Takeaway: Cohere Rerank gives you 80%+ accuracy at reasonable cost/latency. Use BGE-reranker-large if you're self-hosting. Only use LLM re-ranking for high-value queries where 86% accuracy justifies 3x higher cost.

Production strategies: When to use which re-ranker

Use Cohere Rerank when:

  • You need enterprise SLAs and support
  • Budget allows $20-40 per 1k queries
  • Latency requirement is under 700ms
  • You want highest accuracy without self-hosting

Use BGE/Mixedbread when:

  • You're self-hosting and want to control costs
  • Multilingual support matters
  • You have GPU infrastructure
  • Data privacy requires on-prem deployment

Use FlashRank when:

  • Latency is critical (under 500ms total)
  • Queries are simple (keyword-based)
  • Cost optimization is priority
  • Self-hosting on CPU (no GPU required)

Use ColBERT when:

  • You can pre-compute document embeddings
  • Queries are multi-token (not single keywords)
  • You need balance between speed and accuracy
  • Dataset size is 100k+ documents

Use LLM re-ranking when:

  • Queries require complex reasoning
  • You need explainability ("Why is this ranked #1?")
  • Accuracy is paramount (legal, medical, finance)
  • Query volume is low (under 1k/day)

Combining re-ranking with hybrid search

For maximum accuracy, combine semantic search + keyword search (BM25) + re-ranking:

from llama_index.query_engine import RetrieverQueryEngine
from llama_index.retrievers import BM25Retriever, QueryFusionRetriever

# Hybrid retriever (semantic + keyword); BM25Retriever needs: pip install rank_bm25
vector_retriever = index.as_retriever(similarity_top_k=10)
bm25_retriever = BM25Retriever.from_defaults(docstore=index.docstore, similarity_top_k=10)

fusion_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    mode="relative_score",  # combine normalized scores from both retrievers
    num_queries=1           # no query expansion; just fuse the two result lists
)

# Add re-ranker on top of the fused candidates. Build the query engine from the
# fusion retriever directly so the hybrid results are what gets re-ranked.
query_engine = RetrieverQueryEngine.from_args(
    retriever=fusion_retriever,
    node_postprocessors=[
        CohereRerank(api_key="your-key", top_n=5, model="rerank-english-v3.0")
    ]
)

response = query_engine.query("EU refund policy for bulk orders")
# Gets best of semantic + keyword, then re-ranks top candidates

Thread Transfer integration: Re-ranking conversation chunks

Thread Transfer bundles contain structured conversation history. Re-ranking helps find the most relevant conversation threads:

  • Export bundles as JSON: Each bundle = conversation with metadata (participants, timestamps, decisions)
  • Chunk by message or thread: Create chunks at message-level or thread-level granularity
  • Embed + index: Use standard RAG pipeline
  • Re-rank results: When querying "What did we decide about feature X?", the re-ranker surfaces decision-making threads over casual mentions (see the sketch below)
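A minimal sketch of the chunking step. The bundle layout here (thread_id, participants, messages with author/timestamp/text) is assumed for illustration only; adjust the field names to match your actual Thread Transfer export:

import json

from llama_index import Document

# Hypothetical bundle shape for illustration; the real export schema may differ
bundle = json.loads("""
{
  "thread_id": "t-123",
  "participants": ["alice", "bob"],
  "messages": [
    {"author": "alice", "timestamp": "2025-07-01T10:00:00Z",
     "text": "Should feature X ship behind a flag?"},
    {"author": "bob", "timestamp": "2025-07-01T10:05:00Z",
     "text": "Decision: yes, flag it and roll out to 10% first."}
  ]
}
""")

# Message-level chunks, keeping thread metadata for filtering and citations
documents = [
    Document(
        text=msg["text"],
        metadata={
            "thread_id": bundle["thread_id"],
            "author": msg["author"],
            "timestamp": msg["timestamp"],
        },
    )
    for msg in bundle["messages"]
]
# From here, embed + index + re-rank exactly as in the pipeline above.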

Common pitfalls and fixes

Pitfall 1: Re-ranking too few candidates

If you retrieve 5 candidates and re-rank all 5, the re-ranker can only re-order what the embedding model already surfaced; relevant documents sitting at positions 6-50 never get a chance. Fix: Retrieve 20-50 candidates, then re-rank down to the top 5. This gives the re-ranker more signal to work with.

Pitfall 2: Re-ranking everything

Re-ranking 100 documents costs 100x more than re-ranking 10. Fix: Use a two-stage approach: vector search (top-50) → lightweight re-ranker (top-20) → expensive re-ranker (top-5).

Pitfall 3: Ignoring re-ranker input limits

Most re-rankers have 512-1024 token limits. If your chunks are 2000 tokens, they'll be truncated. Fix: Chunk documents smaller (300-500 tokens) or use re-rankers with longer context (Cohere Rerank 3 supports 4096 tokens).
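One way to catch oversized chunks before re-ranking is to count tokens with the re-ranker's own tokenizer. A sketch assuming a HuggingFace cross-encoder such as BGE; the 512-token cap matches bge-reranker-large, so swap in your model's limit:

from transformers import AutoTokenizer

# Use the re-ranker's own tokenizer so the count matches what it will see
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")
MAX_TOKENS = 512  # query + document are encoded together against this limit

def oversized_chunks(query: str, chunks: list[str]) -> list[int]:
    """Return indices of chunks that would be truncated by the re-ranker."""
    flagged = []
    for i, chunk in enumerate(chunks):
        n_tokens = len(tokenizer.encode(query, chunk))
        if n_tokens > MAX_TOKENS:
            flagged.append(i)
    return flagged

chunks = [
    "Our refund policy allows 30-day returns for EU customers.",
    "US customers have a 14-day return window.",
]
print(oversized_chunks("EU refund policy", chunks))  # [] -- both fit comfortably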

2025 developments: Mixedbread AI and reinforcement learning

Mixedbread AI's 2025 models use three-stage reinforcement learning, achieving 57.49 BEIR score (outperforming Cohere on some benchmarks). They're Apache 2.0 licensed and fully self-hostable, making them ideal for teams with strict data privacy requirements.

Cascading re-rankers for cost optimization

Run a cheap re-ranker first, then an expensive one only on top candidates:

# Stage 1: FlashRank (fast, cheap)
flashrank_reranker = Reranker("flashrank")
stage1_results = flashrank_reranker.rank(query, docs, top_k=10)

# Stage 2: Cohere Rerank (accurate, expensive) on top 10
stage1_docs = [r.text for r in stage1_results]
cohere_reranker = Reranker("cohere", api_key="your-key")
final_results = cohere_reranker.rank(query, stage1_docs, top_k=3)

# Cost: FlashRank (free) + Cohere on 10 docs instead of 50 = 80% cost savings

Final recommendations

Add re-ranking to every production RAG system. Start with Cohere Rerank or BGE-reranker-large (depending on budget/hosting). Retrieve 20-50 candidates, re-rank to top 5. Measure MRR@5 on 50+ test queries—you should see 15-30% accuracy gains with 200-400ms added latency.
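MRR@5 is straightforward to compute yourself. A minimal sketch, assuming you record, for each test query, the 1-based rank of the first relevant chunk in the re-ranked list (None if it doesn't appear in the top 5):

from typing import Optional

def mrr_at_k(first_relevant_ranks: list[Optional[int]], k: int = 5) -> float:
    """Mean reciprocal rank: average of 1/rank for the first relevant hit,
    counting 0 when no relevant document appears in the top k."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

# Example: 3 test queries -- relevant doc at rank 1, rank 3, and missing
print(mrr_at_k([1, 3, None]))  # (1 + 1/3 + 0) / 3 ≈ 0.444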

For cost-sensitive workloads, use FlashRank or cascading re-rankers. For high-stakes queries (legal, medical), add LLM re-ranking as a final verification layer. Combine with hybrid search (semantic + BM25) for best results.