Thread Transfer
Context Quality Scoring: Measuring What Matters
Garbage context in, garbage output out. Here's how to score context quality and filter irrelevant or stale information before inference.
Jorgo Bardho
Founder, Thread Transfer
About 65% of developers using AI for refactoring and ~60% for testing say the assistant "misses relevant context." The #1 requested fix is "improved contextual understanding" (26% of all votes). These responses point to a deeper insight: hallucinations and quality issues often stem from poor contextual awareness. Context quality scoring is how production systems measure and fix this problem.
Why Context Quality Matters
Model quality degrades as context grows, a phenomenon researchers call "context rot." Even models grounded in reference data can hallucinate anywhere from 1% to nearly 30% of the time if the retrieval context is flawed. Context quality scoring gives you the metrics to detect, measure, and fix these issues before they reach users.
Impact on Business Metrics
| Context Quality | Hallucination Rate | User Satisfaction | Token Waste |
|---|---|---|---|
| Poor (<60%) | 20-30% | 45% | 60-80% |
| Fair (60-75%) | 10-15% | 65% | 40-50% |
| Good (75-90%) | 3-8% | 82% | 20-30% |
| Excellent (>90%) | <3% | 92% | <15% |
Source: Vectara 2025 RAG Hallucination Analysis, Qodo State of AI Code Quality 2025
Core Context Quality Metrics
1. Retrieval Precision
Definition: Percentage of retrieved context chunks that are actually relevant to the query.
precision = relevant_chunks_retrieved / total_chunks_retrieved
# Example: Retrieved 5 chunks, only 4 are relevant
precision = 4 / 5 = 0.80 (80%)
Target: >80% for production systems
How to Measure
async function measure_precision(query: string, retrieved_chunks: Chunk[]) {
const relevance_scores = await Promise.all(
retrieved_chunks.map(chunk =>
llm.generate({
system: "Rate relevance of this chunk to the query. Return only a number 0-10.",
prompt: `Query: ${query}\nChunk: ${chunk.text}`
})
)
)
  // llm.generate returns a string, so parse it before comparing
  const relevant_count = relevance_scores.filter(score => parseFloat(score) >= 7).length
return relevant_count / retrieved_chunks.length
}
2. Retrieval Recall
Definition: Percentage of all relevant chunks that were successfully retrieved.
recall = relevant_chunks_retrieved / total_relevant_chunks_in_db
# Example: 10 relevant chunks exist, retrieved 7
recall = 7 / 10 = 0.70 (70%)
Target: >85% for production systems
Measurement Strategy
Recall is harder to measure than precision because you need ground truth. Approaches:
- Human-labeled test sets: Curate 100-500 queries with known relevant documents
- Synthetic evaluation: Generate queries from documents, then check whether retrieval returns the source document (sketched after this list)
- User feedback: Track "not helpful" / "missing information" signals
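The synthetic approach is the cheapest way to get a recall signal without human labels. A minimal sketch, reusing the llm.generate and retrieve helpers assumed throughout this post; the Doc shape and doc_id field are hypothetical and should be adapted to your own index:

```typescript
// Synthetic recall: generate a question from each document, then check
// whether retrieval brings that source document back in the top-k.
// Doc, doc_id, and the retrieve/llm helpers are assumptions for illustration.
async function measure_recall_synthetic(docs: Doc[], top_k = 5) {
  let hits = 0
  for (const doc of docs) {
    // Ask the model for a question that only this document answers
    const question = await llm.generate({
      system: "Write one specific question that this document answers. Return only the question.",
      prompt: doc.text
    })
    const retrieved = await retrieve(question)
    const top = retrieved.slice(0, top_k)
    if (top.some(chunk => chunk.doc_id === doc.id)) hits++
  }
  return hits / docs.length  // recall@k proxy
}
```

Run it over a few hundred sampled documents to get a stable estimate; it understates recall for queries that legitimately need multiple documents.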
3. Context Utilization Rate
Definition: Percentage of injected context actually referenced in the model's output.
async function measure_utilization(context: string, output: string) {
const analysis = await llm.generate({
system: "Identify which parts of the context were used in generating the output. Return percentage.",
prompt: `Context: ${context}\n\nOutput: ${output}`
})
return parseFloat(analysis) / 100
}
// Target: >60% utilization
Interpretation: Low utilization (<40%) indicates context pollution: you're injecting irrelevant information that wastes tokens.
4. Faithfulness Score
Definition: Percentage of statements in the output that are grounded in the provided context.
async function measure_faithfulness(context: string, output: string) {
const statements = await extract_statements(output)
const grounded = await Promise.all(
statements.map(stmt =>
llm.generate({
system: "Is this statement supported by the context? Answer yes/no.",
prompt: `Context: ${context}\n\nStatement: ${stmt}`
})
)
)
  // Tolerate answers like "Yes." or "Yes, because…"
  const grounded_count = grounded.filter(r => r.trim().toLowerCase().startsWith("yes")).length
return grounded_count / statements.length
}
// Target: >95% for high-stakes applications
What it catches: Hallucinations, unsupported claims, invented facts.
5. Semantic Coherence
Definition: How well the retrieved chunks relate to each other and form a coherent narrative.
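This measure (and the embedding evaluation later) leans on a cosine_similarity helper that isn't defined anywhere in the post. A minimal sketch over plain number[] vectors:

```typescript
// Cosine similarity between two dense vectors: dot(a, b) / (|a| * |b|).
// Assumes both embeddings are number[] arrays of equal length.
function cosine_similarity(a: number[], b: number[]): number {
  let dot = 0, norm_a = 0, norm_b = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    norm_a += a[i] * a[i]
    norm_b += b[i] * b[i]
  }
  const denom = Math.sqrt(norm_a) * Math.sqrt(norm_b)
  return denom > 0 ? dot / denom : 0
}
```

With that in place, the coherence score is the average pairwise similarity across the retrieved chunks: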
async function measure_coherence(chunks: Chunk[]) {
// Compute pairwise similarity between chunks
const embeddings = await Promise.all(chunks.map(c => embed(c.text)))
let total_similarity = 0
let pairs = 0
for (let i = 0; i < embeddings.length; i++) {
for (let j = i + 1; j < embeddings.length; j++) {
total_similarity += cosine_similarity(embeddings[i], embeddings[j])
pairs++
}
}
return pairs > 0 ? total_similarity / pairs : 0
}
// Target: >0.6 for good coherence
6. Context Freshness
Definition: How recent the context is relative to when it's needed.
function measure_freshness(chunks: Chunk[], query_time: Date) {
const ages = chunks.map(chunk => {
const age_ms = query_time.getTime() - chunk.timestamp.getTime()
const age_days = age_ms / (1000 * 60 * 60 * 24)
return age_days
})
const avg_age = ages.reduce((a, b) => a + b, 0) / ages.length
// Score: 1.0 for same-day content, decays over time
return Math.max(0, 1 - (avg_age / 365))
}
// Target: >0.7 for time-sensitive applications
Advanced Metrics for RAG Systems
7. Answer Relevance
Measures whether the generated answer actually addresses the query, regardless of context quality.
async function measure_answer_relevance(query: string, answer: string) {
const score = await llm.generate({
system: "Rate how well this answer addresses the query. Return 0-10.",
prompt: `Query: ${query}\n\nAnswer: ${answer}`
})
return parseFloat(score) / 10
}
// Target: >0.85
8. Context Sufficiency
Measures whether the retrieved context contains enough information to answer the query.
async function measure_sufficiency(query: string, context: string) {
const result = await llm.generate({
system: "Can this query be fully answered using only the provided context? Answer yes/no and explain.",
prompt: `Query: ${query}\n\nContext: ${context}`
})
  // The prompt asks for yes/no first, then an explanation
  return result.trim().toLowerCase().startsWith("yes") ? 1.0 : 0.0
}
// Target: >90%
9. Chunk Boundary Quality
Measures how well chunks respect semantic boundaries (sentences, paragraphs, topics).
function measure_boundary_quality(chunks: Chunk[]) {
let good_boundaries = 0
for (const chunk of chunks) {
const text = chunk.text.trim()
// Check if chunk ends at sentence boundary
const ends_with_period = /[.!?]$/.test(text)
// Check if chunk doesn't split mid-word
const no_truncation = !text.endsWith("-") && !text.endsWith("...")
// Check if chunk is semantically complete
const complete = ends_with_period && no_truncation
if (complete) good_boundaries++
}
return good_boundaries / chunks.length
}
// Target: >85%
AI Observability Framework
AI observability is the continuous practice of tracing AI workflows end to end, evaluating quality online and offline, routing ambiguous cases to human review, and alerting on user-impacting issues.
Monitoring Stack
class ContextQualityMonitor {
async track_query(query: string, context: Chunk[], output: string) {
const metrics = {
timestamp: Date.now(),
query_id: generate_id(),
// Retrieval metrics
precision: await this.measure_precision(query, context),
coherence: await this.measure_coherence(context),
freshness: this.measure_freshness(context, new Date()),
// Generation metrics
utilization: await this.measure_utilization(context, output),
faithfulness: await this.measure_faithfulness(context, output),
answer_relevance: await this.measure_answer_relevance(query, output),
// Metadata
num_chunks: context.length,
total_tokens: count_tokens(context),
latency_ms: this.latency
}
await this.log_metrics(metrics)
// Alert on quality issues
if (metrics.faithfulness < 0.8) {
await this.alert("Low faithfulness score", metrics)
}
return metrics
}
}
Dashboards
Track these aggregates over time:
| Metric | Aggregation | Alert Threshold |
|---|---|---|
| Precision | P50, P95, P99 | P50 < 0.75 |
| Faithfulness | Mean, Min | Mean < 0.90 |
| Utilization | Mean | Mean < 0.50 |
| Latency | P95 | P95 > 3000ms |
| Cost per query | Mean, P95 | Mean > $0.10 |
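For dashboards it also helps to roll the per-query metrics into a single score that maps onto the Poor/Fair/Good/Excellent bands from the first table. A minimal sketch; the weights here are illustrative assumptions, not a standard:

```typescript
// Collapse per-query metrics (each 0-1) into a single 0-100 quality score.
// Weights are illustrative; tune them against your own failure modes.
interface QueryQualityInputs {
  precision: number
  faithfulness: number
  utilization: number
  freshness: number
}

function context_quality_score(m: QueryQualityInputs): number {
  const score =
    0.35 * m.precision +
    0.35 * m.faithfulness +
    0.20 * m.utilization +
    0.10 * m.freshness
  return Math.round(score * 100)
}

// Map the score onto the bands from the business-impact table above
function quality_band(score: number): string {
  if (score > 90) return "Excellent"
  if (score >= 75) return "Good"
  if (score >= 60) return "Fair"
  return "Poor"
}
```

Weighting precision and faithfulness highest reflects that irrelevant or ungrounded context is what users actually notice; adjust if freshness dominates your domain.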
Automated Context Quality Testing
Synthetic Evaluation Pipeline
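The pipeline below assumes a test case shape roughly like this (field names are inferred from how the code uses them; adapt to your own test-set format):

```typescript
// Shape assumed by run_evaluation below.
interface TestCase {
  query: string
  expected_chunks: string[]   // IDs (or texts) of the chunks that should be retrieved
  expected_answer: string     // reference answer used for correctness scoring
}
```

The evaluation loop then retrieves, generates, and scores each case: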
async function run_evaluation(test_set: TestCase[]) {
const results = []
for (const test of test_set) {
// Retrieve context
const context = await retrieve(test.query)
// Generate answer
const answer = await generate(test.query, context)
// Score quality
const metrics = {
precision: await measure_precision(test.query, context),
recall: calculate_recall(context, test.expected_chunks),
faithfulness: await measure_faithfulness(context, answer),
answer_correctness: compare_answers(answer, test.expected_answer)
}
results.push({ test, metrics })
}
return aggregate_metrics(results)
}
// Run daily or on every deployment
const baseline = await run_evaluation(test_set)
console.log("Precision:", baseline.precision) // 0.83
console.log("Recall:", baseline.recall) // 0.87
console.log("Faithfulness:", baseline.faithfulness) // 0.94Continuous Evaluation
- Pre-deployment: Run evaluation suite, block if metrics drop >5% (see the gate sketch after this list)
- Canary testing: Route 5% of traffic to new version, compare metrics
- A/B testing: Split traffic 50/50, measure quality difference
- Shadow mode: Run new system in parallel, compare outputs
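A minimal sketch of the pre-deployment gate, reusing run_evaluation and comparing against a stored baseline. The metric names and the 5% threshold follow the list above; the baseline format is an assumption:

```typescript
// Block a deploy if any tracked metric regresses more than 5% vs. baseline.
async function deployment_gate(test_set: TestCase[], baseline: Record<string, number>) {
  const current = await run_evaluation(test_set) as Record<string, number>
  const regressions: string[] = []
  for (const metric of ["precision", "recall", "faithfulness"]) {
    const drop = (baseline[metric] - current[metric]) / baseline[metric]
    if (drop > 0.05) {
      regressions.push(`${metric} dropped ${(drop * 100).toFixed(1)}% vs. baseline`)
    }
  }
  if (regressions.length > 0) {
    throw new Error(`Blocking deploy: ${regressions.join("; ")}`)
  }
  return current  // becomes the new baseline after release
}
```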
Debugging Poor Context Quality
Diagnosis Framework
| Symptom | Likely Cause | Fix |
|---|---|---|
| Low precision (<70%) | Poor retrieval ranking | Improve embedding model, add reranker, tune top-k |
| Low recall (<80%) | Missing chunks, poor chunking | Optimize chunk size/overlap, check indexing coverage |
| Low utilization (<50%) | Too much irrelevant context | Reduce top-k, improve filtering, add metadata filters |
| Low faithfulness (<90%) | Model hallucinating | Stronger grounding instructions, add citations, use RAG |
| Low coherence (<0.5) | Retrieving unrelated chunks | Improve query embedding, add query expansion, use MMR |
| High latency (>3s) | Too many chunks, slow embedding | Reduce top-k, use faster embedding model, add caching |
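The MMR fix in the coherence row refers to maximal marginal relevance: re-ranking candidates to balance relevance to the query against redundancy with chunks already selected. A minimal sketch, assuming candidate embeddings are precomputed and reusing the cosine_similarity helper from earlier:

```typescript
// MMR re-ranking: greedily pick chunks that score high on query relevance
// but low on similarity to chunks already chosen.
// lambda = 1.0 is pure relevance; lower values favor diversity.
function mmr_rerank(
  query_emb: number[],
  candidates: { chunk: Chunk; emb: number[] }[],
  k: number,
  lambda = 0.7
): Chunk[] {
  const selected: { chunk: Chunk; emb: number[] }[] = []
  const remaining = [...candidates]
  while (selected.length < k && remaining.length > 0) {
    let best_idx = 0
    let best_score = -Infinity
    for (let i = 0; i < remaining.length; i++) {
      const relevance = cosine_similarity(query_emb, remaining[i].emb)
      const redundancy = selected.length > 0
        ? Math.max(...selected.map(s => cosine_similarity(remaining[i].emb, s.emb)))
        : 0
      const score = lambda * relevance - (1 - lambda) * redundancy
      if (score > best_score) {
        best_score = score
        best_idx = i
      }
    }
    selected.push(remaining.splice(best_idx, 1)[0])
  }
  return selected.map(s => s.chunk)
}
```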
Investigation Workflow
- Identify failing queries: Find queries with quality scores below threshold
- Inspect retrieved context: Manually review chunks - are they relevant?
- Check expected chunks: Were the right chunks in the database?
- Analyze embeddings: Visualize query and chunk embeddings in 2D space
- Review output: Identify specific hallucinations or errors
- Test fixes: Try different retrieval strategies, chunking, prompts
- Re-evaluate: Measure metrics again to confirm improvement
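Step 1 is trivial if you keep the metrics objects logged by ContextQualityMonitor.track_query. A minimal filter sketch; the thresholds are illustrative:

```typescript
// Pull the worst-scoring logged queries for manual inspection.
function find_failing_queries(
  logged_metrics: { query_id: string; precision: number; faithfulness: number; utilization: number }[],
  limit = 50
) {
  return logged_metrics
    .filter(m => m.faithfulness < 0.85 || m.precision < 0.70 || m.utilization < 0.40)
    .sort((a, b) => a.faithfulness - b.faithfulness)  // worst faithfulness first
    .slice(0, limit)
}
```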
Embedding Quality
When embeddings fail to capture the semantic meaning of the source data, the model receives the wrong context no matter how well your vector database or generation model performs. That makes embedding quality a mission-critical priority in 2025.
Measuring Embedding Quality
async function evaluate_embeddings(test_pairs: SimilarityPair[]) {
const results = []
for (const pair of test_pairs) {
const emb1 = await embed(pair.text1)
const emb2 = await embed(pair.text2)
const similarity = cosine_similarity(emb1, emb2)
const expected = pair.should_be_similar ? 1 : 0
results.push({
predicted: similarity,
expected: expected,
error: Math.abs(similarity - expected)
})
}
return {
mean_error: mean(results.map(r => r.error)),
accuracy: results.filter(r => r.error < 0.3).length / results.length
}
}
// Test with known similar/dissimilar pairs
const test_pairs = [
{ text1: "cat", text2: "kitten", should_be_similar: true },
{ text1: "cat", text2: "database", should_be_similar: false }
]
const quality = await evaluate_embeddings(test_pairs)
// Target: accuracy >85%, mean_error <0.2
Human-in-the-Loop Quality Assurance
Route ambiguous cases to human review to build ground truth datasets and catch edge cases.
When to Flag for Review
- Faithfulness score < 0.85
- Utilization < 0.40 (too much context waste)
- User explicitly reports issue ("not helpful", "incorrect")
- Conflicting information in retrieved chunks
- High-stakes queries (legal, medical, financial)
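These triggers can be encoded as one predicate that returns a reason string (or null) to feed into flag_for_review below. Field names follow the monitoring code earlier; the conflicting-chunks and high-stakes signals are assumed to come from your own detectors:

```typescript
// Decide whether a query needs human review and why.
// Thresholds mirror the list above; signal detection is out of scope here.
function should_flag_for_review(
  metrics: { faithfulness: number; utilization: number },
  signals: { user_reported_issue: boolean; conflicting_chunks: boolean; high_stakes: boolean }
): string | null {
  if (metrics.faithfulness < 0.85) return "low_faithfulness"
  if (metrics.utilization < 0.40) return "low_utilization"
  if (signals.user_reported_issue) return "user_reported_issue"
  if (signals.conflicting_chunks) return "conflicting_context"
  if (signals.high_stakes) return "high_stakes_domain"
  return null  // no review needed
}
```

If it returns a reason, pass it straight to flag_for_review in the workflow below.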
Review Workflow
async function flag_for_review(query_id: string, reason: string) {
await review_queue.add({
query_id,
reason,
priority: calculate_priority(reason),
assigned_to: null
})
}
// Reviewer dashboard shows:
// - Original query
// - Retrieved context chunks
// - Generated answer
// - Quality metrics
// - Ask: "Is this answer correct and helpful? yes/no"
const feedback = await get_reviewer_feedback(query_id)
if (!feedback.approved) {
// Update training data, retrain retrieval model
await update_training_set(query_id, feedback)
}
Thread Transfer's Quality Scoring
Thread Transfer bundles are pre-scored for context quality before delivery:
Bundle Quality Metrics
- Completeness: All referenced messages, links, and files included
- Coherence: Conversation flow preserved, topics clearly identified
- Decision extraction: Key decisions and rationale captured
- Stakeholder identification: All participants and their roles noted
- Action item accuracy: Todos, owners, and deadlines extracted
- Token efficiency: 40-80% reduction vs. raw thread, no information loss
Bundles with quality scores below threshold are flagged for human review before delivery, ensuring every bundle meets production standards.
Production Checklist
- Define target metrics: Precision, recall, faithfulness, utilization
- Build test set: 100-500 queries with expected outputs
- Instrument tracking: Log all queries with context and outputs
- Set up dashboards: Real-time monitoring of quality metrics
- Configure alerts: Trigger on drops below threshold
- Run daily evals: Automated testing against baseline
- Human review queue: Flag edge cases for expert feedback
- Continuous improvement: Use feedback to retrain and refine
The Bottom Line
You can't improve what you don't measure. Context quality scoring transforms "AI is unreliable" into actionable metrics you can track, debug, and optimize. Systems that measure context quality reduce hallucinations by 70-90%, cut token costs by 40-60%, and improve user satisfaction by 30-50%.
Start with the core metrics: precision, recall, faithfulness, utilization. Instrument your system to track them on every query. Set up alerts for regressions. Build a test set and run evals daily. Add human review for edge cases.
The goal: measurable, improvable, production-grade context quality.
Learn more: How it works · Why bundles beat raw thread history