

Context Quality Scoring: Measuring What Matters

Garbage context in, garbage output out. Here's how to score context quality and filter irrelevant or stale information before inference.

Jorgo Bardho

Founder, Thread Transfer

July 15, 2025 · 13 min read
context quality · metrics · evaluation · RAG
[Figure: Context quality scoring framework]

About 65% of developers using AI for refactoring, and roughly 60% of those using it for testing, say the assistant "misses relevant context." The #1 requested fix is "improved contextual understanding" (26% of all votes). These responses point to a deeper issue: hallucinations and quality problems often stem from poor contextual awareness. Context quality scoring is how production systems measure and fix this.

Why Context Quality Matters

Model quality degrades as context grows, a phenomenon researchers call "context rot." Even models grounded in reference data can hallucinate anywhere from 1% to nearly 30% of the time if the retrieval context is flawed. Context quality scoring gives you the metrics to detect, measure, and fix these issues before they reach users.

Impact on Business Metrics

Context Quality | Hallucination Rate | User Satisfaction | Token Waste
Poor (<60%) | 20-30% | 45% | 60-80%
Fair (60-75%) | 10-15% | 65% | 40-50%
Good (75-90%) | 3-8% | 82% | 20-30%
Excellent (>90%) | <3% | 92% | <15%

Source: Vectara 2025 RAG Hallucination Analysis, Qodo State of AI Code Quality 2025

Core Context Quality Metrics

1. Retrieval Precision

Definition: Percentage of retrieved context chunks that are actually relevant to the query.

precision = relevant_chunks_retrieved / total_chunks_retrieved

# Example: Retrieved 5 chunks, only 4 are relevant
precision = 4 / 5 = 0.80 (80%)

Target: >80% for production systems

How to Measure

async function measure_precision(query: string, retrieved_chunks: Chunk[]) {
  // Ask an LLM judge to rate each chunk's relevance to the query on a 0-10 scale
  const relevance_scores = await Promise.all(
    retrieved_chunks.map(chunk =>
      llm.generate({
        system: "Rate relevance of this chunk to the query. Return only a number 0-10.",
        prompt: `Query: ${query}\nChunk: ${chunk.text}`
      })
    )
  )

  // The judge returns text, so parse before comparing; a score of 7 or higher counts as relevant
  const relevant_count = relevance_scores.filter(score => parseFloat(score) >= 7).length
  return relevant_count / retrieved_chunks.length
}

2. Retrieval Recall

Definition: Percentage of all relevant chunks that were successfully retrieved.

recall = relevant_chunks_retrieved / total_relevant_chunks_in_db

# Example: 10 relevant chunks exist, retrieved 7
recall = 7 / 10 = 0.70 (70%)

Target: >85% for production systems

Measurement Strategy

Recall is harder to measure than precision because you need ground truth. Approaches:

  • Human-labeled test sets: Curate 100-500 queries with known relevant documents
  • Synthetic evaluation: Generate queries from documents, measure whether retrieval returns the source document (sketched after this list)
  • User feedback: Track "not helpful" / "missing information" signals
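
For the synthetic approach, a minimal sketch in the style of the other snippets. The Document type, the generate_query_from helper, the retrieve(query, top_k) signature, and the doc_id field on chunks are assumptions for illustration, not a specific library's API.

async function measure_recall_synthetic(documents: Document[], top_k: number) {
  let hits = 0

  for (const doc of documents) {
    // Generate a query this document should be able to answer (hypothetical helper)
    const query = await generate_query_from(doc)

    // Did retrieval bring back at least one chunk from the source document?
    const retrieved = await retrieve(query, top_k)
    if (retrieved.some(chunk => chunk.doc_id === doc.id)) hits++
  }

  // Fraction of synthetic queries whose source document was retrieved (a recall@k proxy)
  return hits / documents.length
}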

3. Context Utilization Rate

Definition: Percentage of injected context actually referenced in the model's output.

async function measure_utilization(context: string, output: string) {
  const analysis = await llm.generate({
    system: "Identify which parts of the context were used in generating the output. Return percentage.",
    prompt: `Context: ${context}\n\nOutput: ${output}`
  })

  return parseFloat(analysis) / 100
}

// Target: >60% utilization

Interpretation: Low utilization (<40%) indicates context pollution—you're injecting irrelevant information that wastes tokens.

4. Faithfulness Score

Definition: Percentage of statements in the output that are grounded in the provided context.

async function measure_faithfulness(context: string, output: string) {
  const statements = await extract_statements(output)

  const grounded = await Promise.all(
    statements.map(stmt =>
      llm.generate({
        system: "Is this statement supported by the context? Answer yes/no.",
        prompt: `Context: ${context}\n\nStatement: ${stmt}`
      })
    )
  )

  const grounded_count = grounded.filter(r => r.trim().toLowerCase().startsWith("yes")).length
  return grounded_count / statements.length
}

// Target: >95% for high-stakes applications

What it catches: Hallucinations, unsupported claims, invented facts.

5. Semantic Coherence

Definition: How well the retrieved chunks relate to each other and form a coherent narrative.

async function measure_coherence(chunks: Chunk[]) {
  // Compute pairwise similarity between chunks
  const embeddings = await Promise.all(chunks.map(c => embed(c.text)))

  let total_similarity = 0
  let pairs = 0

  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      total_similarity += cosine_similarity(embeddings[i], embeddings[j])
      pairs++
    }
  }

  return pairs > 0 ? total_similarity / pairs : 0
}

// Target: >0.6 for good coherence

6. Context Freshness

Definition: How recent the context is relative to when it's needed.

function measure_freshness(chunks: Chunk[], query_time: Date) {
  const ages = chunks.map(chunk => {
    const age_ms = query_time.getTime() - chunk.timestamp.getTime()
    const age_days = age_ms / (1000 * 60 * 60 * 24)
    return age_days
  })

  const avg_age = ages.reduce((a, b) => a + b, 0) / ages.length

  // Score: 1.0 for same-day content, decays over time
  return Math.max(0, 1 - (avg_age / 365))
}

// Target: >0.7 for time-sensitive applications

Advanced Metrics for RAG Systems

7. Answer Relevance

Measures whether the generated answer actually addresses the query, regardless of context quality.

async function measure_answer_relevance(query: string, answer: string) {
  const score = await llm.generate({
    system: "Rate how well this answer addresses the query. Return 0-10.",
    prompt: `Query: ${query}\n\nAnswer: ${answer}`
  })

  return parseFloat(score) / 10
}

// Target: >0.85

8. Context Sufficiency

Measures whether the retrieved context contains enough information to answer the query.

async function measure_sufficiency(query: string, context: string) {
  const result = await llm.generate({
    system: "Can this query be fully answered using only the provided context? Answer yes/no and explain.",
    prompt: `Query: ${query}\n\nContext: ${context}`
  })

  return result.toLowerCase().includes("yes") ? 1.0 : 0.0
}

// Target: >90%

9. Chunk Boundary Quality

Measures how well chunks respect semantic boundaries (sentences, paragraphs, topics).

function measure_boundary_quality(chunks: Chunk[]) {
  let good_boundaries = 0

  for (const chunk of chunks) {
    const text = chunk.text.trim()

    // Check if chunk ends at sentence boundary
    const ends_with_period = /[.!?]$/.test(text)

    // Check if chunk doesn't split mid-word
    const no_truncation = !text.endsWith("-") && !text.endsWith("...")

    // Check if chunk is semantically complete
    const complete = ends_with_period && no_truncation

    if (complete) good_boundaries++
  }

  return good_boundaries / chunks.length
}

// Target: >85%

AI Observability Framework

AI observability is the continuous practice of tracing AI workflows end to end, evaluating quality online and offline, routing ambiguous cases to human review, and alerting on user-impacting issues.

Monitoring Stack

class ContextQualityMonitor {
  async track_query(query: string, context: Chunk[], output: string) {
    const metrics = {
      timestamp: Date.now(),
      query_id: generate_id(),

      // Retrieval metrics
      precision: await this.measure_precision(query, context),
      coherence: await this.measure_coherence(context),
      freshness: this.measure_freshness(context, new Date()),

      // Generation metrics
      utilization: await this.measure_utilization(context, output),
      faithfulness: await this.measure_faithfulness(context, output),
      answer_relevance: await this.measure_answer_relevance(query, output),

      // Metadata
      num_chunks: context.length,
      total_tokens: count_tokens(context),
      latency_ms: this.latency
    }

    await this.log_metrics(metrics)

    // Alert on quality issues
    if (metrics.faithfulness < 0.8) {
      await this.alert("Low faithfulness score", metrics)
    }

    return metrics
  }
}

Dashboards

Track these aggregates over time:

Metric | Aggregation | Alert Threshold
Precision | P50, P95, P99 | P50 < 0.75
Faithfulness | Mean, Min | Mean < 0.90
Utilization | Mean | Mean < 0.50
Latency | P95 | P95 > 3000ms
Cost per query | Mean, P95 | Mean > $0.10
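
As a concrete illustration of these thresholds, here is a sketch that computes the aggregates from logged records and returns any alerts to fire. The MetricRecord shape and the percentile helper are assumptions, not part of the monitor above.

interface MetricRecord {
  precision: number
  faithfulness: number
  utilization: number
  latency_ms: number
  cost_usd: number
}

function percentile(values: number[], p: number) {
  // Simple nearest-rank percentile over a sorted copy
  const sorted = [...values].sort((a, b) => a - b)
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length))
  return sorted[idx]
}

function check_alerts(records: MetricRecord[]) {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length

  const alerts: string[] = []
  if (percentile(records.map(r => r.precision), 50) < 0.75) alerts.push("Precision P50 below 0.75")
  if (mean(records.map(r => r.faithfulness)) < 0.90) alerts.push("Mean faithfulness below 0.90")
  if (mean(records.map(r => r.utilization)) < 0.50) alerts.push("Mean utilization below 0.50")
  if (percentile(records.map(r => r.latency_ms), 95) > 3000) alerts.push("P95 latency above 3000ms")
  if (mean(records.map(r => r.cost_usd)) > 0.10) alerts.push("Mean cost per query above $0.10")

  return alerts
}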

Automated Context Quality Testing

Synthetic Evaluation Pipeline

async function run_evaluation(test_set: TestCase[]) {
  const results = []

  for (const test of test_set) {
    // Retrieve context
    const context = await retrieve(test.query)

    // Generate answer
    const answer = await generate(test.query, context)

    // Score quality
    const metrics = {
      precision: await measure_precision(test.query, context),
      recall: calculate_recall(context, test.expected_chunks),
      faithfulness: await measure_faithfulness(context, answer),
      answer_correctness: compare_answers(answer, test.expected_answer)
    }

    results.push({ test, metrics })
  }

  return aggregate_metrics(results)
}

// Run daily or on every deployment
const baseline = await run_evaluation(test_set)
console.log("Precision:", baseline.precision)  // 0.83
console.log("Recall:", baseline.recall)        // 0.87
console.log("Faithfulness:", baseline.faithfulness)  // 0.94

Continuous Evaluation

  • Pre-deployment: Run evaluation suite, block if metrics drop >5% (see the gate sketch after this list)
  • Canary testing: Route 5% of traffic to new version, compare metrics
  • A/B testing: Split traffic 50/50, measure quality difference
  • Shadow mode: Run new system in parallel, compare outputs
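
A minimal sketch of the pre-deployment gate, reusing run_evaluation from above and assuming aggregate_metrics returns a flat object of metric name to score, with the previous run stored as baseline:

async function deployment_gate(test_set: TestCase[], baseline: Record<string, number>) {
  const candidate = await run_evaluation(test_set) as Record<string, number>

  // Block the deploy if any tracked metric drops more than 5% relative to the baseline
  const regressions = Object.keys(baseline).filter(metric => {
    const drop = (baseline[metric] - candidate[metric]) / baseline[metric]
    return drop > 0.05
  })

  if (regressions.length > 0) {
    throw new Error(`Deployment blocked, regressed metrics: ${regressions.join(", ")}`)
  }

  return candidate  // becomes the new baseline after a successful deploy
}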

Debugging Poor Context Quality

Diagnosis Framework

Symptom | Likely Cause | Fix
Low precision (<70%) | Poor retrieval ranking | Improve embedding model, add reranker, tune top-k
Low recall (<80%) | Missing chunks, poor chunking | Optimize chunk size/overlap, check indexing coverage
Low utilization (<50%) | Too much irrelevant context | Reduce top-k, improve filtering, add metadata filters
Low faithfulness (<90%) | Model hallucinating | Stronger grounding instructions, add citations, use RAG
Low coherence (<0.5) | Retrieving unrelated chunks | Improve query embedding, add query expansion, use MMR
High latency (>3s) | Too many chunks, slow embedding | Reduce top-k, use faster embedding model, add caching

Investigation Workflow

  1. Identify failing queries: Find queries with quality scores below threshold (a sketch follows this list)
  2. Inspect retrieved context: Manually review chunks - are they relevant?
  3. Check expected chunks: Were the right chunks in the database?
  4. Analyze embeddings: Visualize query and chunk embeddings in 2D space
  5. Review output: Identify specific hallucinations or errors
  6. Test fixes: Try different retrieval strategies, chunking, prompts
  7. Re-evaluate: Measure metrics again to confirm improvement
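
For step 1, a sketch that pulls low-scoring queries out of the metrics log. The LoggedQuery shape mirrors fields tracked by ContextQualityMonitor, and the thresholds reuse the dashboard alert levels.

interface LoggedQuery {
  query_id: string
  precision: number
  faithfulness: number
  utilization: number
}

function find_failing_queries(logs: LoggedQuery[]) {
  // Surface queries that miss any core target, worst faithfulness first
  return logs
    .filter(q => q.precision < 0.75 || q.faithfulness < 0.90 || q.utilization < 0.50)
    .sort((a, b) => a.faithfulness - b.faithfulness)
}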

Embedding Quality

When embeddings fail to capture the semantic meaning of the source data, the model receives the wrong context no matter how well the vector database or the generation model performs. That is why embedding quality is becoming a mission-critical priority in 2025.

Measuring Embedding Quality

async function evaluate_embeddings(test_pairs: SimilarityPair[]) {
  const results = []

  for (const pair of test_pairs) {
    const emb1 = await embed(pair.text1)
    const emb2 = await embed(pair.text2)

    const similarity = cosine_similarity(emb1, emb2)
    const expected = pair.should_be_similar ? 1 : 0

    results.push({
      predicted: similarity,
      expected: expected,
      error: Math.abs(similarity - expected)
    })
  }

  return {
    mean_error: mean(results.map(r => r.error)),
    accuracy: results.filter(r => r.error < 0.3).length / results.length
  }
}

// Test with known similar/dissimilar pairs
const test_pairs = [
  { text1: "cat", text2: "kitten", should_be_similar: true },
  { text1: "cat", text2: "database", should_be_similar: false }
]

const quality = await evaluate_embeddings(test_pairs)
// Target: accuracy >85%, mean_error <0.2

Human-in-the-Loop Quality Assurance

Route ambiguous cases to human review to build ground truth datasets and catch edge cases.

When to Flag for Review

  • Faithfulness score < 0.85
  • Utilization < 0.40 (too much context waste)
  • User explicitly reports issue ("not helpful", "incorrect")
  • Conflicting information in retrieved chunks
  • High-stakes queries (legal, medical, financial)
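
These criteria translate into a small gate in front of the review queue. In the sketch below, the QueryMetrics shape, the has_conflicting_chunks detection, and the user_reported_issue / high_stakes flags are assumed inputs rather than fields the monitor above already produces.

interface QueryMetrics {
  faithfulness: number
  utilization: number
  has_conflicting_chunks: boolean
}

function review_reasons(metrics: QueryMetrics, user_reported_issue: boolean, high_stakes: boolean) {
  const reasons: string[] = []

  if (metrics.faithfulness < 0.85) reasons.push("low_faithfulness")
  if (metrics.utilization < 0.40) reasons.push("low_utilization")
  if (user_reported_issue) reasons.push("user_reported_issue")
  if (metrics.has_conflicting_chunks) reasons.push("conflicting_context")
  if (high_stakes) reasons.push("high_stakes_domain")

  // Any non-empty result is passed, reason by reason, to flag_for_review below
  return reasons
}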

Review Workflow

async function flag_for_review(query_id: string, reason: string) {
  await review_queue.add({
    query_id,
    reason,
    priority: calculate_priority(reason),
    assigned_to: null
  })
}

// Reviewer dashboard shows:
// - Original query
// - Retrieved context chunks
// - Generated answer
// - Quality metrics
// - Ask: "Is this answer correct and helpful? yes/no"

const feedback = await get_reviewer_feedback(query_id)

if (!feedback.approved) {
  // Update training data, retrain retrieval model
  await update_training_set(query_id, feedback)
}

Thread Transfer's Quality Scoring

Thread Transfer bundles are pre-scored for context quality before delivery:

Bundle Quality Metrics

  • Completeness: All referenced messages, links, and files included
  • Coherence: Conversation flow preserved, topics clearly identified
  • Decision extraction: Key decisions and rationale captured
  • Stakeholder identification: All participants and their roles noted
  • Action item accuracy: Todos, owners, and deadlines extracted
  • Token efficiency: 40-80% reduction vs. raw thread, no information loss

Bundles with quality scores below threshold are flagged for human review before delivery, ensuring every bundle meets production standards.
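
Thread Transfer's exact scoring formula isn't published here. Purely as an illustration, a threshold gate over dimensions like those above could look like the following sketch; the weights, field names, and threshold are made up for the example.

// Illustrative only: not Thread Transfer's published scoring formula
interface BundleScores {
  completeness: number
  coherence: number
  decision_extraction: number
  action_item_accuracy: number
}

function bundle_quality_gate(scores: BundleScores, threshold = 0.85) {
  const weights = { completeness: 0.3, coherence: 0.25, decision_extraction: 0.25, action_item_accuracy: 0.2 }

  const composite =
    scores.completeness * weights.completeness +
    scores.coherence * weights.coherence +
    scores.decision_extraction * weights.decision_extraction +
    scores.action_item_accuracy * weights.action_item_accuracy

  // Below threshold: route the bundle to human review before delivery
  return { composite, needs_review: composite < threshold }
}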

Production Checklist

  1. Define target metrics: Precision, recall, faithfulness, utilization
  2. Build test set: 100-500 queries with expected outputs
  3. Instrument tracking: Log all queries with context and outputs
  4. Set up dashboards: Real-time monitoring of quality metrics
  5. Configure alerts: Trigger on drops below threshold
  6. Run daily evals: Automated testing against baseline
  7. Human review queue: Flag edge cases for expert feedback
  8. Continuous improvement: Use feedback to retrain and refine

The Bottom Line

You can't improve what you don't measure. Context quality scoring transforms "AI is unreliable" into actionable metrics you can track, debug, and optimize. Systems that measure and act on context quality can reduce hallucinations by 70-90%, cut token costs by 40-60%, and improve user satisfaction by 30-50%.

Start with the core metrics: precision, recall, faithfulness, utilization. Instrument your system to track them on every query. Set up alerts for regressions. Build a test set and run evals daily. Add human review for edge cases.

The goal: measurable, improvable, production-grade context quality.