

AI Regression Testing: Preventing Model Degradation

Every model update is a potential regression. Every prompt change risks breaking edge cases. Here's the regression testing playbook for AI systems.

Jorgo Bardho

Founder, Thread Transfer

August 17, 2025 · 17 min read
AI testing · regression testing · model updates · CI/CD

You updated your LLM from GPT-4 to GPT-4.1, expecting better performance. Instead, your customer support bot started hallucinating product details that were previously accurate. Your RAG system that scored 92% accuracy last month now sits at 78%. A prompt tweak meant to improve formatting broke the entire intent classification pipeline. Welcome to AI regression testing—where traditional software testing patterns break down and teams need new frameworks to catch failures before users do.

This guide covers regression testing strategies specifically designed for AI systems: golden datasets, snapshot testing, pytest patterns, continuous evaluation pipelines, and how to catch model degradation before it hits production.

1. Why Traditional Regression Testing Fails for AI

In traditional software, regression testing is straightforward: run the same inputs, expect the same outputs. Change a function, verify existing behavior is preserved. This deterministic model collapses when applied to AI systems.

Non-Deterministic Outputs

LLMs with temperature > 0 produce different outputs on identical inputs. Exact string matching doesn't work. You need semantic equivalence checks, scoring rubrics, and tolerance bands that account for acceptable variance.

Emergent Behavior Changes

Upgrading a model version isn't like bumping a library dependency. Models can exhibit completely different reasoning patterns, verbosity levels, or failure modes despite identical prompts. Performance on one task can improve while another regresses—often unpredictably.

Data Drift Over Time

User queries evolve. Language patterns shift. What constituted "normal" input six months ago may not represent production traffic today. Static test suites become stale unless continuously refreshed with real-world samples.

Compounding Dependencies

RAG systems depend on retrieval quality, embeddings, reranking, and generation. A change in any component can cascade through the pipeline. Traditional unit testing misses these integration effects.

2. Golden Datasets: The Foundation of AI Regression Testing

Golden datasets are curated collections of input-output pairs that represent expected system behavior. They serve as regression test suites, capturing both common cases and critical edge cases.

Building Your Golden Dataset

Start with production samples that cover:

  • Common cases: The 80% of queries your system handles daily
  • Edge cases: Ambiguous inputs, boundary conditions, adversarial examples
  • Historical regressions: Inputs that previously caused failures
  • Domain-specific scenarios: Critical workflows unique to your application

Aim for 200-1000 examples initially, stratified across input types, complexity levels, and expected output categories. Quality matters more than quantity—each example should have clear evaluation criteria.
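As a concrete starting point, here is a minimal sketch of storing and loading a stratified golden set; the JSONL layout and the field names (category, expected, source) are assumptions, not a required schema:

import json
from collections import Counter

def load_golden_set(path="golden_set.jsonl"):
    """Load golden examples and report how they are stratified by category."""
    with open(path) as f:
        examples = [json.loads(line) for line in f]

    # Each record: {"input": ..., "expected": ..., "category": ..., "source": ...}
    print(Counter(ex["category"] for ex in examples))
    return examples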

Annotating Golden Outputs

For each input, define what constitutes a "passing" output. This varies by task:

  • Classification: Exact label match (e.g., "spam" vs "not spam")
  • Generation: Reference answer with semantic similarity threshold (e.g., cosine similarity > 0.85)
  • Extraction: Set of required entities or fields that must be present
  • RAG: Combination of factual correctness, citation accuracy, and response relevance
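One way to encode these per-task criteria is a small dispatch function that every regression test can call; this is a sketch that assumes each golden record carries a task field, and the similarity threshold is only illustrative:

from sentence_transformers import SentenceTransformer, util

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def passes(example, output):
    """Return True if the model output meets the example's pass criterion."""
    task = example["task"]
    if task == "classification":
        return output == example["expected"]                 # exact label match
    if task == "extraction":
        # output assumed to be a dict of extracted fields
        return set(example["required_fields"]) <= set(output.keys())
    if task == "generation":
        ref = _embedder.encode(example["expected"], convert_to_tensor=True)
        hyp = _embedder.encode(output, convert_to_tensor=True)
        return util.cos_sim(ref, hyp).item() > 0.85          # semantic similarity
    raise ValueError(f"Unknown task type: {task}")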

Maintaining Golden Datasets

Datasets degrade over time. Establish a refresh cadence:

  • Monthly: Sample 50-100 recent production queries, have experts label them, add to golden set
  • After failures: Every production bug becomes a regression test case
  • Model updates: Re-validate entire golden set when upgrading models, update reference answers as needed
  • Pruning: Remove or archive cases that no longer reflect production usage
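A lightweight way to enforce the "every production bug becomes a test case" rule is a helper that appends failures straight into the golden set; the path and field names here are hypothetical:

import json
from datetime import date

def add_regression_case(input_text, corrected_output, category,
                        path="golden_set.jsonl"):
    """Append a production failure to the golden set as a regression case."""
    record = {
        "input": input_text,
        "expected": corrected_output,
        "category": category,
        "source": "production_bug",
        "added": date.today().isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")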

3. Snapshot Testing for LLM Outputs

Snapshot testing captures current outputs as baselines. Future runs compare new outputs against snapshots, flagging changes for human review. This catches unintended regressions without manually specifying every expected output.

How Snapshot Testing Works

  1. Run your AI system on a test input and save the output as a "snapshot"
  2. On subsequent runs, generate output again and compare to the snapshot
  3. If outputs match (within tolerance), test passes
  4. If outputs differ, test fails and prompts you to review the change
  5. If change is intentional (improved output), update the snapshot
  6. If change is a regression, fix the code and re-test

Implementing Snapshot Tests with pytest

Use the pytest-regressions plugin for Python-based AI systems:

import pytest
from my_rag_system import query_rag

def test_customer_support_query(data_regression):
    """Test that product question returns consistent answer."""
    query = "What is the return policy for electronics?"
    response = query_rag(query, temperature=0.0)

    # First run: saves snapshot. Subsequent runs: compares against it.
    data_regression.check({
        "answer": response["answer"],
        "sources": response["sources"],
        "confidence": response["confidence"]
    })

def test_semantic_similarity():
    """Allow variance but check semantic equivalence."""
    from sentence_transformers import SentenceTransformer, util

    query = "How do I reset my password?"
    response = query_rag(query, temperature=0.3)

    model = SentenceTransformer('all-MiniLM-L6-v2')

    # load_snapshot is an application-specific helper that reads the stored answer
    previous_answer = load_snapshot("reset_password_answer.txt")

    # Compare embeddings with cosine similarity instead of exact string match
    emb_new = model.encode(response["answer"], convert_to_tensor=True)
    emb_old = model.encode(previous_answer, convert_to_tensor=True)
    similarity = util.cos_sim(emb_new, emb_old).item()

    assert similarity > 0.90, \
        "Semantic similarity too low - answer may have regressed"

When to Use Snapshot Testing

Snapshots work best for:

  • Catching unintended changes when refactoring code or upgrading dependencies
  • Systems with temperature=0 where outputs are relatively stable
  • Baseline establishment when you're not sure what "correct" output looks like yet

Avoid snapshots for:

  • High-variance outputs (temperature > 0.7)
  • Creative generation tasks where many correct outputs exist
  • Systems where output format changes frequently

4. Semantic Similarity Testing

For non-deterministic LLM outputs, semantic similarity measures whether new outputs preserve meaning even if wording differs.

Using Embedding Models

Encode both reference and generated outputs using a sentence embedding model, then compute cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

def test_answer_semantic_equivalence():
    reference = "Our return policy allows 30 days for electronics with original packaging."
    generated = "You can return electronics within 30 days if in original packaging."

    emb1 = model.encode(reference, convert_to_tensor=True)
    emb2 = model.encode(generated, convert_to_tensor=True)

    similarity = util.cos_sim(emb1, emb2).item()
    assert similarity > 0.85, f"Semantic similarity {similarity} below threshold"

Choosing Similarity Thresholds

Calibrate thresholds based on your use case:

  • 0.95-1.0: Nearly identical meaning, appropriate for factual accuracy requirements
  • 0.85-0.95: Same core meaning with minor phrasing differences, good for most production systems
  • 0.70-0.85: Related but possibly divergent, flag for human review
  • < 0.70: Likely regression, automatic failure
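These bands can be wired into a test as a three-way outcome instead of a hard pass/fail; the thresholds below mirror the list above, and flagging the middle band as a skipped test for human review is just one possible convention:

import pytest
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-mpnet-base-v2')

def similarity_outcome(reference, generated, fail_below=0.70, review_below=0.85):
    """Map cosine similarity between reference and generated text to an outcome."""
    emb = model.encode([reference, generated], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    if score < fail_below:
        return "fail", score      # likely regression
    if score < review_below:
        return "review", score    # related but possibly divergent
    return "pass", score          # same core meaning

def test_with_review_band():
    outcome, score = similarity_outcome(
        "Returns are accepted within 30 days.",
        "You can send items back within 30 days of purchase.",
    )
    if outcome == "review":
        pytest.skip(f"Similarity {score:.2f} queued for human review")
    assert outcome != "fail", f"Similarity {score:.2f} indicates a regression"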

5. pytest Strategies for AI Systems

pytest is the standard Python testing framework. Here's how to adapt it for AI regression testing:

Parameterized Tests Across Golden Datasets

import pytest
from my_classifier import classify_intent

# Load golden dataset
GOLDEN_EXAMPLES = [
    ("I want to cancel my subscription", "cancel_request"),
    ("How do I upgrade to premium?", "upgrade_inquiry"),
    ("My payment failed", "payment_issue"),
    # ... 200 more examples
]

@pytest.mark.parametrize("text,expected_intent", GOLDEN_EXAMPLES)
def test_intent_classification(text, expected_intent):
    """Test all golden examples in one parameterized test."""
    result = classify_intent(text)
    assert result["intent"] == expected_intent, \
        f"Expected {expected_intent}, got {result['intent']}"

Fixtures for Model Loading

Use pytest fixtures to avoid reloading models on every test:

@pytest.fixture(scope="module")
def rag_system():
    """Load RAG system once per test module."""
    from my_rag import RAGSystem
    system = RAGSystem(model="gpt-4o-mini")
    yield system
    # Cleanup if needed
    system.close()

def test_product_query(rag_system):
    response = rag_system.query("What colors is the t-shirt available in?")
    assert "blue" in response.lower()
    assert "black" in response.lower()

Custom Assertions for AI Outputs

def assert_factual_consistency(response, ground_truth_facts):
    """Check that response contains required facts."""
    for fact in ground_truth_facts:
        assert fact.lower() in response.lower(), \
            f"Missing required fact: {fact}"

def assert_no_hallucination(response, allowed_sources):
    """Verify response only references allowed sources."""
    # extract_citations is an application-specific helper that parses
    # citation markers (e.g., source IDs or URLs) out of the response text
    citations = extract_citations(response)
    for citation in citations:
        assert citation in allowed_sources, \
            f"Hallucinated citation: {citation}"

6. Testing Model Version Upgrades

Model upgrades are high-risk changes. Here's a systematic testing protocol:

Parallel Evaluation

Run old and new models side-by-side on the golden dataset:

def test_model_upgrade_regression():
    """Compare GPT-4o-mini vs GPT-4.1 on golden set."""
    # OpenAI(...) here stands for a thin wrapper around your LLM client,
    # not the raw SDK constructor
    old_model = OpenAI(model="gpt-4o-mini")
    new_model = OpenAI(model="gpt-4.1")

    improvements = 0
    regressions = 0

    for input_text, reference in GOLDEN_DATASET:
        old_output = old_model.generate(input_text)
        new_output = new_model.generate(input_text)

        old_score = evaluate_quality(old_output, reference)
        new_score = evaluate_quality(new_output, reference)

        if new_score > old_score:
            improvements += 1
        elif new_score < old_score:
            regressions += 1
            log_regression(input_text, old_output, new_output)

    regression_rate = regressions / len(GOLDEN_DATASET)
    assert regression_rate < 0.05, \
        f"Regression rate {regression_rate} exceeds 5% threshold"

A/B Testing in Staging

Before promoting to production, run A/B tests in staging:

  • Split traffic 50/50 between old and new models
  • Track metrics: accuracy, latency, user satisfaction, error rate
  • Require statistical significance before promoting (p < 0.05, minimum 1000 samples per variant)
  • Set non-negotiable thresholds: new model must not increase latency > 10%, must not decrease accuracy > 2%
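For the significance gate, a two-proportion z-test on a binary success metric (e.g., per-query accuracy) is one straightforward option; this is a sketch, and the sample counts in the usage line are illustrative:

import math
from scipy.stats import norm

def accuracy_change_significant(correct_a, n_a, correct_b, n_b, alpha=0.05):
    """Two-proportion z-test: is variant B's accuracy significantly different from A's?"""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided
    return p_value < alpha, p_value

# Example: 1,200 labeled samples per variant from the staging A/B test
significant, p = accuracy_change_significant(1044, 1200, 1074, 1200)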

7. Continuous Evaluation Pipelines

Regression testing isn't a one-time event—it's a continuous process. Build pipelines that run automatically:

Nightly Regression Runs

Schedule pytest runs against your golden dataset every night:

# .github/workflows/regression-tests.yml
name: AI Regression Tests
on:
  schedule:
    - cron: '0 2 * * *'  # 2 AM daily
  pull_request:
    paths:
      - 'src/ai_system/**'

jobs:
  regression-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run pytest regression tests
        run: pytest tests/regression/ -v --tb=short
      - name: Upload failure logs
        if: failure()
        uses: actions/upload-artifact@v3
        with:
          name: regression-failures
          path: logs/

Production Monitoring as Regression Detection

Monitor live production metrics to catch degradation:

  • Daily: Track accuracy, confidence distribution, latency p95, error rate
  • Weekly: Sample 100 production outputs for human quality review
  • Monthly: Deep dive on metric trends, refresh golden dataset with production samples
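The weekly human-review sample can be pulled with a few lines; fetch_recent_outputs stands in for whatever logging or observability store you actually query:

import csv
import random

def export_review_sample(outputs, n=100, path="weekly_review.csv"):
    """Write a random sample of recent production outputs for human review."""
    sample = random.sample(outputs, min(n, len(outputs)))
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["query", "answer", "model_version"])
        writer.writeheader()
        writer.writerows(sample)

# outputs = fetch_recent_outputs(days=7)  # hypothetical log-store helper
# export_review_sample(outputs)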

Alerting on Degradation

Set up automated alerts for regression signals:

# Example: Datadog monitor
if avg(last_4h):avg:ai.accuracy{env:prod} < 0.85 {
  alert: "AI accuracy dropped below 85% - possible regression"
}

if avg(last_1h):p95:ai.latency_ms{env:prod} > 3000 {
  alert: "AI p95 latency exceeded 3s - performance regression"
}

8. Testing RAG Systems Specifically

RAG introduces additional regression vectors: retrieval quality, context relevance, citation accuracy.

Retrieval Quality Tests

def test_retrieval_quality():
    """Ensure top-k retrieved docs contain relevant information."""
    query = "What is the refund policy?"
    docs = retriever.retrieve(query, k=5)

    # Check that at least one doc contains policy keywords
    relevant_keywords = ["refund", "return", "money back"]
    found = any(
        any(keyword in doc.text.lower() for keyword in relevant_keywords)
        for doc in docs
    )
    assert found, "No relevant documents retrieved"

def test_retrieval_ndcg():
    """Measure ranking quality using NDCG."""
    from sklearn.metrics import ndcg_score

    for query, relevant_doc_ids in GOLDEN_RETRIEVAL_SET:
        retrieved = retriever.retrieve(query, k=10)
        retrieved_ids = [doc.id for doc in retrieved]

        # Create relevance labels
        labels = [1 if doc_id in relevant_doc_ids else 0
                  for doc_id in retrieved_ids]

        ndcg = ndcg_score([labels], [list(range(10, 0, -1))])
        assert ndcg > 0.8, f"NDCG {ndcg} below threshold for query: {query}"

Citation Accuracy Tests

def test_citation_accuracy():
    """Verify that claims are grounded in retrieved context."""
    query = "What are the system requirements?"
    response = rag_system.query(query)

    # extract_claims and is_claim_in_context are application-specific helpers
    # (e.g., sentence splitting plus an NLI or LLM-as-judge entailment check)
    claims = extract_claims(response.answer)
    context = response.context

    for claim in claims:
        grounded = is_claim_in_context(claim, context)
        assert grounded, f"Ungrounded claim detected: {claim}"

End-to-End RAG Pipeline Tests

def test_rag_pipeline_integration():
    """Test full RAG pipeline from query to final answer."""
    query = "How do I change my email address?"

    # Step 1: Retrieval
    docs = retriever.retrieve(query, k=3)
    assert len(docs) > 0, "Retrieval failed"

    # Step 2: Context construction
    context = build_context(docs)
    assert len(context) < 4000, "Context too long"

    # Step 3: Generation
    answer = generator.generate(query, context)
    assert answer is not None, "Generation failed"

    # Step 4: Quality checks
    assert_factual_consistency(answer, docs)
    assert_no_hallucination(answer, docs)
    assert len(answer) > 50, "Answer too short"

9. pytest-ML 2.0 and Specialized AI Testing Frameworks

In 2025, specialized frameworks emerged for AI testing. pytest-ML 2.0, released in January 2025, extends pytest with AI-specific features.

Key Features of pytest-ML 2.0

  • Model versioning: Automatic tracking of which model version produced which test results
  • Semantic assertions: Built-in similarity checks, hallucination detection, bias testing
  • Performance profiling: Integrated latency and memory tracking per test
  • 30% faster execution: Optimized for individual model tests with caching and parallel execution

Example Usage

import pytest_ml

@pytest_ml.model_test(model_version="gpt-4.1")
def test_sentiment_classification():
    """Test sentiment classification with semantic validation."""
    result = classify_sentiment("This product exceeded my expectations!")

    pytest_ml.assert_label(result, "positive")
    pytest_ml.assert_confidence(result, min_threshold=0.9)
    pytest_ml.assert_no_bias(result, protected_attributes=["gender", "race"])

10. Testing for Data Drift

Production data evolves. Tests that pass today may fail tomorrow as user behavior changes.

Detecting Distribution Shift

from scipy.stats import ks_2samp
import numpy as np

def test_input_distribution_drift():
    """Check if recent production inputs match the training distribution."""
    train_embeddings = load_embeddings("training_data.npy")
    prod_embeddings = get_recent_prod_embeddings(days=7)

    # ks_2samp compares 1-D samples, so reduce each embedding to its norm
    train_norms = np.linalg.norm(train_embeddings, axis=1)
    prod_norms = np.linalg.norm(prod_embeddings, axis=1)

    # Kolmogorov-Smirnov test for distribution shift
    statistic, p_value = ks_2samp(train_norms, prod_norms)

    assert p_value > 0.01, \
        f"Significant distribution drift detected (p={p_value:.4f})"

Concept Drift Detection

Track model accuracy over time windows:

import numpy as np

def test_rolling_accuracy():
    """Ensure accuracy hasn't degraded over the past 30 days."""
    daily_accuracies = get_daily_accuracy_scores(days=30)  # from your metrics store

    recent_accuracy = np.mean(daily_accuracies[-7:])
    baseline_accuracy = np.mean(daily_accuracies[:7])

    degradation = baseline_accuracy - recent_accuracy
    assert degradation < 0.05, \
        f"Accuracy degraded by {degradation*100:.1f}%"

11. Handling Flaky AI Tests

Stochastic outputs cause test flakiness. Strategies to manage:

Use temperature=0 for Determinism

Eliminate randomness where possible:

def test_deterministic_output():
    response = llm.generate(prompt, temperature=0.0, seed=42)
    # Output is now stable enough for exact comparison, though hosted APIs
    # may still vary slightly across model versions or backend changes

Run Multiple Trials

For tests requiring temperature > 0, run multiple samples and check statistical properties:

import numpy as np

def test_average_quality():
    """Test average quality over 10 runs."""
    scores = []
    for _ in range(10):
        output = llm.generate(prompt, temperature=0.7)
        scores.append(evaluate_quality(output))  # your quality scorer or LLM judge

    avg_score = np.mean(scores)
    assert avg_score > 0.85, f"Average quality {avg_score:.2f} below threshold"

Retry Transient Failures

For API failures or other transient infrastructure issues, retry the test with the pytest-rerunfailures plugin:

@pytest.mark.flaky(reruns=3, reruns_delay=2)
def test_api_endpoint():
    response = api_call()
    assert response.status_code == 200

12. Regression Testing Checklist

Before deploying any AI change, verify:

  • Golden dataset pass rate: > 95% of examples pass
  • No new hallucinations: Citation accuracy maintained or improved
  • Latency regression: p95 latency increase < 10%
  • Quality metrics stable: Accuracy, F1, NDCG within 2% of baseline
  • Edge case coverage: All historical bugs still handled correctly
  • Human review sample: 50-100 outputs manually checked for quality
  • A/B test in staging: Statistical significance favoring new version
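Most of these gates can be collapsed into a single pre-deploy pytest check that fails the build when the golden-set pass rate drops below 95%; run_system is a placeholder for your pipeline, and passes refers to the per-task criterion sketched earlier:

def test_golden_set_pass_rate(min_pass_rate=0.95):
    """Deployment gate: overall golden-set pass rate must stay above threshold."""
    examples = load_golden_set()
    failures = []
    for ex in examples:
        output = run_system(ex["input"])      # placeholder for your AI pipeline
        if not passes(ex, output):            # per-task criterion from section 2
            failures.append(ex["input"])

    pass_rate = 1 - len(failures) / len(examples)
    assert pass_rate >= min_pass_rate, (
        f"Pass rate {pass_rate:.1%} below {min_pass_rate:.0%}; "
        f"first failures: {failures[:5]}"
    )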

13. Tools and Frameworks Summary

Testing Frameworks

  • pytest: Standard Python testing framework, extensible with plugins
  • pytest-regressions: Snapshot testing plugin for regression detection
  • pytest-ML 2.0: AI-specific testing framework with semantic assertions

Evaluation Libraries

  • sentence-transformers: Semantic similarity using embeddings
  • RAGAS: RAG-specific evaluation metrics (faithfulness, relevance, context recall)
  • LangSmith: Observability and evaluation platform with golden dataset management

Continuous Testing

  • GitHub Actions: CI/CD automation for scheduled regression runs
  • Weights & Biases: Experiment tracking with automated regression detection
  • Arize AI: Production monitoring with drift detection and alerting

Key Takeaways

Regression testing for AI systems requires new patterns beyond traditional software testing. Build golden datasets covering common and edge cases, use snapshot testing to detect unintended changes, implement semantic similarity checks for non-deterministic outputs, and establish continuous evaluation pipelines that run nightly and monitor production.

Use pytest with AI-specific extensions like pytest-ML 2.0 for semantic assertions and model versioning. Test RAG systems across all components—retrieval, context construction, generation, and citations. Detect data drift and concept drift before they degrade production performance.

Most importantly, treat every production failure as a regression test case. The best test suite is one that prevents yesterday's bugs from recurring tomorrow.