Thread Transfer
Chaos Engineering for AI Systems
What happens when your embedding service dies? When the LLM times out? Chaos engineering finds the answers before production does.
Jorgo Bardho
Founder, Thread Transfer
Your RAG system works flawlessly in testing. Then production happens: the vector database crashes during peak load. The LLM API returns 503 errors for 20 minutes. Your embedding service runs out of memory. Network latency between components spikes to 5 seconds. Each failure cascades through your AI pipeline in ways you never anticipated. Users see gibberish, hallucinations, or nothing at all. By the time you diagnose the issue, trust is damaged and revenue is lost.
This is why chaos engineering exists: you deliberately inject failures under controlled conditions and find the weaknesses before production finds them for you. For AI systems with complex pipelines, data dependencies, and emergent behaviors, chaos engineering isn't optional. This guide covers chaos testing strategies designed specifically for AI services: pipeline disruptions, network failures, model degradation scenarios, and automated frameworks that make chaos testing practical.
1. What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Instead of waiting for failures to happen, you intentionally inject them in controlled ways to:
- Identify weaknesses before users encounter them
- Verify that monitoring and alerting actually work
- Test that redundancy and failover mechanisms function correctly
- Build team muscle memory for incident response
Chaos Engineering vs Chaos Testing
Chaos testing refers to controlled experiments in staging or pre-production environments. Chaos engineering extends this to production systems, running controlled disruptions on live traffic to ensure real-world resilience. For AI systems, start with chaos testing in staging before graduating to production chaos engineering.
2. Why AI Systems Need Chaos Engineering
AI systems are uniquely vulnerable to cascading failures that traditional software doesn't face:
Complex Dependency Chains
A typical RAG system depends on: user input → query embedding → vector database retrieval → reranking → context construction → LLM generation → response formatting. Failure in any component propagates downstream. Chaos engineering tests whether your system degrades gracefully or collapses catastrophically.
External API Dependencies
Most AI systems call external APIs: OpenAI, Anthropic, Cohere, Pinecone, Weaviate. These services have SLAs—99.9% uptime means 43 minutes of downtime per month. Are you prepared for that? Chaos engineering tests your fallback strategies.
Data Pipeline Brittleness
AI relies on data pipelines: document ingestion, embedding generation, vector indexing, cache warming. What happens when:
- A batch job fails halfway through?
- Vector database indexes become corrupted?
- Embeddings drift due to model version mismatch?
Chaos engineering surfaces these failure modes in controlled settings.
Emergent Behavior Under Stress
AI systems degrade unpredictably under pressure. When GPU memory fills up, does your system queue requests, reject them cleanly, or crash? When latency spikes, do retries make the problem worse? Chaos engineering reveals behavior that only emerges under abnormal conditions.
3. Chaos Engineering Principles for AI
Start with a Steady State Hypothesis
Define what "normal" looks like with measurable metrics:
- p95 latency < 2 seconds
- Error rate < 1%
- RAG accuracy > 90%
- Hallucination rate < 5%
Vary Real-World Events
Inject failures that could actually happen in production:
- Vector database crashes or becomes unreachable
- LLM API returns 429 (rate limit) or 503 (service unavailable)
- Network latency between services increases 10x
- Embedding model version changes unexpectedly
- Document corpus becomes temporarily unavailable
Run Experiments in Production (Eventually)
Staging environments miss production-specific issues. Once confident, run controlled chaos in production with:
- Blast radius limits (affect 1-5% of traffic)
- Automated rollback triggers
- Time windows during low-traffic periods
- Kill switches to abort experiments instantly
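A minimal sketch of such a gate, assuming a hypothetical CHAOS_KILL_SWITCH environment variable and per-request sampling (the names are illustrative, not from any particular framework):
# Hypothetical blast-radius gate: only a small fraction of requests see chaos,
# and an environment-variable kill switch disables injection instantly.
import os
import random

def chaos_enabled(traffic_fraction: float = 0.02) -> bool:
    """Return True only when chaos injection should apply to this request."""
    if os.environ.get("CHAOS_KILL_SWITCH", "off") == "on":
        return False  # kill switch aborts all injection immediately
    return random.random() < traffic_fraction  # blast radius: ~2% of traffic

# Usage inside a request handler:
# if chaos_enabled():
#     inject_latency_or_error()  # whatever fault the experiment calls for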
Automate Experiments Continuously
Manual chaos testing doesn't scale. Use frameworks that run chaos experiments automatically and notify you only when failures deviate from expected behavior.
4. Common Chaos Experiments for AI Systems
Experiment 1: Vector Database Failure
Hypothesis: When the vector database is unavailable, the system returns informative errors rather than hallucinating.
Method: Inject network failures to the vector database for 5 minutes.
Expected behavior:
- System detects retrieval failure within 1 second
- Returns 503 error with message "Search temporarily unavailable"
- Does NOT attempt generation without retrieved context
- Resumes normal operation when database recovers
Implementation with Chaos Toolkit:
{
  "title": "Vector DB failure resilience",
  "description": "Test system behavior when vector database is unreachable",
  "steady-state-hypothesis": {
    "title": "System is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "check-error-rate",
        "tolerance": {
          "type": "range",
          "range": [0, 0.01]
        },
        "provider": {
          "type": "http",
          "url": "https://api.example.com/metrics/error_rate"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "block-vectordb-traffic",
      "provider": {
        "type": "process",
        "path": "iptables",
        "arguments": [
          "-A", "OUTPUT", "-p", "tcp",
          "-d", "vectordb.example.com", "-j", "DROP"
        ]
      },
      "pauses": {
        "after": 300
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-vectordb-traffic",
      "provider": {
        "type": "process",
        "path": "iptables",
        "arguments": ["-D", "OUTPUT", "-p", "tcp", "-d", "vectordb.example.com", "-j", "DROP"]
      }
    }
  ]
}
Experiment 2: LLM API Rate Limiting
Hypothesis: When LLM API returns 429 errors, the system implements exponential backoff and queues requests rather than failing immediately.
Method: Simulate rate limit responses from LLM API.
Expected behavior:
- First retry after 1 second
- Second retry after 2 seconds
- Third retry after 4 seconds
- After 3 retries, return an error to the user with a Retry-After header
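A minimal sketch of that retry policy, assuming a call_llm callable that returns an object with a status_code attribute (illustrative names; real SDKs expose retries differently):
import random
import time

class RateLimitedError(Exception):
    """Raised once retries are exhausted; carries a Retry-After hint for the caller."""
    def __init__(self, retry_after: int):
        super().__init__(f"LLM API rate limited; retry after {retry_after}s")
        self.retry_after = retry_after

def call_with_backoff(call_llm, max_retries: int = 3, base_delay: float = 1.0):
    """Retry on 429 with exponential backoff: 1s, 2s, 4s, then give up."""
    for attempt in range(max_retries + 1):
        response = call_llm()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break  # retries exhausted; surface the error to the user
        delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)  # small jitter
        time.sleep(delay)
    raise RateLimitedError(retry_after=int(base_delay * 2 ** max_retries))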
Experiment 3: Network Latency Injection
Hypothesis: When network latency between services increases to 2 seconds, end-to-end latency remains below 10 seconds (services handle it gracefully).
Method: Use tc (traffic control) to add 2000ms latency to network packets.
# Add latency to network interface
tc qdisc add dev eth0 root netem delay 2000ms
# Run traffic for 10 minutes
sleep 600
# Remove latency
tc qdisc del dev eth0 root
Experiment 4: Partial Data Pipeline Failure
Hypothesis: When document ingestion fails mid-batch, the system rolls back changes and retries from the last successful checkpoint.
Method: Kill the ingestion worker process after 50% of a batch is processed.
Expected behavior:
- System detects worker failure within 30 seconds
- Rolls back partial batch (no corrupt embeddings in vector DB)
- Retries entire batch from beginning
- Completes successfully on retry
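A sketch of the rollback behavior, assuming a vector store client with upsert and delete methods and an embed function (all hypothetical interfaces, not any particular SDK):
def ingest_batch(documents, embed, vector_store, batch_id):
    """Ingest one batch transactionally: if any document fails, remove every
    vector written so far and re-raise so the scheduler retries the whole batch."""
    written_ids = []
    try:
        for doc in documents:
            vector = embed(doc["text"])
            vector_store.upsert(id=doc["id"], vector=vector, metadata={"batch": batch_id})
            written_ids.append(doc["id"])
    except Exception:
        if written_ids:
            vector_store.delete(ids=written_ids)  # no partial batch left behind
        raise  # let the caller retry from the last successful checkpoint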
Experiment 5: Model Degradation Simulation
Hypothesis: When model quality drops (simulated by switching to a weaker model), monitoring alerts fire within 5 minutes.
Method: Temporarily route requests to GPT-3.5-turbo instead of GPT-4.
Expected behavior:
- Quality metrics (accuracy, hallucination rate) degrade detectably
- Monitoring system fires alert: "Model performance regression detected"
- On-call engineer receives notification
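A sketch of the regression check such an alert could run over a rolling window of evaluation results; the baselines, tolerances, and alert callable are illustrative:
BASELINE = {"accuracy": 0.90, "hallucination_rate": 0.05}

def check_model_regression(window_metrics: dict, alert) -> None:
    """Compare rolling quality metrics against baseline tolerances and alert on regression."""
    degraded = []
    if window_metrics["accuracy"] < BASELINE["accuracy"] - 0.05:
        degraded.append(f"accuracy={window_metrics['accuracy']:.2f}")
    if window_metrics["hallucination_rate"] > BASELINE["hallucination_rate"] + 0.03:
        degraded.append(f"hallucination_rate={window_metrics['hallucination_rate']:.2f}")
    if degraded:
        alert("Model performance regression detected: " + ", ".join(degraded))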
Experiment 6: Cascading Failure Test
Hypothesis: When embedding service crashes, it doesn't take down the entire system due to retry storms.
Method: Kill embedding service pods.
Expected behavior:
- System detects embedding service unavailability
- Circuit breaker opens after 5 consecutive failures
- Requests fail fast with 503 rather than queuing indefinitely
- System recovers automatically when embedding service restarts
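A minimal circuit breaker sketch that fails fast after five consecutive failures and allows a trial call after a cooldown; thresholds and error handling are illustrative:
import time

class CircuitBreaker:
    """Open after N consecutive failures; fail fast until a cooldown elapses."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Fail fast (map to a 503 at the API layer) instead of queuing retries
                raise RuntimeError("circuit open: embedding service unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success closes the circuit
        return result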
5. Tools and Frameworks for AI Chaos Engineering
Chaos Toolkit
Open-source Python framework for chaos experiments. Google Cloud recommends it for distributed systems testing. Key features:
- Declarative experiment definitions (JSON/YAML)
- Extension libraries for Google Cloud, Kubernetes, AWS
- Automated rollback on experiment completion or failure
- Steady-state hypothesis validation before and after chaos
Krkn-AI (Red Hat)
AI-assisted chaos testing framework for Kubernetes. Released in 2025, it uses genetic algorithms to evolve chaos experiments automatically based on system behavior.
How it works:
- Define service-level objectives (SLOs) and health signals
- Krkn-AI generates chaos scenarios (pod kills, network delays, resource constraints)
- Measures impact against SLOs
- Evolves scenarios to find weaknesses that violate SLOs
- Provides objective-driven chaos without manual test design
Genqe.ai
An AI-powered chaos engineering platform that models chaos experiments intelligently, analyzing application architectures, traffic patterns, and dependencies to identify optimal failure injection points.
LitmusChaos
Cloud-native chaos engineering framework for Kubernetes. Provides pre-built chaos experiments for common failure scenarios:
- Pod deletion
- Network latency and packet loss
- CPU and memory stress
- Disk I/O throttling
AWS Fault Injection Simulator (FIS)
Managed service for running chaos experiments on AWS infrastructure. Supports:
- EC2 instance termination
- ECS task killing
- RDS database failover
- Network latency injection
6. Building an AI-Specific Chaos Experiment
Here's a complete example testing RAG pipeline resilience:
Scenario: Testing Embedding Service Failure
Setup (Python with Chaos Toolkit)
# chaos_experiments/embedding_failure.py
from chaoslib.types import Configuration, Secrets
import requests
import time

def check_system_health(configuration: Configuration = None) -> bool:
    """Verify system is healthy before chaos."""
    metrics = requests.get("http://api.example.com/metrics").json()
    return (
        metrics["error_rate"] < 0.01 and
        metrics["p95_latency"] < 3000 and
        metrics["accuracy"] > 0.90
    )

def inject_embedding_failure(duration: int = 300):
    """Kill embedding service pods for specified duration."""
    import subprocess
    # Scale embedding deployment to 0 replicas
    subprocess.run([
        "kubectl", "scale", "deployment", "embedding-service",
        "--replicas=0", "-n", "production"
    ])
    # Wait for chaos duration
    time.sleep(duration)

def restore_embedding_service():
    """Restore embedding service."""
    import subprocess
    subprocess.run([
        "kubectl", "scale", "deployment", "embedding-service",
        "--replicas=3", "-n", "production"
    ])

def verify_graceful_degradation() -> bool:
    """Check that system handled failure gracefully."""
    logs = requests.get("http://api.example.com/logs/last/300").json()
    # Verify error responses were appropriate
    error_count = sum(1 for log in logs if log["status"] == 503)
    hallucination_count = sum(1 for log in logs if log.get("hallucination", False))
    # System should return 503 errors, not hallucinate
    return error_count > 0 and hallucination_count == 0
Experiment Definition (JSON)
{
  "title": "Embedding Service Resilience Test",
  "description": "Verify RAG system gracefully handles embedding service failure",
  "tags": ["ai", "rag", "embeddings"],
  "steady-state-hypothesis": {
    "title": "System is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "system-health-check",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaos_experiments.embedding_failure",
          "func": "check_system_health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-embedding-service",
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "inject_embedding_failure",
        "arguments": {
          "duration": 300
        }
      }
    },
    {
      "type": "probe",
      "name": "verify-graceful-degradation",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "verify_graceful_degradation"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-embedding-service",
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "restore_embedding_service"
      }
    }
  ]
}
Running the Experiment
# Run chaos experiment
chaos run embedding_failure.json
# Expected output:
# [INFO] Steady state hypothesis: System is healthy
# [INFO] Probe: system-health-check succeeded
# [INFO] Action: kill-embedding-service
# [INFO] Probe: verify-graceful-degradation succeeded
# [INFO] Rollback: restore-embedding-service
# [INFO] Experiment completed successfully
7. Monitoring and Observability During Chaos
Chaos experiments are only useful if you can observe their effects:
Real-Time Dashboards
Monitor these metrics during chaos experiments:
- Latency: p50, p95, p99 for all service endpoints
- Error rates: By error type (5xx, timeout, validation)
- Quality metrics: Accuracy, hallucination rate, citation correctness
- Resource utilization: CPU, memory, GPU for each service
- Dependency health: Status of vector DB, LLM API, embedding service
Distributed Tracing
Use tools like Jaeger or Datadog APM to trace requests through the AI pipeline during chaos. This shows exactly where failures occur and how they propagate.
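For example, one way to get per-stage spans is the OpenTelemetry Python API, which can export to Jaeger or Datadog; the embed/search/generate callables below stand in for your real pipeline stages, and an exporter is assumed to be configured elsewhere:
# Sketch: wrap each RAG stage in a span so chaos-induced latency and errors
# show up per stage in the trace.
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

def answer(query: str, embed, search, generate) -> str:
    with tracer.start_as_current_span("rag.request"):
        with tracer.start_as_current_span("rag.embed_query"):
            query_vector = embed(query)
        with tracer.start_as_current_span("rag.retrieve"):
            chunks = search(query_vector)
        with tracer.start_as_current_span("rag.generate"):
            return generate(query, chunks)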
Alert Validation
Chaos experiments should trigger your existing alerts. If they don't, your alerting is broken. For each experiment, verify:
- Alert fires within the expected timeframe (e.g., within 1 minute of the failure)
- Alert message contains actionable information
- On-call engineer receives notification via PagerDuty/Slack
- Runbook linked in alert helps responder diagnose and fix
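A probe like the one below could be wired into a Chaos Toolkit experiment as a Python provider (as in Section 6) to assert that the alert actually fired; it assumes a Prometheus Alertmanager endpoint, so adapt it to whatever alerting backend you run:
import time
import requests

def alert_fired_within(alert_name: str, timeout_s: int = 60, poll_s: int = 5) -> bool:
    """Poll Alertmanager until the expected alert is active, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        active = requests.get("http://alertmanager.example.com/api/v2/alerts").json()
        if any(a["labels"].get("alertname") == alert_name for a in active):
            return True
        time.sleep(poll_s)
    return False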
8. GameDays: Coordinated Chaos Testing
GameDays are scheduled chaos engineering sessions where teams deliberately disrupt systems and practice incident response.
Planning a GameDay
- Schedule: 2-4 hours during business hours (ensures full team availability)
- Scope: Define which systems are in scope for chaos
- Scenarios: Prepare 3-5 chaos experiments to run sequentially
- Roles: Assign chaos facilitator, incident responders, observers
- Communication: Notify stakeholders—this is intentional chaos, not a real incident
Example GameDay Agenda
- 0:00-0:15: Kickoff, review scenarios, verify monitoring is ready
- 0:15-0:45: Scenario 1 - Vector database failure
- 0:45-1:15: Scenario 2 - LLM API rate limiting
- 1:15-1:45: Scenario 3 - Network latency injection
- 1:45-2:15: Scenario 4 - Cascading failure (multiple components)
- 2:15-2:30: Debrief, document learnings, assign follow-up tasks
GameDay Outcomes
Successful GameDays result in:
- Discovered weaknesses documented in issue tracker
- Updated runbooks based on actual incident response
- Improved monitoring and alerting
- Team confidence in handling real incidents
9. Progressive Chaos: From Staging to Production
Don't start with production chaos. Build confidence progressively:
Phase 1: Local Development
Run chaos experiments on local dev environments. Test that your code handles failures gracefully before it reaches production.
Phase 2: Staging Environment
Run comprehensive chaos experiments in staging with full production-like infrastructure. Validate that monitoring, alerting, and failover mechanisms work.
Phase 3: Production Canary
Run chaos on a small percentage of production traffic (1-5%). Monitor impact closely and abort if unexpected behavior occurs.
Phase 4: Production at Scale
Run chaos on full production traffic during controlled windows. This is the ultimate resilience test—if your system handles this, it can handle real-world failures.
10. Common Failure Modes in AI Systems
Data Pipeline Failures
- Document ingestion crashes mid-batch
- Embedding generation times out for large documents
- Vector database index becomes corrupted
- Cache invalidation fails, serving stale data
Model Inference Failures
- GPU out-of-memory errors under load
- Model API rate limits exceeded
- Response timeout from slow LLM calls
- Malformed responses breaking parsing logic (see the defensive parsing sketch after this list)
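A defensive parser for that last failure mode, assuming the model is asked to return JSON; the fallback shape is illustrative:
import json

def parse_model_json(raw: str) -> dict:
    """Parse an LLM response that should be JSON without crashing the pipeline."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models sometimes wrap JSON in markdown fences or surround it with prose
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                pass
        return {"error": "malformed_model_response", "raw": raw[:200]}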
Integration Failures
- Network partition between services
- Authentication token expiration mid-request
- Version mismatch between services (e.g., embedding model updated but retrieval still expects old embeddings)
11. Automating Chaos with CI/CD
Run chaos experiments automatically to catch regressions:
Scheduled Chaos
# .github/workflows/chaos-tests.yml
name: Weekly Chaos Testing
on:
  schedule:
    - cron: '0 14 * * 3' # Every Wednesday at 14:00 UTC
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Chaos Toolkit
        run: pip install chaostoolkit
      - name: Run chaos experiments
        run: |
          chaos run experiments/vectordb_failure.json
          chaos run experiments/api_rate_limit.json
          chaos run experiments/network_latency.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: chaos-results
          path: chaostoolkit.log
Pre-Deployment Chaos
Run chaos experiments before promoting to production:
# Before production deploy
steps:
  - name: Deploy to staging
    run: ./deploy_staging.sh
  - name: Run chaos tests
    run: chaos run experiments/full_suite.json
  - name: Verify steady state restored
    run: python verify_health.py
  - name: Deploy to production
    if: success()
    run: ./deploy_production.sh
12. Case Study: Preventing Pipeline Failures
An AI company used chaos engineering to test their document ingestion pipeline. They discovered that when the embedding service crashed mid-batch, partially-processed documents remained in the vector database with invalid embeddings. These corrupt entries caused retrieval to return irrelevant results, degrading RAG quality by 40%.
Fix: Implemented transactional batch processing with rollback on failure. If embedding generation fails for any document in a batch, the entire batch is rolled back and retried. Chaos testing validated the fix before production deployment.
Result: When a real embedding service outage occurred in production, the system gracefully queued batches and processed them after recovery—zero data corruption, zero quality degradation.
Key Takeaways
Chaos engineering for AI services tests resilience by deliberately injecting failures: vector database crashes, LLM API outages, network latency, and pipeline disruptions. Use frameworks like Chaos Toolkit, Krkn-AI, or LitmusChaos to automate chaos experiments. Start in staging, progress to production canary, and eventually run at scale.
Test common failure modes: dependency failures, cascading errors, data pipeline corruption, and model degradation. Monitor latency, error rates, and quality metrics during chaos to verify graceful degradation. Run GameDays to practice incident response and validate that alerting works.
Most importantly, treat chaos engineering as continuous practice, not one-time validation. Systems evolve, dependencies change, and new failure modes emerge. The teams that survive production incidents are the ones that practiced chaos engineering before the incidents happened.