Thread Transfer
Chaos Engineering for AI Systems
What happens when your embedding service dies? When the LLM times out? Chaos engineering finds the answers before production does.
Jorgo Bardho
Founder, Thread Transfer
Your RAG system works flawlessly in testing. Then production happens: the vector database crashes during peak load. The LLM API returns 503 errors for 20 minutes. Your embedding service runs out of memory. Network latency between components spikes to 5 seconds. Each failure cascades through your AI pipeline in ways you never anticipated. Users see gibberish, hallucinations, or nothing at all. By the time you diagnose the issue, trust is damaged and revenue is lost.
This is why chaos engineering exists: you deliberately inject failures under controlled conditions and find the weaknesses before production finds them for you. For AI systems with complex pipelines, data dependencies, and emergent behaviors, chaos engineering isn't optional. This guide covers chaos testing strategies designed specifically for AI services: pipeline disruptions, network failures, model degradation scenarios, and automated frameworks that make chaos testing practical.
1. What Is Chaos Engineering?
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Instead of waiting for failures to happen, you intentionally inject them in controlled ways to:
- Identify weaknesses before users encounter them
- Verify that monitoring and alerting actually work
- Test that redundancy and failover mechanisms function correctly
- Build team muscle memory for incident response
Chaos Engineering vs Chaos Testing
Chaos testing refers to controlled experiments in staging or pre-production environments. Chaos engineering extends this to production systems, running controlled disruptions on live traffic to ensure real-world resilience. For AI systems, start with chaos testing in staging before graduating to production chaos engineering.
2. Why AI Systems Need Chaos Engineering
AI systems are uniquely vulnerable to cascading failures that traditional software doesn't face:
Complex Dependency Chains
A typical RAG system depends on: user input → query embedding → vector database retrieval → reranking → context construction → LLM generation → response formatting. Failure in any component propagates downstream. Chaos engineering tests whether your system degrades gracefully or collapses catastrophically.
External API Dependencies
Most AI systems call external APIs: OpenAI, Anthropic, Cohere, Pinecone, Weaviate. These services have SLAs—99.9% uptime means 43 minutes of downtime per month. Are you prepared for that? Chaos engineering tests your fallback strategies.
Data Pipeline Brittleness
AI relies on data pipelines: document ingestion, embedding generation, vector indexing, cache warming. What happens when:
- A batch job fails halfway through?
- Vector database indexes become corrupted?
- Embeddings drift due to model version mismatch?
Chaos engineering surfaces these failure modes in controlled settings.
Emergent Behavior Under Stress
AI systems degrade unpredictably under pressure. When GPU memory fills up, does your system queue requests, reject them cleanly, or crash? When latency spikes, do retries make the problem worse? Chaos engineering reveals behavior that only emerges under abnormal conditions.
3. Chaos Engineering Principles for AI
Start with a Steady State Hypothesis
Define what "normal" looks like with measurable metrics:
- p95 latency < 2 seconds
- Error rate < 1%
- RAG accuracy > 90%
- Hallucination rate < 5%
Vary Real-World Events
Inject failures that could actually happen in production:
- Vector database crashes or becomes unreachable
- LLM API returns 429 (rate limit) or 503 (service unavailable)
- Network latency between services increases 10x
- Embedding model version changes unexpectedly
- Document corpus becomes temporarily unavailable
Run Experiments in Production (Eventually)
Staging environments miss production-specific issues. Once confident, run controlled chaos in production with:
- Blast radius limits (affect 1-5% of traffic)
- Automated rollback triggers
- Time windows during low-traffic periods
- Kill switches to abort experiments instantly
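A minimal sketch of such a gate, assuming a hypothetical CHAOS_KILL_SWITCH environment variable and per-request sampling (the names are illustrative, not from any particular framework):
# Hypothetical blast-radius gate: only a small fraction of requests see chaos,
# and an environment-variable kill switch disables injection instantly.
import os
import random

def chaos_enabled(traffic_fraction: float = 0.02) -> bool:
    """Return True only when chaos injection should apply to this request."""
    if os.environ.get("CHAOS_KILL_SWITCH", "off") == "on":
        return False  # kill switch aborts all injection immediately
    return random.random() < traffic_fraction  # blast radius: ~2% of traffic

# Usage inside a request handler:
# if chaos_enabled():
#     inject_latency_or_error()  # whatever fault the experiment calls for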
Automate Experiments Continuously
Manual chaos testing doesn't scale. Use frameworks that run chaos experiments automatically and notify you only when failures deviate from expected behavior.
4. Common Chaos Experiments for AI Systems
Experiment 1: Vector Database Failure
Hypothesis: When the vector database is unavailable, the system returns informative errors rather than hallucinating.
Method: Inject network failures to the vector database for 5 minutes.
Expected behavior:
- System detects retrieval failure within 1 second
- Returns 503 error with message "Search temporarily unavailable"
- Does NOT attempt generation without retrieved context
- Resumes normal operation when database recovers
Implementation with Chaos Toolkit:
{
  "title": "Vector DB failure resilience",
  "description": "Test system behavior when vector database is unreachable",
  "steady-state-hypothesis": {
    "title": "System is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "check-error-rate",
        "tolerance": {
          "type": "range",
          "range": [0, 0.01]
        },
        "provider": {
          "type": "http",
          "url": "https://api.example.com/metrics/error_rate"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "block-vectordb-traffic",
      "provider": {
        "type": "process",
        "path": "iptables",
        "arguments": [
          "-A", "OUTPUT", "-p", "tcp",
          "-d", "vectordb.example.com", "-j", "DROP"
        ]
      },
      "pauses": {
        "after": 300
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-vectordb-traffic",
      "provider": {
        "type": "process",
        "path": "iptables",
        "arguments": ["-D", "OUTPUT", "-p", "tcp", "-d", "vectordb.example.com", "-j", "DROP"]
      }
    }
  ]
}
Experiment 2: LLM API Rate Limiting
Hypothesis: When LLM API returns 429 errors, the system implements exponential backoff and queues requests rather than failing immediately.
Method: Simulate rate limit responses from LLM API.
Expected behavior:
- First retry after 1 second
- Second retry after 2 seconds
- Third retry after 4 seconds
- After 3 retries, return an error to the user with a Retry-After header
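A minimal sketch of that retry policy, assuming a call_llm callable that returns an object with a status_code attribute (illustrative names; real SDKs expose retries differently):
import random
import time

class RateLimitedError(Exception):
    """Raised once retries are exhausted; carries a Retry-After hint for the caller."""
    def __init__(self, retry_after: int):
        super().__init__(f"LLM API rate limited; retry after {retry_after}s")
        self.retry_after = retry_after

def call_with_backoff(call_llm, max_retries: int = 3, base_delay: float = 1.0):
    """Retry on 429 with exponential backoff: 1s, 2s, 4s, then give up."""
    for attempt in range(max_retries + 1):
        response = call_llm()
        if response.status_code != 429:
            return response
        if attempt == max_retries:
            break  # retries exhausted; surface the error to the user
        delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)  # small jitter
        time.sleep(delay)
    raise RateLimitedError(retry_after=int(base_delay * 2 ** max_retries))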
Experiment 3: Network Latency Injection
Hypothesis: When network latency between services increases to 2 seconds, end-to-end latency remains below 10 seconds (services handle it gracefully).
Method: Use tc (traffic control) to add 2000ms latency to network packets.
# Add latency to network interface
tc qdisc add dev eth0 root netem delay 2000ms
# Run traffic for 10 minutes
sleep 600
# Remove latency
tc qdisc del dev eth0 root
Experiment 4: Partial Data Pipeline Failure
Hypothesis: When document ingestion fails mid-batch, the system rolls back changes and retries from the last successful checkpoint.
Method: Kill the ingestion worker process after 50% of a batch is processed.
Expected behavior:
- System detects worker failure within 30 seconds
- Rolls back partial batch (no corrupt embeddings in vector DB)
- Retries entire batch from beginning
- Completes successfully on retry
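A sketch of the rollback behavior, assuming a vector store client with upsert and delete methods and an embed function (all hypothetical interfaces, not any particular SDK):
def ingest_batch(documents, embed, vector_store, batch_id):
    """Ingest one batch transactionally: if any document fails, remove every
    vector written so far and re-raise so the scheduler retries the whole batch."""
    written_ids = []
    try:
        for doc in documents:
            vector = embed(doc["text"])
            vector_store.upsert(id=doc["id"], vector=vector, metadata={"batch": batch_id})
            written_ids.append(doc["id"])
    except Exception:
        if written_ids:
            vector_store.delete(ids=written_ids)  # no partial batch left behind
        raise  # let the caller retry from the last successful checkpoint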
Experiment 5: Model Degradation Simulation
Hypothesis: When model quality drops (simulated by switching to a weaker model), monitoring alerts fire within 5 minutes.
Method: Temporarily route requests to GPT-3.5-turbo instead of GPT-4.
Expected behavior:
- Quality metrics (accuracy, hallucination rate) degrade detectably
- Monitoring system fires alert: "Model performance regression detected"
- On-call engineer receives notification
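A sketch of the regression check such an alert could run over a rolling window of evaluation results; the baselines, tolerances, and alert callable are illustrative:
BASELINE = {"accuracy": 0.90, "hallucination_rate": 0.05}

def check_model_regression(window_metrics: dict, alert) -> None:
    """Compare rolling quality metrics against baseline tolerances and alert on regression."""
    degraded = []
    if window_metrics["accuracy"] < BASELINE["accuracy"] - 0.05:
        degraded.append(f"accuracy={window_metrics['accuracy']:.2f}")
    if window_metrics["hallucination_rate"] > BASELINE["hallucination_rate"] + 0.03:
        degraded.append(f"hallucination_rate={window_metrics['hallucination_rate']:.2f}")
    if degraded:
        alert("Model performance regression detected: " + ", ".join(degraded))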
Experiment 6: Cascading Failure Test
Hypothesis: When embedding service crashes, it doesn't take down the entire system due to retry storms.
Method: Kill embedding service pods.
Expected behavior:
- System detects embedding service unavailability
- Circuit breaker opens after 5 consecutive failures
- Requests fail fast with 503 rather than queuing indefinitely
- System recovers automatically when embedding service restarts
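A minimal circuit breaker sketch that fails fast after five consecutive failures and allows a trial call after a cooldown; thresholds and error handling are illustrative:
import time

class CircuitBreaker:
    """Open after N consecutive failures; fail fast until a cooldown elapses."""
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                # Fail fast (map to a 503 at the API layer) instead of queuing retries
                raise RuntimeError("circuit open: embedding service unavailable")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success closes the circuit
        return result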
5. Tools and Frameworks for AI Chaos Engineering
Chaos Toolkit
Open-source Python framework for chaos experiments. Google Cloud recommends it for distributed systems testing. Key features:
- Declarative experiment definitions (JSON/YAML)
- Extension libraries for Google Cloud, Kubernetes, AWS
- Automated rollback on experiment completion or failure
- Steady-state hypothesis validation before and after chaos
Krkn-AI (Red Hat)
AI-assisted chaos testing framework for Kubernetes. Released in 2025, it uses genetic algorithms to evolve chaos experiments automatically based on system behavior.
How it works:
- Define service-level objectives (SLOs) and health signals
- Krkn-AI generates chaos scenarios (pod kills, network delays, resource constraints)
- Measures impact against SLOs
- Evolves scenarios to find weaknesses that violate SLOs
- Provides objective-driven chaos without manual test design
Genqe.ai
An AI-powered chaos engineering platform that models chaos experiments intelligently, analyzing application architectures, traffic patterns, and dependencies to identify optimal failure injection points.
LitmusChaos
Cloud-native chaos engineering framework for Kubernetes. Provides pre-built chaos experiments for common failure scenarios:
- Pod deletion
- Network latency and packet loss
- CPU and memory stress
- Disk I/O throttling
AWS Fault Injection Simulator (FIS)
Managed service for running chaos experiments on AWS infrastructure. Supports:
- EC2 instance termination
- ECS task killing
- RDS database failover
- Network latency injection
6. Building an AI-Specific Chaos Experiment
Here's a complete example testing RAG pipeline resilience:
Scenario: Testing Embedding Service Failure
Setup (Python with Chaos Toolkit)
# chaos_experiments/embedding_failure.py
from chaoslib.types import Configuration, Secrets
import requests
import time

def check_system_health(configuration: Configuration = None) -> bool:
    """Verify system is healthy before chaos."""
    metrics = requests.get("http://api.example.com/metrics").json()
    return (
        metrics["error_rate"] < 0.01 and
        metrics["p95_latency"] < 3000 and
        metrics["accuracy"] > 0.90
    )

def inject_embedding_failure(duration: int = 300):
    """Kill embedding service pods for specified duration."""
    import subprocess
    # Scale embedding deployment to 0 replicas
    subprocess.run([
        "kubectl", "scale", "deployment", "embedding-service",
        "--replicas=0", "-n", "production"
    ])
    # Wait for chaos duration
    time.sleep(duration)

def restore_embedding_service():
    """Restore embedding service."""
    import subprocess
    subprocess.run([
        "kubectl", "scale", "deployment", "embedding-service",
        "--replicas=3", "-n", "production"
    ])

def verify_graceful_degradation() -> bool:
    """Check that system handled failure gracefully."""
    logs = requests.get("http://api.example.com/logs/last/300").json()
    # Verify error responses were appropriate
    error_count = sum(1 for log in logs if log["status"] == 503)
    hallucination_count = sum(1 for log in logs if log.get("hallucination", False))
    # System should return 503 errors, not hallucinate
    return error_count > 0 and hallucination_count == 0
Experiment Definition (JSON)
{
  "title": "Embedding Service Resilience Test",
  "description": "Verify RAG system gracefully handles embedding service failure",
  "tags": ["ai", "rag", "embeddings"],
  "steady-state-hypothesis": {
    "title": "System is healthy",
    "probes": [
      {
        "type": "probe",
        "name": "system-health-check",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaos_experiments.embedding_failure",
          "func": "check_system_health"
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "kill-embedding-service",
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "inject_embedding_failure",
        "arguments": {
          "duration": 300
        }
      }
    },
    {
      "type": "probe",
      "name": "verify-graceful-degradation",
      "tolerance": true,
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "verify_graceful_degradation"
      }
    }
  ],
  "rollbacks": [
    {
      "type": "action",
      "name": "restore-embedding-service",
      "provider": {
        "type": "python",
        "module": "chaos_experiments.embedding_failure",
        "func": "restore_embedding_service"
      }
    }
  ]
}
Running the Experiment
# Run chaos experiment
chaos run embedding_failure.json
# Expected output:
# [INFO] Steady state hypothesis: System is healthy
# [INFO] Probe: system-health-check succeeded
# [INFO] Action: kill-embedding-service
# [INFO] Probe: verify-graceful-degradation succeeded
# [INFO] Rollback: restore-embedding-service
# [INFO] Experiment completed successfully
7. Monitoring and Observability During Chaos
Chaos experiments are only useful if you can observe their effects:
Real-Time Dashboards
Monitor these metrics during chaos experiments:
- Latency: p50, p95, p99 for all service endpoints
- Error rates: By error type (5xx, timeout, validation)
- Quality metrics: Accuracy, hallucination rate, citation correctness
- Resource utilization: CPU, memory, GPU for each service
- Dependency health: Status of vector DB, LLM API, embedding service
Distributed Tracing
Use tools like Jaeger or Datadog APM to trace requests through the AI pipeline during chaos. This shows exactly where failures occur and how they propagate.
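For example, one way to get per-stage spans is the OpenTelemetry Python API, which can export to Jaeger or Datadog; the embed/search/generate callables below stand in for your real pipeline stages, and an exporter is assumed to be configured elsewhere:
# Sketch: wrap each RAG stage in a span so chaos-induced latency and errors
# show up per stage in the trace.
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

def answer(query: str, embed, search, generate) -> str:
    with tracer.start_as_current_span("rag.request"):
        with tracer.start_as_current_span("rag.embed_query"):
            query_vector = embed(query)
        with tracer.start_as_current_span("rag.retrieve"):
            chunks = search(query_vector)
        with tracer.start_as_current_span("rag.generate"):
            return generate(query, chunks)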
Alert Validation
Chaos experiments should trigger your existing alerts. If they don't, your alerting is broken. For each experiment, verify:
- Alert fires within the expected timeframe (e.g., within 1 minute of the failure)
- Alert message contains actionable information
- On-call engineer receives notification via PagerDuty/Slack
- Runbook linked in alert helps responder diagnose and fix
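A probe like the one below could be wired into a Chaos Toolkit experiment as a Python provider (as in Section 6) to assert that the alert actually fired; it assumes a Prometheus Alertmanager endpoint, so adapt it to whatever alerting backend you run:
import time
import requests

def alert_fired_within(alert_name: str, timeout_s: int = 60, poll_s: int = 5) -> bool:
    """Poll Alertmanager until the expected alert is active, or time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        active = requests.get("http://alertmanager.example.com/api/v2/alerts").json()
        if any(a["labels"].get("alertname") == alert_name for a in active):
            return True
        time.sleep(poll_s)
    return False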
8. GameDays: Coordinated Chaos Testing
GameDays are scheduled chaos engineering sessions where teams deliberately disrupt systems and practice incident response.
Planning a GameDay
- Schedule: 2-4 hours during business hours (ensures full team availability)
- Scope: Define which systems are in scope for chaos
- Scenarios: Prepare 3-5 chaos experiments to run sequentially
- Roles: Assign chaos facilitator, incident responders, observers
- Communication: Notify stakeholders—this is intentional chaos, not a real incident
Example GameDay Agenda
- 0:00-0:15: Kickoff, review scenarios, verify monitoring is ready
- 0:15-0:45: Scenario 1 - Vector database failure
- 0:45-1:15: Scenario 2 - LLM API rate limiting
- 1:15-1:45: Scenario 3 - Network latency injection
- 1:45-2:15: Scenario 4 - Cascading failure (multiple components)
- 2:15-2:30: Debrief, document learnings, assign follow-up tasks
GameDay Outcomes
Successful GameDays result in:
- Discovered weaknesses documented in issue tracker
- Updated runbooks based on actual incident response
- Improved monitoring and alerting
- Team confidence in handling real incidents
9. Progressive Chaos: From Staging to Production
Don't start with production chaos. Build confidence progressively:
Phase 1: Local Development
Run chaos experiments on local dev environments. Test that your code handles failures gracefully before it reaches production.
Phase 2: Staging Environment
Run comprehensive chaos experiments in staging with full production-like infrastructure. Validate that monitoring, alerting, and failover mechanisms work.
Phase 3: Production Canary
Run chaos on a small percentage of production traffic (1-5%). Monitor impact closely and abort if unexpected behavior occurs.
Phase 4: Production at Scale
Run chaos on full production traffic during controlled windows. This is the ultimate resilience test—if your system handles this, it can handle real-world failures.
10. Common Failure Modes in AI Systems
Data Pipeline Failures
- Document ingestion crashes mid-batch
- Embedding generation times out for large documents
- Vector database index becomes corrupted
- Cache invalidation fails, serving stale data
Model Inference Failures
- GPU out-of-memory errors under load
- Model API rate limits exceeded
- Response timeout from slow LLM calls
- Malformed responses breaking parsing logic (see the defensive parsing sketch after this list)
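A defensive parser for that last failure mode, assuming the model is asked to return JSON; the fallback shape is illustrative:
import json

def parse_model_json(raw: str) -> dict:
    """Parse an LLM response that should be JSON without crashing the pipeline."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models sometimes wrap JSON in markdown fences or surround it with prose
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                pass
        return {"error": "malformed_model_response", "raw": raw[:200]}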
Integration Failures
- Network partition between services
- Authentication token expiration mid-request
- Version mismatch between services (e.g., embedding model updated but retrieval still expects old embeddings)
11. Automating Chaos with CI/CD
Run chaos experiments automatically to catch regressions:
Scheduled Chaos
# .github/workflows/chaos-tests.yml
name: Weekly Chaos Testing
on:
  schedule:
    - cron: '0 14 * * 3' # Every Wednesday at 14:00 UTC
jobs:
  chaos:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Chaos Toolkit
        run: pip install chaostoolkit
      - name: Run chaos experiments
        run: |
          chaos run experiments/vectordb_failure.json
          chaos run experiments/api_rate_limit.json
          chaos run experiments/network_latency.json
      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: chaos-results
          path: chaostoolkit.log
Pre-Deployment Chaos
Run chaos experiments before promoting to production:
# Before production deploy
steps:
  - name: Deploy to staging
    run: ./deploy_staging.sh
  - name: Run chaos tests
    run: chaos run experiments/full_suite.json
  - name: Verify steady state restored
    run: python verify_health.py
  - name: Deploy to production
    if: success()
    run: ./deploy_production.sh
12. Case Study: Preventing Pipeline Failures
An AI company used chaos engineering to test their document ingestion pipeline. They discovered that when the embedding service crashed mid-batch, partially-processed documents remained in the vector database with invalid embeddings. These corrupt entries caused retrieval to return irrelevant results, degrading RAG quality by 40%.
Fix: Implemented transactional batch processing with rollback on failure. If embedding generation fails for any document in a batch, the entire batch is rolled back and retried. Chaos testing validated the fix before production deployment.
Result: When a real embedding service outage occurred in production, the system gracefully queued batches and processed them after recovery—zero data corruption, zero quality degradation.
Key Takeaways
Chaos engineering for AI services tests resilience by deliberately injecting failures: vector database crashes, LLM API outages, network latency, and pipeline disruptions. Use frameworks like Chaos Toolkit, Krkn-AI, or LitmusChaos to automate chaos experiments. Start in staging, progress to production canary, and eventually run at scale.
Test common failure modes: dependency failures, cascading errors, data pipeline corruption, and model degradation. Monitor latency, error rates, and quality metrics during chaos to verify graceful degradation. Run GameDays to practice incident response and validate that alerting works.
Most importantly, treat chaos engineering as continuous practice, not one-time validation. Systems evolve, dependencies change, and new failure modes emerge. The teams that survive production incidents are the ones that practiced chaos engineering before the incidents happened.