
Thread Transfer

Load Testing AI Systems: Latency Under Pressure

P50 latency looks fine. P99 is a disaster. Here's how to load test AI systems and plan capacity for production traffic.

Jorgo Bardho

Founder, Thread Transfer

August 18, 2025 · 16 min read
load testing · latency · performance · capacity planning · production

Your RAG system handles 50 requests per hour in development. It's fast, accurate, and elegant. Then you launch. Within the first hour, 500 users hit it simultaneously. Latency spikes to 30 seconds. Half the requests time out. GPU memory runs out. The system crashes. You frantically scale infrastructure while support tickets flood in. This scenario plays out weekly across AI teams that skip load testing.

Load testing AI endpoints isn't optional—it's the difference between controlled launches and production disasters. This guide covers complete load testing strategies using k6 and Locust: measuring latency under realistic load, testing throughput limits, monitoring GPU utilization, and identifying failure modes before users encounter them.

1. Why AI Endpoints Need Different Load Testing

Traditional API load testing focuses on request throughput and response times. AI endpoints add complexity that breaks standard assumptions:

Variable Response Times

Unlike REST APIs that return in milliseconds, LLM inference can take 500ms to 10 seconds depending on output length, model size, and GPU availability. Simple throughput metrics miss this variance. You need percentile latency (p50, p95, p99) and distribution analysis.
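
As a quick illustration, here is a minimal sketch (the latency values are made up) of why percentile analysis catches what averages hide:

import numpy as np

# Illustrative per-request latencies (ms) from a single load test run
latencies_ms = [420, 510, 550, 610, 640, 680, 730, 820, 940, 9100]

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"mean={np.mean(latencies_ms):.0f}ms  p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
# The mean (~1.5s) looks acceptable, but p95/p99 expose the multi-second
# generations that long outputs or GPU queueing delays produce.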

Expensive Compute Resources

Each AI request consumes GPU compute, memory, and potentially expensive API calls. A traditional load test that fires 10,000 requests could cost hundreds of dollars in inference charges. Load testing requires careful scoping and cost controls.
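
Before launching a large test, a back-of-envelope cost check is worth the minute it takes. A rough sketch (the per-token prices and token counts below are placeholders; substitute your provider's actual rates):

# Estimate inference cost before firing a load test (placeholder numbers)
requests_fired = 10_000
avg_input_tokens = 1_500      # prompt + retrieved context
avg_output_tokens = 300
price_per_1k_input = 0.003    # USD per 1K input tokens, placeholder
price_per_1k_output = 0.015   # USD per 1K output tokens, placeholder

cost = requests_fired * (
    avg_input_tokens / 1000 * price_per_1k_input
    + avg_output_tokens / 1000 * price_per_1k_output
)
print(f"Estimated inference cost: ${cost:,.2f}")  # ~$90 at these example rates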

State and Context Dependencies

RAG systems maintain vector databases, embedding models, and context caches. Load testing must simulate realistic query patterns—not just random strings—and account for cache hit rates, vector database load, and embedding generation bottlenecks.

Graceful Degradation vs Hard Failures

AI systems often degrade gradually under load. Quality drops before complete failure. A model might start hallucinating when GPU memory pressure forces batch size reduction, or retrieval quality suffers when the vector database is saturated. Traditional pass/fail metrics miss these quality regressions.

2. k6 vs Locust: Choosing Your Tool

Both k6 and Locust are industry-standard load testing tools. Here's when to use each:

k6 Overview

k6 is a developer-centric, open-source load testing tool written in Go with JavaScript test scripts. It's lightweight, handles high concurrency with minimal resources, and integrates seamlessly with Grafana for visualization.

k6 Strengths

  • Performance: Handles 10-100x more requests per machine than Locust due to Go's efficiency
  • CI/CD integration: Easy to run in pipelines with automated thresholds
  • Protocol support: Native gRPC support, WebSockets, HTTP/2
  • Scripting simplicity: JavaScript is familiar to most developers

k6 Limitations

  • Less flexible than Python for complex test logic
  • Smaller plugin ecosystem compared to Locust

Locust Overview

Locust is a Python-based load testing framework where you define user behaviors using Python code. It's flexible, scriptable, and has a simple web-based UI for monitoring tests in real-time.

Locust Strengths

  • Python ecosystem: Easy integration with ML libraries, data processing, custom authentication
  • Flexibility: Complex test scenarios, state management, dynamic payloads
  • Event-driven: Asynchronous request handling for realistic user simulation
  • Web UI: Real-time visualization of metrics without external tools

Locust Limitations

  • Python's runtime overhead limits the requests per second a single machine can generate
  • Higher resource consumption than k6 for equivalent load

Recommendation

Use k6 if: You need maximum throughput per test machine, gRPC support, or tight CI/CD integration.

Use Locust if: You need complex test logic, Python integration with your AI stack, or rapid prototyping of test scenarios.

3. Load Testing with Locust: Complete Example

Here's a production-ready Locust test for a RAG API endpoint:

Basic Locust Test

from locust import HttpUser, task, between
import random
import json

class RAGUser(HttpUser):
    """Simulate users querying a RAG system."""

    # Wait 1-3 seconds between requests per user
    wait_time = between(1, 3)

    # Sample queries representing realistic production traffic
    queries = [
        "What is the return policy?",
        "How do I reset my password?",
        "What are the system requirements?",
        "Can I upgrade my plan?",
        "Where is my order?",
    ]

    def on_start(self):
        """Called once per user when they start."""
        # Simulate login or auth token acquisition
        response = self.client.post("/auth/token", json={
            "username": "loadtest_user",
            "password": "test_password"
        })
        self.token = response.json()["access_token"]

    @task(3)  # Weight: 3x more likely than other tasks
    def query_rag(self):
        """Send a query to the RAG endpoint."""
        query = random.choice(self.queries)

        with self.client.post(
            "/api/v1/query",
            json={"query": query, "top_k": 5},
            headers={"Authorization": f"Bearer {self.token}"},
            catch_response=True
        ) as response:
            # Check response time
            if response.elapsed.total_seconds() > 5.0:
                response.failure(f"Response too slow: {response.elapsed.total_seconds()}s")
            # Check response quality
            elif response.status_code == 200:
                data = response.json()
                if not data.get("answer"):
                    response.failure("Empty answer returned")
                elif len(data.get("sources", [])) == 0:
                    response.failure("No sources provided")
                else:
                    response.success()
            else:
                response.failure(f"Status code: {response.status_code}")

    @task(1)
    def complex_query(self):
        """Test with longer, more complex queries."""
        query = "Can you explain the differences between the Pro and Enterprise plans, including pricing, features, and support options?"
        self.client.post(
            "/api/v1/query",
            json={"query": query},
            headers={"Authorization": f"Bearer {self.token}"},
        )

Running the Locust Test

# Headless mode (CI/CD)
locust -f locustfile.py --headless \
  --users 100 \
  --spawn-rate 10 \
  --run-time 10m \
  --host https://api.example.com \
  --csv results

# With web UI (interactive)
locust -f locustfile.py --host https://api.example.com
# Then open http://localhost:8089 in browser

Advanced Locust Patterns

Realistic Query Distribution

import random

import numpy as np

class RAGUser(HttpUser):
    def on_start(self):
        # Load production query distribution from logs
        with open("production_queries.txt") as f:
            self.queries = [line.strip() for line in f.readlines()]

        # Assign power-law (Zipf) weights so a handful of queries dominate,
        # approximating the skew seen in production traffic
        self.query_weights = np.random.zipf(1.5, len(self.queries))

    @task
    def query_rag(self):
        # Sample query based on production distribution
        query = random.choices(self.queries, weights=self.query_weights)[0]
        self.client.post("/api/v1/query", json={"query": query})

Monitoring GPU Utilization

from locust import events
import GPUtil

@events.test_start.add_listener
def on_test_start(environment, **kwargs):
    """Log GPU stats at test start."""
    gpus = GPUtil.getGPUs()
    for gpu in gpus:
        print(f"GPU {gpu.id}: {gpu.memoryUtil*100:.1f}% memory, {gpu.load*100:.1f}% load")

@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    """Log final GPU stats."""
    gpus = GPUtil.getGPUs()
    for gpu in gpus:
        print(f"Final GPU {gpu.id}: {gpu.memoryUtil*100:.1f}% memory")

4. Load Testing with k6: Complete Example

Basic k6 Test

import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate } from 'k6/metrics';

// Custom metrics
const errorRate = new Rate('errors');

// Load test configuration
export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up to 50 users
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 100 },  // Ramp up to 100
    { duration: '5m', target: 100 },  // Stay at 100
    { duration: '2m', target: 0 },    // Ramp down
  ],
  thresholds: {
    'http_req_duration': ['p(95)<3000'],  // 95% of requests under 3s
    'http_req_failed': ['rate<0.05'],     // Error rate under 5%
    'errors': ['rate<0.1'],               // Custom error rate under 10%
  },
};

const queries = [
  "What is the return policy?",
  "How do I reset my password?",
  "What are the system requirements?",
];

export function setup() {
  // Authenticate once before the test; send JSON to match the API contract
  const authResp = http.post(
    'https://api.example.com/auth/token',
    JSON.stringify({ username: 'loadtest', password: 'test123' }),
    { headers: { 'Content-Type': 'application/json' } }
  );
  return { token: authResp.json('access_token') };
}

export default function(data) {
  const query = queries[Math.floor(Math.random() * queries.length)];

  const payload = JSON.stringify({ query: query, top_k: 5 });
  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${data.token}`,
    },
  };

  const response = http.post('https://api.example.com/api/v1/query', payload, params);

  // Verify response
  const success = check(response, {
    'status is 200': (r) => r.status === 200,
    'has answer': (r) => JSON.parse(r.body).answer !== undefined,
    'response time < 5s': (r) => r.timings.duration < 5000,
  });

  errorRate.add(!success);

  sleep(1); // Think time between requests
}

Running k6 Tests

# Local execution
k6 run loadtest.js

# Cloud execution with streaming to Grafana
k6 run --out influxdb=http://localhost:8086/k6 loadtest.js

# CI/CD integration with automated thresholds
k6 run --out json=results.json loadtest.js
# Test fails if thresholds are exceeded

Testing gRPC AI Endpoints with k6

import grpc from 'k6/net/grpc';
import { check } from 'k6';

const client = new grpc.Client();
client.load(['definitions'], 'rag_service.proto');

export default function() {
  client.connect('ai-api.example.com:50051', { plaintext: true });

  const response = client.invoke('rag.RagService/Query', {
    query: 'What is the pricing?',
    top_k: 5,
  });

  check(response, {
    'status is OK': (r) => r && r.status === grpc.StatusOK,
    'answer exists': (r) => r.message.answer.length > 0,
  });

  client.close();
}

5. Key Metrics to Monitor

Track these metrics during every load test:

Latency Metrics

  • p50 (median): Half of requests are faster than this—typical user experience
  • p95: 95% of requests complete within this time—good UX threshold
  • p99: Worst-case latency for 99% of users—critical for SLAs
  • Max latency: Absolute worst case—identifies outliers

Throughput Metrics

  • Requests per second (RPS): Total request volume the system handles
  • Successful RPS: Only counting 200 responses, excluding errors
  • Token throughput: For LLMs, track input tokens/sec and output tokens/sec
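
If your endpoint reports token counts, token throughput is a simple aggregation. The sketch below assumes an OpenAI-style usage block with prompt_tokens and completion_tokens; adjust the field names to whatever your API actually returns:

def token_throughput(responses, wall_time_s):
    """Compute input/output tokens per second across a batch of JSON responses."""
    prompt_tokens = sum(r["usage"]["prompt_tokens"] for r in responses)
    completion_tokens = sum(r["usage"]["completion_tokens"] for r in responses)
    return prompt_tokens / wall_time_s, completion_tokens / wall_time_s

# Example: responses collected over a 10-minute window
# in_tps, out_tps = token_throughput(collected_responses, wall_time_s=600)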

Error Metrics

  • Error rate: Percentage of failed requests (4xx, 5xx, timeouts)
  • Timeout rate: Requests exceeding timeout threshold
  • Quality failures: Empty responses, hallucinations, missing sources

Resource Metrics

  • GPU utilization: % of GPU compute used, memory consumption
  • CPU usage: For retrieval, embeddings, and orchestration logic
  • Memory: RAM usage, especially for vector database caching
  • Database load: Vector DB query latency and throughput

6. Load Testing Strategies

Baseline Testing

Establish normal performance with minimal load (1-10 users). This provides a reference for comparison.

Stress Testing

Gradually increase load until the system breaks. Identify the breaking point and failure mode. For AI systems, watch for:

  • GPU out-of-memory errors
  • Request queue saturation
  • Database connection pool exhaustion
  • API rate limit hits

Spike Testing

Simulate sudden traffic bursts (e.g., product launch, viral content). Ramp from 10 to 500 users in 30 seconds. Check if autoscaling responds fast enough and whether the system gracefully handles the spike or crashes.
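
With Locust, a custom LoadTestShape makes the spike reproducible. A minimal sketch (timings are illustrative; the class lives in your locustfile, and Locust uses it in place of --users/--spawn-rate):

from locust import LoadTestShape

class SpikeShape(LoadTestShape):
    """Hold a small baseline, spike hard for two minutes, then ramp back down."""

    def tick(self):
        run_time = self.get_run_time()
        if run_time < 120:
            return (10, 10)    # baseline: 10 users
        if run_time < 270:
            return (500, 17)   # spike: ramp 10 -> 500 in ~30s, then hold
        if run_time < 330:
            return (10, 17)    # recovery: ramp back down
        return None            # stop the test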

Soak Testing

Run sustained load for hours or days to detect memory leaks, resource exhaustion, or gradual degradation. AI systems are particularly prone to slow memory leaks in embedding caches or vector stores.

Breakpoint Testing

Find the maximum sustainable load. Incrementally increase users until latency or error rate exceeds acceptable thresholds. This defines your capacity planning ceiling.
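
One way to automate this in Locust is a stepped LoadTestShape that keeps adding users; combine it with --run-time, or stop the run once latency or error thresholds trip. A sketch, assuming 25 users added every two minutes:

from locust import LoadTestShape

class StepLoadShape(LoadTestShape):
    """Increase load in fixed steps to find the maximum sustainable level."""

    step_users = 25       # users added per step
    step_duration = 120   # seconds per step
    max_users = 500       # hard ceiling for the test

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_duration) + 1
        users = min(current_step * self.step_users, self.max_users)
        return (users, self.step_users)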

7. AI-Specific Load Testing Considerations

Testing for GPU Memory Saturation

AI models behave unpredictably when GPU memory fills up. Run load tests that deliberately push memory limits:

# Locust task that sends very long queries
@task
def long_query(self):
    # 2000-word query to maximize memory usage
    long_text = " ".join(["word"] * 2000)
    self.client.post("/api/v1/query", json={"query": long_text})

Monitor GPU memory throughout the test; a continuous polling sketch follows the checklist below. When memory approaches 95%, check whether the system:

  • Gracefully rejects requests with 503 errors
  • Queues requests and processes them when memory frees up
  • Crashes with OOM errors (unacceptable)
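
The start/stop listeners from Section 3 only capture snapshots; a background greenlet can sample GPU memory for the whole run. A sketch using GPUtil and gevent (which Locust already depends on). It assumes the load generator can see the GPU host directly; in many setups you would read the same numbers from your monitoring stack instead:

import gevent
import GPUtil
from locust import events

_gpu_poller = None

def _log_gpu_stats(interval_s=5):
    # Sample GPU memory and load every few seconds for the duration of the test
    while True:
        for gpu in GPUtil.getGPUs():
            print(f"GPU {gpu.id}: {gpu.memoryUtil*100:.0f}% memory, {gpu.load*100:.0f}% load")
        gevent.sleep(interval_s)

@events.test_start.add_listener
def start_gpu_polling(environment, **kwargs):
    global _gpu_poller
    _gpu_poller = gevent.spawn(_log_gpu_stats)

@events.test_stop.add_listener
def stop_gpu_polling(environment, **kwargs):
    if _gpu_poller is not None:
        _gpu_poller.kill()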

Testing Model Inference Batching

AI systems often batch requests to maximize GPU utilization. Test how batching affects latency at several concurrency levels (a standalone probe sketch follows this list):

  • Single request: Baseline latency
  • 5 concurrent requests: Does batching improve throughput without excessive latency?
  • 50 concurrent requests: Are batch sizes capped appropriately?
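
A small standalone probe outside Locust makes the comparison explicit. A rough sketch using a thread pool against a hypothetical staging endpoint (the URL and query are placeholders):

import statistics
import time
from concurrent.futures import ThreadPoolExecutor

import requests

API_URL = "https://staging-api.example.com/api/v1/query"  # hypothetical staging endpoint

def timed_request(query):
    start = time.perf_counter()
    requests.post(API_URL, json={"query": query, "top_k": 5}, timeout=30)
    return time.perf_counter() - start

def probe(concurrency, n_requests=20):
    """Fire n_requests at the given concurrency and report median/max latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_request, ["What is the pricing?"] * n_requests))
    return statistics.median(latencies), max(latencies)

for level in (1, 5, 50):
    median_s, worst_s = probe(level)
    print(f"concurrency={level}: median={median_s:.2f}s max={worst_s:.2f}s")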

Testing Vector Database Under Load

RAG systems rely on fast vector similarity search. Run load tests that specifically target retrieval so you can isolate vector database performance:

@task
def retrieval_only(self):
    """Test retrieval without generation to isolate vector DB performance."""
    query = random.choice(self.queries)
    with self.client.post(
        "/api/v1/retrieve",
        json={"query": query, "top_k": 10},
        catch_response=True
    ) as response:
        # response.failure()/success() require catch_response=True
        if response.elapsed.total_seconds() > 0.5:
            response.failure(f"Retrieval too slow: {response.elapsed.total_seconds()}s")
        else:
            response.success()

Testing Cache Hit Rates

Many AI systems cache embeddings or LLM responses. Test with realistic query repetition:

# Simulate roughly a 30% cache hit rate (common in production)
@task
def query_with_cache(self):
    last_query = getattr(self, "last_query", None)
    if last_query and random.random() < 0.3:
        # Repeat the previous query (should hit the cache)
        query = last_query
    else:
        # New query (also becomes the candidate for the next cache hit)
        query = random.choice(self.queries)
        self.last_query = query

    self.client.post("/api/v1/query", json={"query": query})

8. Interpreting Load Test Results

What Good Results Look Like

  • p95 latency under 3 seconds: Most users get fast responses
  • Error rate under 1%: System is stable under load
  • Linear scaling: Doubling infrastructure doubles capacity
  • Graceful degradation: Quality stays high even at peak load

Warning Signs

  • p95 latency > 10 seconds: User experience will suffer
  • Error rate > 5%: System is unstable, needs architectural fixes
  • Non-linear degradation: Small load increases cause large latency spikes (indicates bottleneck)
  • Memory leaks: Gradual memory growth over soak tests

Common Bottlenecks in AI Systems

  • GPU memory saturation: Reduce batch sizes, add more GPUs, or implement queuing
  • Vector DB slowness: Optimize indexes, increase vector DB resources, or use faster DB (Qdrant, Weaviate)
  • Embedding generation: Cache embeddings, use smaller embedding models, or batch embed requests
  • LLM API rate limits: Implement rate limiting on your side, use multiple API keys, or self-host models
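
For the last point, even a minimal client-side token bucket keeps you under provider limits. A sketch (thread-safe and single-process; distributed workers would need a shared store such as Redis):

import threading
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with short bursts up to `capacity`."""

    def __init__(self, rate, capacity=None):
        self.rate = rate
        self.capacity = capacity or rate
        self.tokens = self.capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)  # back off briefly before retrying

bucket = TokenBucket(rate=5)   # cap outbound LLM calls at ~5 per second
bucket.acquire()               # call before each LLM API request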

9. Integrating Load Tests into CI/CD

Automate load testing to catch performance regressions before production:

GitHub Actions Example

name: Load Test
on:
  pull_request:
    paths:
      - 'src/ai_api/**'
  schedule:
    - cron: '0 2 * * 1'  # Weekly Monday 2 AM

jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install Locust
        run: pip install locust

      - name: Run load test
        run: |
          locust -f tests/load/locustfile.py --headless \
            --users 50 --spawn-rate 10 --run-time 5m \
            --host https://staging-api.example.com \
            --csv results/load_test

      - name: Check thresholds
        run: |
          python scripts/check_load_thresholds.py results/load_test_stats.csv

      - name: Upload results
        uses: actions/upload-artifact@v3
        with:
          name: load-test-results
          path: results/

Threshold Checking Script

# scripts/check_load_thresholds.py
import pandas as pd
import sys

# The CI job passes the stats CSV path as the first argument
df = pd.read_csv(sys.argv[1])

# Define thresholds
MAX_P95_LATENCY = 3000  # ms
MAX_ERROR_RATE = 0.05   # 5%

# Check p95 latency
p95_latency = df['95%'].max()
if p95_latency > MAX_P95_LATENCY:
    print(f"FAIL: p95 latency {p95_latency}ms exceeds {MAX_P95_LATENCY}ms")
    sys.exit(1)

# Check error rate
total_requests = df['Total Request Count'].sum()
failed_requests = df['Total Failure Count'].sum()
error_rate = failed_requests / total_requests if total_requests > 0 else 0

if error_rate > MAX_ERROR_RATE:
    print(f"FAIL: Error rate {error_rate:.2%} exceeds {MAX_ERROR_RATE:.2%}")
    sys.exit(1)

print("PASS: All load test thresholds met")
sys.exit(0)

10. Cost Management for Load Testing

Load testing AI endpoints can be expensive. Strategies to control costs:

Use Staging Endpoints

Never load test production. Use dedicated staging environments with smaller, cheaper models for load testing.

Limit Test Duration

Short, focused tests (5-10 minutes) provide sufficient data without excessive API costs. Reserve long-duration soak tests for pre-launch validation.

Throttle Request Rate

You don't need 1000 RPS to find bottlenecks. Start with 10 RPS, then 50, then 100. Identify issues early before burning budget.
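
Locust's constant_throughput wait time (available in recent Locust versions) caps how often each simulated user fires, so total RPS is roughly users × per-user rate. A minimal sketch:

from locust import HttpUser, task, constant_throughput

class ThrottledRAGUser(HttpUser):
    # Each user sends at most 0.1 requests/second,
    # so 100 users cap the whole test at roughly 10 RPS.
    wait_time = constant_throughput(0.1)

    @task
    def query(self):
        self.client.post("/api/v1/query", json={"query": "What is the return policy?"})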

Use Synthetic Queries

Don't send full production query diversity. Use a representative sample of 50-100 queries. This reduces costs while still testing realistic patterns.

Key Takeaways

Load testing AI endpoints requires specialized strategies beyond traditional API testing. Use Locust for Python-integrated, flexible test scenarios or k6 for maximum performance and CI/CD integration. Monitor latency distributions (p50, p95, p99), not just averages. Track AI-specific metrics like GPU utilization, quality degradation, and cache hit rates.

Test for GPU memory saturation, vector database load, and batching behavior. Integrate load tests into CI/CD pipelines with automated threshold checks. Always test in staging environments, not production, and control costs with short tests and synthetic query sets.

The best time to find your system's breaking point is during load testing, not during your product launch.