

Building Production-Ready AI Agents

OpenTelemetry AI observability, semantic caching for 30-60% token reduction, horizontal scaling patterns, and blue-green deployments. The production agent playbook.

Jorgo Bardho

Founder, Thread Transfer

July 10, 2025 · 19 min read
AI agents · production · observability · scaling · security
[Image: Production AI agent architecture]

Enterprises are spending $50-250M on generative AI initiatives in 2025. Arize raised a $70M Series C in February. New Relic, Splunk, and Cisco have all launched agent observability products, and OpenTelemetry has established semantic conventions for LLM tracing. Yet most agent deployments still fail in production. Demo agents work; production agents face adversarial inputs, flaky tools, context window limits, cost constraints, security attacks, and scale challenges. Traditional monitoring tracks infrastructure metrics. AI observability needs hallucination detection, reasoning evaluation, multi-step workflow tracing, and governance controls. This is a comprehensive breakdown of deployment architecture, the observability stack, scaling strategies, security hardening, and cost management for shipping reliable agents.

The production readiness gap: demo vs deployment

Demo agent: works with clean inputs, perfect tool availability, unlimited budget. Production agent: handles malicious inputs, fails gracefully when tools are down, operates within cost/latency budgets, scales to 10K+ users, complies with regulatory requirements.

The readiness checklist

Category | Demo requirement | Production requirement
Input handling | Clean, well-formed queries | Sanitization, validation, adversarial input protection
Tool reliability | Assumes tools always work | Retries, circuit breakers, fallback strategies
Error handling | Basic try/catch | Graceful degradation, user-friendly error messages, logging
Latency | No strict requirements | P95 < 5s, timeout enforcement, parallel execution
Cost | Unlimited budget | Token budgets, cost per task < target, optimization
Scale | 1-10 concurrent users | 10K+ users, horizontal scaling, rate limiting
Observability | Print statements | Distributed tracing, metrics, structured logging, dashboards
Security | None | Authentication, authorization, audit logging, secrets management
Compliance | Not considered | Data residency, audit trails, regulatory approvals

Deployment architecture: infrastructure patterns

Production agent systems require robust infrastructure. Choices: cloud-hosted vs self-hosted, monolith vs microservices, synchronous vs asynchronous, stateful vs stateless.

Cloud-hosted vs self-hosted

Cloud-hosted advantages:

  • Rapid deployment (hours, not weeks)
  • Automatic scaling based on load
  • Managed updates and patches
  • Pay-per-use pricing (lower upfront cost)

Cloud-hosted tradeoffs:

  • Data residency restrictions in regulated industries
  • Vendor lock-in to specific platforms
  • Higher long-term costs at scale
  • Limited customization options

Self-hosted advantages:

  • Complete control over data and infrastructure
  • No vendor lock-in
  • Lower costs at large scale
  • Full customization capability

Self-hosted tradeoffs:

  • Requires dedicated operations expertise
  • Higher upfront investment
  • Responsible for security and compliance
  • Manual scaling and updates

Recommended architecture: hybrid approach

LLM inference: cloud-hosted (OpenAI, Anthropic, Google APIs)
Agent orchestration: self-hosted or managed Kubernetes
Vector databases: managed services (Pinecone, Weaviate Cloud)
State management: Redis or DynamoDB
Observability: managed platforms (Langfuse, Arize, LangSmith)

This balances rapid deployment (cloud LLM APIs), control (self-hosted orchestration), and operational simplicity (managed databases and observability that scale without dedicated expertise).

Containerization and orchestration

Containerize agent services using Docker. Deploy to Kubernetes for automatic scaling, health checks, self-healing.

Essential Kubernetes features for agents:

  • Horizontal Pod Autoscaler: scale agent pods based on CPU, memory, or custom metrics (request rate)
  • Liveness/readiness probes: detect unhealthy pods and restart automatically
  • Resource limits: prevent runaway token consumption from exhausting cluster resources
  • ConfigMaps/Secrets: manage prompts, API keys, configuration without code changes

Synchronous vs asynchronous execution

Synchronous (request/response):

  • User waits for agent to complete task
  • Simple to implement and reason about
  • Works for tasks completing in < 30 seconds
  • Poor user experience for long-running tasks

Asynchronous (message queue):

  • User receives immediate acknowledgment, task executes in background
  • Agent posts result when complete (webhook, WebSocket, polling)
  • Required for tasks taking > 30 seconds
  • More complex: need queue (RabbitMQ, SQS), job tracking, result delivery mechanism

Hybrid pattern:

  • Fast tasks (< 5s): synchronous
  • Medium tasks (5-30s): synchronous with streaming updates
  • Long tasks (> 30s): asynchronous with webhook notification
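
A minimal routing sketch for this hybrid pattern, assuming a per-task duration estimate; run_agent, the duration heuristic, and the in-process queue are placeholders for your real agent runner, task classifier, and message broker.

import queue
import uuid

job_queue: queue.Queue = queue.Queue()            # stand-in for RabbitMQ/SQS

def run_agent(task: dict) -> str:                 # placeholder for the real agent call
    return f"handled {task['type']}"

def estimate_duration_s(task: dict) -> float:
    # Hypothetical heuristic: expected duration by task type.
    return {"faq": 2.0, "summary": 15.0, "report": 120.0}.get(task["type"], 10.0)

def handle_request(task: dict) -> dict:
    seconds = estimate_duration_s(task)
    if seconds < 5:                               # fast: answer inline
        return {"mode": "sync", "result": run_agent(task)}
    if seconds <= 30:                             # medium: same call, streamed to the client
        return {"mode": "stream", "result": run_agent(task)}
    job_id = str(uuid.uuid4())                    # long: acknowledge now, deliver later
    job_queue.put((job_id, task))
    return {"mode": "async", "job_id": job_id, "status": "accepted"}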

Observability: the essential stack for production agents

Traditional monitoring tracks infrastructure metrics (CPU, memory, requests/sec). AI observability adds: hallucination detection, reasoning quality, tool calling success, token consumption, multi-step workflow tracing, governance controls.

The five pillars of AI observability

1. Distributed tracing

Capture complete execution paths across agent workflows. Visibility into every LLM call, tool invocation, memory access. OpenTelemetry conventions enable vendor-neutral tracing.

Example trace for customer support agent:

TraceID: abc123
Span 1: User query received (duration: 5ms)
Span 2: Intent classification (LLM call, duration: 320ms, tokens: 150)
Span 3: Memory retrieval (vector search, duration: 80ms, results: 5)
Span 4: Tool call - get_customer_data (API, duration: 240ms)
Span 5: Tool call - get_order_history (API, duration: 180ms)
Span 6: Response generation (LLM call, duration: 450ms, tokens: 280)
Total duration: 1,275ms
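
A minimal Python sketch of emitting such a trace with the OpenTelemetry SDK, using the GenAI semantic convention attribute names discussed later in this piece. The model names, token counts, and tool stubs are illustrative, not measurements from a real deployment.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Export spans to the console for demonstration; production would use an OTLP exporter.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("support-agent")

def answer(query: str) -> str:
    with tracer.start_as_current_span("agent.request"):
        with tracer.start_as_current_span("intent_classification") as span:
            span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
            span.set_attribute("gen_ai.usage.input_tokens", 150)     # illustrative
            intent = "order_status"                                   # stand-in for the LLM call
        with tracer.start_as_current_span("tool.get_order_history") as span:
            span.set_attribute("gen_ai.tool.name", "get_order_history")
            orders = []                                               # stand-in for the API call
        with tracer.start_as_current_span("response_generation") as span:
            span.set_attribute("gen_ai.request.model", "gpt-4o")
            span.set_attribute("gen_ai.usage.output_tokens", 280)     # illustrative
            return f"Intent {intent}: found {len(orders)} orders."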

2. Quality evaluation

Measure AI-specific dimensions beyond error rates:

Dimension | Measurement | Threshold
Hallucination rate | % of responses containing ungrounded claims | < 5%
Response grounding | % of claims supported by retrieved context | > 90%
Relevance score | Semantic similarity between query and response | > 0.85
Task completion | % of workflows completing successfully | > 95%
Tool selection accuracy | % of correct tool choices | > 90%

3. Metrics and dashboards

Real-time visibility into agent health:

  • Request rate, error rate, latency (P50, P95, P99)
  • Token consumption (per request, per user, per day)
  • Cost metrics ($ per task, $ per user, $ per day)
  • Tool calling metrics (success rate, latency by tool)
  • Model performance (by model version, by prompt version)

4. Structured logging

Every agent action logged with correlation IDs:

{
  "timestamp": "2025-07-10T14:23:45Z",
  "traceId": "abc123",
  "spanId": "span456",
  "userId": "user789",
  "agentId": "support-agent-v2",
  "eventType": "tool_call",
  "toolName": "get_customer_data",
  "arguments": {"customer_id": "cust-456"},
  "result": {"status": "success", "latency_ms": 240},
  "tokens": {"input": 0, "output": 0}
}

5. Governance and compliance

Track and enforce policies:

  • Data access audit: which users accessed what data, when
  • Policy violations: attempts to access unauthorized data, risky tool use
  • Model version tracking: which model handled each request (for regulatory compliance)
  • Cost attribution: spend by user, team, project for chargebacks

OpenTelemetry: the industry standard

OpenTelemetry has emerged as the standard framework for AI observability. Its vendor-neutral approach enables telemetry collection across components from different vendors.

GenAI semantic conventions:

  • Standardized attribute names for LLM calls (gen_ai.request.model, gen_ai.response.finish_reason)
  • Tool invocation tracking (gen_ai.tool.name, gen_ai.tool.arguments)
  • Token consumption (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens)

Adoption across platforms: LangChain, LangSmith, Arize, and Langfuse all support OpenTelemetry, which enables switching observability providers without code changes.

Leading observability platforms

Langfuse:

  • Open-source LLM engineering platform
  • Tracing, prompt management, evaluation, datasets
  • Self-hosted or cloud options
  • Strong integration with LangChain ecosystem

Arize:

  • $70M Series C (February 2025)
  • ML observability + LLM observability
  • Advanced drift detection, evaluation suites
  • Enterprise-focused with compliance features

LangSmith:

  • Official LangChain observability platform
  • Deep integration with LangChain/LangGraph
  • Prompt playground, dataset management, testing
  • Best-in-class for LangChain users

Maxim AI and Galileo:

  • Focus on evaluation and quality metrics
  • Automated hallucination detection
  • Guardrails and safety monitoring

Scaling strategies: from 10 to 10,000 users

Agent systems have different scaling bottlenecks than traditional apps. LLM inference latency doesn't improve with horizontal scaling. Token costs scale linearly with users. Vector DB query latency increases with index size.

Horizontal scaling: stateless design

Design agents to be stateless: session state lives in an external store (Redis, DynamoDB), not in process memory. This enables adding agent instances without coordination, as in the sketch below.
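
A minimal sketch of externalized session state using Redis (the redis-py client); the key naming scheme and TTL are arbitrary choices, not a prescribed schema.

import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
SESSION_TTL_S = 3600  # expire idle sessions after an hour

def load_session(session_id: str) -> dict:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else {"messages": []}

def save_session(session_id: str, state: dict) -> None:
    # Any agent instance can serve the next turn because nothing lives in process memory.
    r.setex(f"session:{session_id}", SESSION_TTL_S, json.dumps(state))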

Scaling checkpoints:

User count | Architecture | Key changes
0-100 | Single server | Simple deployment, no load balancing
100-1,000 | 2-3 instances + load balancer | Add Redis for session state
1,000-10,000 | Auto-scaling pool (5-20 instances) | Implement caching, optimize token usage
10,000+ | Multi-region deployment | CDN for static assets, database sharding

Caching: semantic and deterministic

Semantic caching:

  • Cache responses for semantically similar queries
  • Embed query, search for cached responses with similarity > 0.95
  • Reduces LLM calls by 30-60% for repetitive queries
  • Implementation: vector DB (Pinecone, Redis Vector) for cache storage
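
A toy in-process sketch of the idea, using cosine similarity over unit-normalized embeddings; the embed function is a placeholder for your embedding model, and a real deployment would store vectors in Pinecone or Redis Vector rather than a Python list.

import numpy as np

SIMILARITY_THRESHOLD = 0.95
_cache: list[tuple[np.ndarray, str]] = []   # (query embedding, cached response)

def embed(text: str) -> np.ndarray:
    # Placeholder: call your embedding model here and L2-normalize the result.
    vec = np.random.default_rng(abs(hash(text)) % 2**32).normal(size=256)
    return vec / np.linalg.norm(vec)

def cached_answer(query: str, call_llm) -> str:
    q = embed(query)
    for vec, response in _cache:
        if float(np.dot(q, vec)) >= SIMILARITY_THRESHOLD:   # cosine similarity on unit vectors
            return response                                  # cache hit: skip the LLM call
    response = call_llm(query)
    _cache.append((q, response))
    return response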

Deterministic caching:

  • Cache responses for exact query matches
  • Works for API calls, database queries, deterministic computations
  • Simpler than semantic caching, but lower hit rate

Cache invalidation strategy:

  • Time-based: expire after N hours (for dynamic data)
  • Version-based: invalidate when prompt/model version changes
  • Event-based: invalidate when underlying data changes

Rate limiting and cost controls

Prevent abuse and manage costs. Rate limits at multiple levels:

Level | Limit type | Example threshold
User | Requests per minute | 20 requests/min (free tier), 100 requests/min (paid)
User | Tokens per day | 50K tokens/day (free), 500K tokens/day (paid)
Organization | Cost per month | $1K/month soft limit, $5K/month hard limit
Task | Token budget per task | 10K tokens max, kill if exceeded

Enforcement: API gateway (Kong, AWS API Gateway) for request limits, application-level checks for token/cost budgets.
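
A minimal in-process sketch of per-user limits using the free-tier thresholds from the table; in production the request-rate check usually lives in the API gateway and the token budget in the application, backed by a shared store such as Redis.

import time
from collections import defaultdict, deque

REQUESTS_PER_MIN = 20        # free tier, from the table above
TOKENS_PER_DAY = 50_000

_requests: dict[str, deque] = defaultdict(deque)   # user -> timestamps of recent requests
_tokens_today: dict[str, int] = defaultdict(int)   # user -> tokens consumed today (reset daily)

def check_limits(user_id: str, estimated_tokens: int) -> bool:
    now = time.time()
    window = _requests[user_id]
    while window and now - window[0] > 60:          # drop requests older than a minute
        window.popleft()
    if len(window) >= REQUESTS_PER_MIN:
        return False                                # request-rate limit hit
    if _tokens_today[user_id] + estimated_tokens > TOKENS_PER_DAY:
        return False                                # daily token budget hit
    window.append(now)
    _tokens_today[user_id] += estimated_tokens
    return True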

Async processing for long-running tasks

Tasks taking > 30 seconds must be asynchronous. Pattern: job queue + worker pool + result delivery.

Implementation:

  1. User submits task, receives job ID immediately
  2. Task enqueued in message queue (RabbitMQ, SQS, Redis Queue)
  3. Worker pool (separate from API servers) processes jobs
  4. Worker stores result in database/object storage
  5. Worker notifies user (webhook, WebSocket, or user polls with job ID)
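
A compressed sketch of steps 1-5, with an in-process queue and dict standing in for the message broker and result store; the agent call, retry policy, and webhook delivery are placeholders.

import queue
import threading
import uuid

jobs: queue.Queue = queue.Queue()        # stand-in for RabbitMQ/SQS/Redis Queue
results: dict[str, dict] = {}            # stand-in for a database or object store

def submit(task: dict) -> str:
    job_id = str(uuid.uuid4())
    results[job_id] = {"status": "queued"}
    jobs.put((job_id, task))
    return job_id                        # returned to the user immediately (step 1)

def worker() -> None:
    while True:
        job_id, task = jobs.get()
        try:
            output = f"processed {task}"                              # real agent call goes here (step 3)
            results[job_id] = {"status": "done", "output": output}    # step 4
            # Step 5: notify via webhook/WebSocket, or let the client poll results[job_id].
        except Exception as exc:
            results[job_id] = {"status": "failed", "error": str(exc)}  # retry without a new user request
        finally:
            jobs.task_done()

threading.Thread(target=worker, daemon=True).start()   # scale by adding workers based on queue depth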

Benefits:

  • API servers not blocked by long tasks
  • Workers can scale independently based on queue depth
  • Failed jobs can retry without re-submitting user request
  • Priority queues for different user tiers

Security hardening: protecting production agents

Agents are attack surfaces: prompt injection, data exfiltration, unauthorized tool use, cost exhaustion. Security is not optional.

Input sanitization and validation

Threat model:

  • Prompt injection: user input contains instructions that override system prompt
  • Data exfiltration: user tricks agent into revealing sensitive data
  • Resource exhaustion: user submits inputs that cause excessive token consumption

Mitigations:

  • Input length limits (prevent token bombs)
  • Pattern detection (flag/block known injection patterns)
  • User input sandboxing (clearly delimit user input in prompts)
  • LLM-as-moderator (separate LLM evaluates input for malicious intent before processing)
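
A minimal sketch combining the first three mitigations; the regex list is illustrative only, not a complete injection filter, and the delimiter tags are an arbitrary convention.

import re

MAX_INPUT_CHARS = 4_000                      # length limit against token bombs
INJECTION_PATTERNS = [                       # illustrative patterns, not an exhaustive list
    r"ignore (all )?previous instructions",
    r"reveal (your )?system prompt",
    r"you are now",
]

def sanitize(user_input: str) -> str:
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    lowered = user_input.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    # Sandbox the input: clearly delimit it so instructions and user text are distinguishable.
    return f"<user_input>\n{user_input}\n</user_input>"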

Tool access controls

Not all tools should be available to all users in all contexts.

Access control layers:

Layer | Control | Example
User authentication | Only authenticated users access the agent | OAuth, API keys, JWTs
Role-based access | Different tool sets per role | Admin sees delete_user, regular user doesn't
Data-level permissions | Tools filter results by user permissions | get_customers only returns customers the user owns
High-risk tool approval | Destructive tools require confirmation | delete_all_records requires human approval
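
A minimal sketch of the authentication-adjacent layers (role-based access and high-risk approval); the role-to-tool mapping and tool names are hypothetical, and data-level filtering still belongs inside each tool.

ROLE_TOOLS = {                          # hypothetical role-to-tool mapping
    "admin":   {"get_customer_data", "get_order_history", "delete_user"},
    "support": {"get_customer_data", "get_order_history"},
    "viewer":  {"get_order_history"},
}
HIGH_RISK_TOOLS = {"delete_user"}       # destructive tools gated on human approval

def authorize_tool_call(role: str, tool_name: str, approved_by_human: bool = False) -> bool:
    if tool_name not in ROLE_TOOLS.get(role, set()):
        return False                    # role-based access check
    if tool_name in HIGH_RISK_TOOLS and not approved_by_human:
        return False                    # high-risk tool requires explicit confirmation
    return True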

Secrets management

Never hardcode API keys, database passwords, or other secrets. Use dedicated secrets management.

Options:

  • Cloud provider solutions: AWS Secrets Manager, Google Secret Manager, Azure Key Vault
  • Self-hosted: HashiCorp Vault, Doppler
  • Container orchestration: Kubernetes Secrets (encrypt at rest)

Best practices:

  • Rotate secrets regularly (every 90 days minimum)
  • Use short-lived tokens when possible
  • Audit secret access (who accessed what, when)
  • Separate secrets per environment (dev, staging, prod)
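
A small sketch of reading a secret from AWS Secrets Manager via boto3, falling back to the environment for local development; the secret name is a placeholder.

import os
import boto3  # pip install boto3

def get_secret(name: str) -> str:
    # Environment variables in local development, the managed store everywhere else.
    if value := os.environ.get(name):
        return value
    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=name)["SecretString"]

# openai_key = get_secret("prod/agent/openai-api-key")   # hypothetical secret name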

Audit logging for compliance

Regulated industries require comprehensive audit trails. Log all agent actions with attribution.

Required audit data:

  • User identity (who made the request)
  • Timestamp (when)
  • Action taken (tool called, data accessed)
  • Result (success/failure, data returned)
  • Justification (why the agent took the action: in response to the user query or as an autonomous decision)

Retention and immutability:

  • Store audit logs separately from application logs
  • Immutable storage (append-only, tamper-evident)
  • Retention periods per regulatory requirements (7 years for financial, varies by industry)

Cost management: shipping within budget

Token costs are the primary variable cost for agent systems. Unoptimized agents can cost 10-100x more than optimized ones.

Cost breakdown for typical agent task

Component | Tokens | Cost (GPT-4o) | % of total
System prompt + tool definitions | 3,000 | $0.015 | 40%
User query | 100 | $0.0005 | 1%
Tool planning (3 calls) | 600 | $0.003 | 8%
Tool results injection | 1,500 | $0.0075 | 20%
Conversation history | 2,000 | $0.010 | 27%
Response generation | 300 | $0.0045 | 4%
Total | 7,500 | $0.037 | 100%

Cost optimization strategies

1. Compress system prompts and tool definitions

  • Remove verbose descriptions, keep only essential information
  • Use shorter parameter names
  • Dynamically load only relevant tools
  • Impact: 20-30% reduction in system prompt size

2. Aggressive conversation history compression

  • Don't pass full history every turn
  • Extract key facts, discard intermediate reasoning
  • Use Thread Transfer bundles for deterministic compression
  • Impact: 50-70% reduction in context size

3. Hybrid model strategy

  • Cheap model (GPT-4o-mini) for classification, tool calling, extraction
  • Expensive model (GPT-4o) for complex reasoning, user-facing generation
  • Impact: 60-80% cost reduction vs using expensive model everywhere
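
A minimal routing sketch for the hybrid strategy; the model names match the examples above, while the step names and step-to-model mapping are hypothetical.

CHEAP_MODEL = "gpt-4o-mini"       # classification, extraction, tool-call planning
EXPENSIVE_MODEL = "gpt-4o"        # complex reasoning, user-facing generation

CHEAP_STEPS = {"classify_intent", "extract_fields", "plan_tool_calls"}

def model_for_step(step: str) -> str:
    # Route each workflow step to the cheapest model that can handle it.
    return CHEAP_MODEL if step in CHEAP_STEPS else EXPENSIVE_MODEL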

4. Semantic caching

  • Cache responses for similar queries
  • Avoids redundant LLM calls
  • Impact: 30-60% reduction in LLM API costs for repetitive workloads

5. Prompt caching (provider-level)

  • Anthropic, OpenAI, Google offer prompt caching
  • The cached system prompt is billed at a steep discount; only the user message and new history are charged as fresh tokens
  • Impact: 50-90% cost reduction on cached portions

Cost monitoring and alerting

Track in real-time:

  • Cost per task (track by task type)
  • Cost per user (identify expensive users)
  • Cost per day/week/month (budget tracking)
  • Token consumption trends (detect sudden increases)

Alert thresholds:

  • Daily spend exceeds budget by 20%
  • Per-task cost increases by 50% vs baseline
  • Single user consumes > 10% of daily budget

Continuous deployment and rollback

Agent behavior changes with prompt updates, model upgrades, tool modifications. Need safe deployment process.

Blue-green deployments

Run two identical production environments (blue and green). Deploy to green, validate, switch traffic.

Process:

  1. Blue environment serves production traffic
  2. Deploy new version to green environment
  3. Run smoke tests on green (synthetic traffic)
  4. Switch 5% of traffic to green (canary)
  5. Monitor metrics for 1 hour
  6. If metrics stable: gradually increase to 100%
  7. If metrics degrade: instant rollback to blue

Canary deployments with metrics gates

Deployment gates:

Metric | Threshold | Action if violated
Error rate | < 5% (within +1pp of baseline) | Automatic rollback
Task completion rate | > 90% (within -2pp of baseline) | Automatic rollback
P95 latency | < 6s (within +20% of baseline) | Pause rollout, investigate
Hallucination rate | < 5% (within +1pp of baseline) | Automatic rollback
Cost per task | Within +30% of baseline | Pause rollout, investigate
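
A sketch of automated gate checks mirroring the table; baseline and canary metrics would come from the observability platform, and rates are expressed as fractions (0.05 = 5%).

def evaluate_gates(baseline: dict, canary: dict) -> str:
    # Returns "rollback", "pause", or "promote" for the current canary window.
    if canary["error_rate"] > min(0.05, baseline["error_rate"] + 0.01):
        return "rollback"
    if canary["task_completion"] < max(0.90, baseline["task_completion"] - 0.02):
        return "rollback"
    if canary["hallucination_rate"] > min(0.05, baseline["hallucination_rate"] + 0.01):
        return "rollback"
    if canary["p95_latency_s"] > min(6.0, baseline["p95_latency_s"] * 1.20):
        return "pause"                      # investigate before continuing the rollout
    if canary["cost_per_task"] > baseline["cost_per_task"] * 1.30:
        return "pause"
    return "promote"                        # widen the canary traffic share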

Feature flags for prompt versioning

Decouple deployments from releases. Deploy code with multiple prompt versions, control which users see which version via feature flags.

Use cases:

  • A/B test prompt variations
  • Gradual rollout of new prompts to user segments
  • Instant rollback without redeployment (just toggle flag)
  • User-specific customizations (power users get advanced prompts)
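
A minimal sketch of percentage-based prompt rollout behind a flag; the prompt registry and rollout percentage are hypothetical, and deterministic hashing keeps each user on a stable variant until the flag changes.

import hashlib

PROMPT_VERSIONS = {                              # hypothetical prompt registry
    "v1": "You are a concise support assistant.",
    "v2": "You are a support assistant. Cite retrieved context for every claim.",
}
ROLLOUT_PERCENT_V2 = 10                          # 10% of users see v2; set to 0 for instant rollback

def prompt_for_user(user_id: str) -> str:
    # Deterministic bucketing: the same user always lands in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    version = "v2" if bucket < ROLLOUT_PERCENT_V2 else "v1"
    return PROMPT_VERSIONS[version]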

Production checklist: pre-launch requirements

Infrastructure

  • Containerized deployment with health checks
  • Auto-scaling configured (horizontal for agents, vertical for databases)
  • Load balancing with session affinity if needed
  • Backup and disaster recovery plan tested

Observability

  • Distributed tracing implemented (OpenTelemetry)
  • Dashboards for key metrics (latency, error rate, cost, task completion)
  • Alerting configured with on-call rotation
  • Structured logging with correlation IDs

Performance

  • Load tested to 2x expected peak traffic
  • P95 latency < 5 seconds under load
  • Caching implemented (semantic + deterministic)
  • Token budgets enforced per task and per user

Security

  • Authentication and authorization enforced
  • Input sanitization and validation
  • Secrets stored in dedicated management system
  • Audit logging for all data access and tool use

Cost management

  • Cost per task measured and within budget
  • Rate limits configured by user tier
  • Cost alerts configured for anomalous spend
  • Optimization strategy documented (caching, hybrid models, compression)

Compliance

  • Data residency requirements met
  • Audit logs immutable with required retention
  • Model versions tracked for regulatory compliance
  • Privacy policy and terms of service reviewed

Key takeaways

  • Production readiness requires infrastructure, observability, security, cost management. Demo agents work with clean inputs. Production agents handle adversarial inputs, flaky tools, scale challenges.
  • AI observability is different: traditional monitoring tracks CPU/memory. AI observability adds hallucination detection, reasoning evaluation, multi-step workflow tracing, governance. OpenTelemetry is industry standard.
  • Leading platforms: Langfuse (open-source, LangChain integration), Arize ($70M Series C, enterprise focus), LangSmith (official LangChain platform), Maxim AI and Galileo (evaluation and safety).
  • Scaling strategies: horizontal scaling requires stateless design, semantic caching reduces LLM calls 30-60%, async processing for tasks > 30s, rate limiting prevents abuse.
  • Security hardening: input sanitization (prevent injection), tool access controls (RBAC + data-level permissions), secrets management (rotate every 90 days), audit logging (immutable, retained per regulations).
  • Cost optimization: compress prompts (20-30% savings), compress history (50-70% savings), hybrid models (60-80% savings), semantic caching (30-60% savings), prompt caching (50-90% on cached portions).
  • Safe deployments: blue-green for instant rollback, canary with metrics gates (error rate, latency, cost), feature flags for prompt versioning, automated rollback on threshold violations.