Thread Transfer
Building Production-Ready AI Agents
OpenTelemetry AI observability, semantic caching for 30-60% token reduction, horizontal scaling patterns, and blue-green deployments. The production agent playbook.
Jorgo Bardho
Founder, Thread Transfer
Enterprises are spending $50-250M on generative AI initiatives in 2025. Arize raised a $70M Series C in February. New Relic, Splunk, and Cisco have all launched agent observability products. OpenTelemetry has established semantic conventions for LLM tracing. Yet most agent deployments fail in production. Demo agents work. Production agents face adversarial inputs, flaky tools, context window limits, cost constraints, security attacks, and scale challenges. Traditional monitoring tracks infrastructure metrics; AI observability needs hallucination detection, reasoning evaluation, multi-step workflow tracing, and governance controls. This is a comprehensive breakdown of deployment architecture, observability stack, scaling strategies, security hardening, and cost management for shipping reliable agents.
The production readiness gap: demo vs deployment
Demo agent: works with clean inputs, perfect tool availability, unlimited budget. Production agent: handles malicious inputs, fails gracefully when tools are down, operates within cost/latency budgets, scales to 10K+ users, complies with regulatory requirements.
The readiness checklist
| Category | Demo requirement | Production requirement |
|---|---|---|
| Input handling | Clean, well-formed queries | Sanitization, validation, adversarial input protection |
| Tool reliability | Assumes tools always work | Retries, circuit breakers, fallback strategies |
| Error handling | Basic try/catch | Graceful degradation, user-friendly error messages, logging |
| Latency | No strict requirements | P95 < 5s, timeout enforcement, parallel execution |
| Cost | Unlimited budget | Token budgets, cost per task < target, optimization |
| Scale | 1-10 concurrent users | 10K+ users, horizontal scaling, rate limiting |
| Observability | Print statements | Distributed tracing, metrics, structured logging, dashboards |
| Security | None | Authentication, authorization, audit logging, secrets management |
| Compliance | Not considered | Data residency, audit trails, regulatory approvals |
Deployment architecture: infrastructure patterns
Production agent systems require robust infrastructure. Choices: cloud-hosted vs self-hosted, monolith vs microservices, synchronous vs asynchronous, stateful vs stateless.
Cloud-hosted vs self-hosted
Cloud-hosted advantages:
- Rapid deployment (hours, not weeks)
- Automatic scaling based on load
- Managed updates and patches
- Pay-per-use pricing (lower upfront cost)
Cloud-hosted tradeoffs:
- Data residency restrictions in regulated industries
- Vendor lock-in to specific platforms
- Higher long-term costs at scale
- Limited customization options
Self-hosted advantages:
- Complete control over data and infrastructure
- No vendor lock-in
- Lower costs at large scale
- Full customization capability
Self-hosted tradeoffs:
- Requires dedicated operations expertise
- Higher upfront investment
- Responsible for security and compliance
- Manual scaling and updates
Recommended architecture: hybrid approach
LLM inference: cloud-hosted (OpenAI, Anthropic, Google APIs)
Agent orchestration: self-hosted or managed Kubernetes
Vector databases: managed services (Pinecone, Weaviate Cloud)
State management: Redis or DynamoDB
Observability: managed platforms (Langfuse, Arize, LangSmith)
This balances rapid deployment (cloud LLM APIs), control (self-hosted orchestration), and scalability without operational overhead (managed vector databases and state stores).
Containerization and orchestration
Containerize agent services using Docker. Deploy to Kubernetes for automatic scaling, health checks, self-healing.
Essential Kubernetes features for agents:
- Horizontal Pod Autoscaler: scale agent pods based on CPU, memory, or custom metrics (request rate)
- Liveness/readiness probes: detect unhealthy pods and restart automatically
- Resource limits: prevent runaway token consumption from exhausting cluster resources
- ConfigMaps/Secrets: manage prompts, API keys, configuration without code changes
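The liveness/readiness probes above map to plain HTTP endpoints on the agent service. Below is a minimal sketch assuming a FastAPI service; `check_downstream` is a hypothetical placeholder for real dependency checks.

```python
# Minimal liveness/readiness endpoints for Kubernetes probes.
# Assumes FastAPI; check_downstream() is a hypothetical placeholder.
from fastapi import FastAPI, Response

app = FastAPI()

def check_downstream() -> bool:
    # Placeholder: ping Redis, the vector DB, or the LLM provider here.
    return True

@app.get("/healthz")
def liveness() -> dict:
    # Liveness: the process is up and able to serve requests.
    return {"status": "ok"}

@app.get("/readyz")
def readiness(response: Response) -> dict:
    # Readiness: dependencies are reachable; a 503 tells Kubernetes to
    # stop routing traffic to this pod until they recover.
    ready = check_downstream()
    if not ready:
        response.status_code = 503
    return {"ready": ready}
```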
Synchronous vs asynchronous execution
Synchronous (request/response):
- User waits for agent to complete task
- Simple to implement and reason about
- Works for tasks completing in < 30 seconds
- Poor user experience for long-running tasks
Asynchronous (message queue):
- User receives immediate acknowledgment, task executes in background
- Agent posts result when complete (webhook, WebSocket, polling)
- Required for tasks taking > 30 seconds
- More complex: need queue (RabbitMQ, SQS), job tracking, result delivery mechanism
Hybrid pattern:
- Fast tasks (< 5s): synchronous
- Medium tasks (5-30s): synchronous with streaming updates
- Long tasks (> 30s): asynchronous with webhook notification
Observability: the essential stack for production agents
Traditional monitoring tracks infrastructure metrics (CPU, memory, requests/sec). AI observability adds: hallucination detection, reasoning quality, tool calling success, token consumption, multi-step workflow tracing, governance controls.
The five pillars of AI observability
1. Distributed tracing
Capture complete execution paths across agent workflows. Visibility into every LLM call, tool invocation, memory access. OpenTelemetry conventions enable vendor-neutral tracing.
Example trace for customer support agent:
```
TraceID: abc123
  Span 1: User query received            (duration: 5ms)
  Span 2: Intent classification          (LLM call, duration: 320ms, tokens: 150)
  Span 3: Memory retrieval               (vector search, duration: 80ms, results: 5)
  Span 4: Tool call - get_customer_data  (API, duration: 240ms)
  Span 5: Tool call - get_order_history  (API, duration: 180ms)
  Span 6: Response generation            (LLM call, duration: 450ms, tokens: 280)
Total duration: 1,275ms
```
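A trace like this can be produced by wrapping each step in an OpenTelemetry span. The sketch below assumes the OpenTelemetry Python API and uses the GenAI attribute names discussed later; the placeholder values stand in for real LLM and tool calls.

```python
# Sketch of instrumenting an agent request with nested OpenTelemetry spans.
# Exporter/SDK setup is assumed to be configured elsewhere; without it,
# the API falls back to a no-op tracer.
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def answer_query(query: str) -> str:
    with tracer.start_as_current_span("agent.handle_query"):
        with tracer.start_as_current_span("llm.intent_classification") as span:
            # Attribute names follow the OpenTelemetry GenAI semantic conventions.
            span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
            span.set_attribute("gen_ai.usage.input_tokens", 150)
            intent = "order_status"  # placeholder for a real classification call

        with tracer.start_as_current_span("tool.get_customer_data") as span:
            span.set_attribute("gen_ai.tool.name", "get_customer_data")
            customer = {"id": "cust-456"}  # placeholder for a real API call

        with tracer.start_as_current_span("llm.response_generation") as span:
            span.set_attribute("gen_ai.request.model", "gpt-4o")
            span.set_attribute("gen_ai.usage.output_tokens", 280)
            return f"[{intent}] response for customer {customer['id']}"
```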
2. Quality evaluation
Measure AI-specific dimensions beyond error rates:
| Dimension | Measurement | Threshold |
|---|---|---|
| Hallucination rate | % of responses containing ungrounded claims | < 5% |
| Response grounding | % of claims supported by retrieved context | > 90% |
| Relevance score | Semantic similarity between query and response | > 0.85 |
| Task completion | % of workflows completing successfully | > 95% |
| Tool selection accuracy | % of correct tool choices | > 90% |
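These dimensions need evaluators, not just counters. One common approach is an LLM-as-judge check; the sketch below assumes the OpenAI Python SDK, an illustrative judge prompt, and that the judge replies with a bare number.

```python
# Sketch of estimating response grounding with an LLM-as-judge.
# Prompt wording and threshold are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def grounding_score(response: str, context: str) -> float:
    judge = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Context:\n{context}\n\nResponse:\n{response}\n\n"
                "What fraction of the response's factual claims are supported "
                "by the context? Reply with only a number between 0 and 1."
            ),
        }],
    )
    # Assumes the judge follows instructions and returns a bare number.
    return float(judge.choices[0].message.content.strip())

# Sample responses below the grounding threshold (> 90%) for human review:
# if grounding_score(answer, retrieved_context) < 0.9: flag_for_review(answer)
```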
3. Metrics and dashboards
Real-time visibility into agent health:
- Request rate, error rate, latency (P50, P95, P99)
- Token consumption (per request, per user, per day)
- Cost metrics ($ per task, $ per user, $ per day)
- Tool calling metrics (success rate, latency by tool)
- Model performance (by model version, by prompt version)
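A sketch of exposing these metrics with the prometheus_client library follows; the metric names, labels, and histogram buckets are assumptions, not an established standard.

```python
# Sketch of agent metrics exposed for Prometheus scraping.
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("agent_requests_total", "Agent requests", ["agent", "status"])
LATENCY = Histogram("agent_latency_seconds", "End-to-end request latency",
                    ["agent"], buckets=(0.5, 1, 2, 5, 10, 30))
TOKENS = Counter("agent_tokens_total", "Tokens consumed", ["agent", "direction"])

def record_request(agent: str, status: str, seconds: float,
                   tokens_in: int, tokens_out: int) -> None:
    REQUESTS.labels(agent=agent, status=status).inc()
    LATENCY.labels(agent=agent).observe(seconds)
    TOKENS.labels(agent=agent, direction="input").inc(tokens_in)
    TOKENS.labels(agent=agent, direction="output").inc(tokens_out)

if __name__ == "__main__":
    start_http_server(9100)  # serves /metrics on port 9100
    record_request("support-agent-v2", "success", 1.275, 4250, 280)
```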
4. Structured logging
Every agent action logged with correlation IDs:
```json
{
  "timestamp": "2025-07-10T14:23:45Z",
  "traceId": "abc123",
  "spanId": "span456",
  "userId": "user789",
  "agentId": "support-agent-v2",
  "eventType": "tool_call",
  "toolName": "get_customer_data",
  "arguments": {"customer_id": "cust-456"},
  "result": {"status": "success", "latency_ms": 240},
  "tokens": {"input": 0, "output": 0}
}
```
5. Governance and compliance
Track and enforce policies:
- Data access audit: which users accessed what data, when
- Policy violations: attempts to access unauthorized data, risky tool use
- Model version tracking: which model handled each request (for regulatory compliance)
- Cost attribution: spend by user, team, project for chargebacks
OpenTelemetry: the industry standard
OpenTelemetry has emerged as the standard framework for AI observability. Its vendor-neutral approach enables telemetry collection across components from different vendors.
GenAI semantic conventions:
- Standardized attribute names for LLM calls (gen_ai.request.model, gen_ai.response.finish_reason)
- Tool invocation tracking (gen_ai.tool.name, gen_ai.tool.arguments)
- Token consumption (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens)
Adoption across platforms: LangChain, LangSmith, Arize, and Langfuse all support OpenTelemetry, which enables switching observability providers without code changes.
Leading observability platforms
Langfuse:
- Open-source LLM engineering platform
- Tracing, prompt management, evaluation, datasets
- Self-hosted or cloud options
- Strong integration with LangChain ecosystem
Arize:
- $70M Series C (February 2025)
- ML observability + LLM observability
- Advanced drift detection, evaluation suites
- Enterprise-focused with compliance features
LangSmith:
- Official LangChain observability platform
- Deep integration with LangChain/LangGraph
- Prompt playground, dataset management, testing
- Best-in-class for LangChain users
Maxim AI and Galileo:
- Focus on evaluation and quality metrics
- Automated hallucination detection
- Guardrails and safety monitoring
Scaling strategies: from 10 to 10,000 users
Agent systems have different scaling bottlenecks than traditional apps. LLM inference latency doesn't improve with horizontal scaling. Token costs scale linearly with users. Vector DB query latency increases with index size.
Horizontal scaling: stateless design
Design agents to be stateless: store session state externally (Redis, DynamoDB), not in process memory. This lets you add agent instances without coordination.
Scaling checkpoints:
| User count | Architecture | Key changes |
|---|---|---|
| 0-100 | Single server | Simple deployment, no load balancing |
| 100-1,000 | 2-3 instances + load balancer | Add Redis for session state |
| 1,000-10,000 | Auto-scaling pool (5-20 instances) | Implement caching, optimize token usage |
| 10,000+ | Multi-region deployment | CDN for static assets, database sharding |
Caching: semantic and deterministic
Semantic caching:
- Cache responses for semantically similar queries
- Embed query, search for cached responses with similarity > 0.95
- Reduces LLM calls by 30-60% for repetitive queries
- Implementation: vector DB (Pinecone, Redis Vector) for cache storage
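A minimal in-process sketch of semantic caching, assuming the OpenAI embeddings API and cosine similarity against the 0.95 threshold above; production systems would store vectors in Pinecone or Redis Vector instead of a Python list.

```python
# Sketch of a semantic cache: embed the query, reuse a cached response if a
# previous query is similar enough. In-process store for illustration only.
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached response)

def _embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def lookup(query: str, threshold: float = 0.95) -> str | None:
    q = _embed(query)
    for vec, response in _cache:
        sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
        if sim >= threshold:
            return response  # cache hit: the LLM call is skipped entirely
    return None

def store(query: str, response: str) -> None:
    _cache.append((_embed(query), response))
```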
Deterministic caching:
- Cache responses for exact query matches
- Works for API calls, database queries, deterministic computations
- Simpler than semantic caching, but lower hit rate
Cache invalidation strategy:
- Time-based: expire after N hours (for dynamic data)
- Version-based: invalidate when prompt/model version changes
- Event-based: invalidate when underlying data changes
Rate limiting and cost controls
Prevent abuse and manage costs. Rate limits at multiple levels:
| Level | Limit type | Example threshold |
|---|---|---|
| User | Requests per minute | 20 requests/min (free tier), 100 requests/min (paid) |
| User | Tokens per day | 50K tokens/day (free), 500K tokens/day (paid) |
| Organization | Cost per month | $1K/month soft limit, $5K/month hard limit |
| Task | Token budget per task | 10K tokens max, kill if exceeded |
Enforcement: API gateway (Kong, AWS API Gateway) for request limits, application-level checks for token/cost budgets.
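For the application-level checks, a fixed-window counter in Redis is often enough. The sketch below assumes redis-py and the per-minute limits from the table; the key names and tiers are illustrative.

```python
# Sketch of per-user request limiting with a Redis fixed one-minute window.
import redis

r = redis.Redis()
REQUESTS_PER_MINUTE = {"free": 20, "paid": 100}  # mirrors the table above

def allow_request(user_id: str, tier: str) -> bool:
    key = f"ratelimit:{user_id}"
    count = r.incr(key)      # atomic increment shared across instances
    if count == 1:
        r.expire(key, 60)    # first request starts a new one-minute window
    return count <= REQUESTS_PER_MINUTE.get(tier, REQUESTS_PER_MINUTE["free"])
```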
Async processing for long-running tasks
Tasks taking > 30 seconds must be asynchronous. Pattern: job queue + worker pool + result delivery.
Implementation:
- User submits task, receives job ID immediately
- Task enqueued in message queue (RabbitMQ, SQS, Redis Queue)
- Worker pool (separate from API servers) processes jobs
- Worker stores result in database/object storage
- Worker notifies user (webhook, WebSocket, or user polls with job ID)
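A stripped-down sketch of this pattern using a plain Redis list follows; `run_agent_task` is a hypothetical stand-in for the long-running agent workflow, and a real deployment would use RabbitMQ or SQS with acknowledgements and retries.

```python
# Sketch of the job queue + worker pattern. A Redis list stands in for a real
# message queue; run_agent_task() is a hypothetical long-running agent call.
import json
import uuid
import redis

r = redis.Redis()

def submit_task(payload: dict) -> str:
    job_id = str(uuid.uuid4())
    r.lpush("agent:jobs", json.dumps({"job_id": job_id, "payload": payload}))
    r.hset("agent:status", job_id, "queued")
    return job_id  # returned to the user immediately

def worker_loop() -> None:
    while True:
        _, raw = r.brpop("agent:jobs")  # blocks until a job is available
        job = json.loads(raw)
        r.hset("agent:status", job["job_id"], "running")
        result = run_agent_task(job["payload"])  # hypothetical agent workflow
        r.hset("agent:results", job["job_id"], json.dumps(result))
        r.hset("agent:status", job["job_id"], "done")  # then notify via webhook
```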
Benefits:
- API servers not blocked by long tasks
- Workers can scale independently based on queue depth
- Failed jobs can retry without re-submitting user request
- Priority queues for different user tiers
Security hardening: protecting production agents
Agents are attack surfaces. Prompt injection, data exfiltration, unauthorized tool use, cost exhaustion. Security is not optional.
Input sanitization and validation
Threat model:
- Prompt injection: user input contains instructions that override system prompt
- Data exfiltration: user tricks agent into revealing sensitive data
- Resource exhaustion: user submits inputs that cause excessive token consumption
Mitigations:
- Input length limits (prevent token bombs)
- Pattern detection (flag/block known injection patterns)
- User input sandboxing (clearly delimit user input in prompts)
- LLM-as-moderator (separate LLM evaluates input for malicious intent before processing)
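A basic first-pass filter might look like the sketch below; the length limit and regex patterns are illustrative and should be treated as one layer, not a complete defense against prompt injection.

```python
# Sketch of basic input checks run before a query reaches the agent.
import re

MAX_INPUT_CHARS = 4_000  # rough guard against token bombs
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?(system prompt|hidden instructions)", re.IGNORECASE),
]

def sanitize(user_input: str) -> str:
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern.search(user_input):
            raise ValueError("potential prompt injection detected")
    # Delimit user text so the model can distinguish it from instructions.
    return f"<user_input>\n{user_input}\n</user_input>"
```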
Tool access controls
Not all tools should be available to all users in all contexts.
Access control layers:
| Layer | Control | Example |
|---|---|---|
| User authentication | Only authenticated users access agent | OAuth, API keys, JWTs |
| Role-based access | Different tool sets per role | Admin sees delete_user, regular user doesn't |
| Data-level permissions | Tools filter results by user permissions | get_customers only returns customers user owns |
| High-risk tool approval | Destructive tools require confirmation | delete_all_records requires human approval |
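In code, these layers reduce to a gate in front of every tool call. The sketch below is a simplified role-based check; the role names, tool registry, and approval flag are assumptions.

```python
# Sketch of role-based tool gating with a human-approval gate for
# high-risk tools. TOOL_REGISTRY is a hypothetical name -> function map.
ROLE_TOOLS = {
    "admin": {"get_customer_data", "get_order_history", "delete_user"},
    "support": {"get_customer_data", "get_order_history"},
}
HIGH_RISK_TOOLS = {"delete_user", "delete_all_records"}

def call_tool(role: str, tool_name: str, args: dict, approved: bool = False):
    if tool_name not in ROLE_TOOLS.get(role, set()):
        raise PermissionError(f"role '{role}' may not call {tool_name}")
    if tool_name in HIGH_RISK_TOOLS and not approved:
        raise PermissionError(f"{tool_name} requires human approval")
    return TOOL_REGISTRY[tool_name](**args)  # hypothetical registry lookup
```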
Secrets management
Never hardcode API keys, database passwords, or other secrets. Use dedicated secrets management.
Options:
- Cloud provider solutions: AWS Secrets Manager, Google Secret Manager, Azure Key Vault
- Self-hosted: HashiCorp Vault, Doppler
- Container orchestration: Kubernetes Secrets (encrypt at rest)
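As an example of the first option, here is a sketch of fetching an LLM API key from AWS Secrets Manager with boto3; the secret name and its JSON shape are assumptions.

```python
# Sketch of reading an API key from AWS Secrets Manager at startup.
# Secret name and JSON structure are illustrative assumptions.
import json
import boto3

def get_llm_api_key() -> str:
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId="prod/agent/llm-api-key")
    return json.loads(secret["SecretString"])["api_key"]
```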
Best practices:
- Rotate secrets regularly (every 90 days minimum)
- Use short-lived tokens when possible
- Audit secret access (who accessed what, when)
- Separate secrets per environment (dev, staging, prod)
Audit logging for compliance
Regulated industries require comprehensive audit trails. Log all agent actions with attribution.
Required audit data:
- User identity (who made the request)
- Timestamp (when)
- Action taken (tool called, data accessed)
- Result (success/failure, data returned)
- Justification (why agent took action—based on user query or autonomous decision)
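A sketch of writing one such record follows; the field names mirror the list above, and the append-only file is a stand-in for tamper-evident storage such as a WORM bucket.

```python
# Sketch of an append-only audit record for each agent action.
import json
import time

def write_audit_record(user_id: str, action: str, result: str,
                       justification: str) -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user_id": user_id,
        "action": action,                # tool called or data accessed
        "result": result,                # success/failure plus data reference
        "justification": justification,  # user query or autonomous decision
    }
    # Append-only local file as a stand-in for immutable audit storage.
    with open("audit.log", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```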
Retention and immutability:
- Store audit logs separately from application logs
- Immutable storage (append-only, tamper-evident)
- Retention periods per regulatory requirements (7 years for financial, varies by industry)
Cost management: shipping within budget
Token costs are the primary variable cost for agent systems. Unoptimized agents can cost 10-100x more than optimized ones.
Cost breakdown for typical agent task
| Component | Tokens | Cost (GPT-4o) | % of total |
|---|---|---|---|
| System prompt + tool definitions | 3,000 | $0.0150 | 37% |
| User query | 100 | $0.0005 | 1% |
| Tool planning (3 calls) | 600 | $0.0030 | 7% |
| Tool results injection | 1,500 | $0.0075 | 19% |
| Conversation history | 2,000 | $0.0100 | 25% |
| Response generation | 300 | $0.0045 | 11% |
| Total | 7,500 | $0.0405 | 100% |
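The arithmetic behind the table is simple to automate per task; the sketch below uses the table's implied per-million-token prices, which are assumptions that change as providers reprice.

```python
# Sketch of per-task cost accounting from token counts.
# Prices are the table's implied rates and will drift over time.
INPUT_PRICE_PER_TOKEN = 5.00 / 1_000_000    # $ per input token (assumed)
OUTPUT_PRICE_PER_TOKEN = 15.00 / 1_000_000  # $ per output token (assumed)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_PRICE_PER_TOKEN
            + output_tokens * OUTPUT_PRICE_PER_TOKEN)

# 7,200 input tokens + 300 output tokens, as in the table above:
print(round(task_cost(7_200, 300), 4))  # ~0.0405
```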
Cost optimization strategies
1. Compress system prompts and tool definitions
- Remove verbose descriptions, keep only essential information
- Use shorter parameter names
- Dynamically load only relevant tools
- Impact: 20-30% reduction in system prompt size
2. Aggressive conversation history compression
- Don't pass full history every turn
- Extract key facts, discard intermediate reasoning
- Use Thread Transfer bundles for deterministic compression
- Impact: 50-70% reduction in context size
3. Hybrid model strategy
- Cheap model (GPT-4o-mini) for classification, tool calling, extraction
- Expensive model (GPT-4o) for complex reasoning, user-facing generation
- Impact: 60-80% cost reduction vs using the expensive model everywhere (see the routing sketch below)
4. Semantic caching
- Cache responses for similar queries
- Avoids redundant LLM calls
- Impact: 30-60% reduction in LLM API costs for repetitive workloads
5. Prompt caching (provider-level)
- Anthropic, OpenAI, Google offer prompt caching
- System prompt cached, only user message + history consumed as new tokens
- Impact: 50-90% cost reduction on cached portions
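As a concrete example of strategy 3, the routing sketch below picks a model per step type; it assumes the OpenAI Python SDK and an illustrative task taxonomy.

```python
# Sketch of hybrid model routing: cheap model for mechanical steps,
# expensive model for reasoning and user-facing output.
from openai import OpenAI

client = OpenAI()
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "tool_calling": "gpt-4o-mini",
    "reasoning": "gpt-4o",
    "user_response": "gpt-4o",
}

def run_step(task_type: str, prompt: str) -> str:
    model = MODEL_BY_TASK.get(task_type, "gpt-4o-mini")  # default to cheap
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```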
Cost monitoring and alerting
Track in real-time:
- Cost per task (track by task type)
- Cost per user (identify expensive users)
- Cost per day/week/month (budget tracking)
- Token consumption trends (detect sudden increases)
Alert thresholds:
- Daily spend exceeds budget by 20%
- Per-task cost increases by 50% vs baseline
- Single user consumes > 10% of daily budget
Continuous deployment and rollback
Agent behavior changes with prompt updates, model upgrades, tool modifications. Need safe deployment process.
Blue-green deployments
Run two identical production environments (blue and green). Deploy to green, validate, switch traffic.
Process:
- Blue environment serves production traffic
- Deploy new version to green environment
- Run smoke tests on green (synthetic traffic)
- Switch 5% of traffic to green (canary)
- Monitor metrics for 1 hour
- If metrics stable: gradually increase to 100%
- If metrics degrade: instant rollback to blue
Canary deployments with metrics gates
Deployment gates:
| Metric | Threshold | Action if violated |
|---|---|---|
| Error rate | < 5% (within +1pp of baseline) | Automatic rollback |
| Task completion rate | > 90% (within -2pp of baseline) | Automatic rollback |
| P95 latency | < 6s (within +20% of baseline) | Pause rollout, investigate |
| Hallucination rate | < 5% (within +1pp of baseline) | Automatic rollback |
| Cost per task | Within +30% of baseline | Pause rollout, investigate |
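One way to encode these gates as an automated check, reading the table as "the tighter of the absolute cap and the baseline band wins"; the metric dictionaries and the rollback/pause actions are assumptions.

```python
# Sketch of evaluating canary metrics against the deployment gates above.
# Rates are fractions (0.05 == 5%); latency in seconds, cost in dollars.
def evaluate_gates(current: dict, baseline: dict) -> str:
    if current["error_rate"] > min(0.05, baseline["error_rate"] + 0.01):
        return "rollback"
    if current["hallucination_rate"] > min(0.05, baseline["hallucination_rate"] + 0.01):
        return "rollback"
    if current["task_completion"] < max(0.90, baseline["task_completion"] - 0.02):
        return "rollback"
    if current["p95_latency_s"] > min(6.0, baseline["p95_latency_s"] * 1.2):
        return "pause"    # pause rollout and investigate
    if current["cost_per_task"] > baseline["cost_per_task"] * 1.3:
        return "pause"
    return "promote"      # safe to shift more traffic to the new version
```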
Feature flags for prompt versioning
Decouple deployments from releases. Deploy code with multiple prompt versions, control which users see which version via feature flags.
Use cases:
- A/B test prompt variations
- Gradual rollout of new prompts to user segments
- Instant rollback without redeployment (just toggle flag)
- User-specific customizations (power users get advanced prompts)
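A percentage-rollout flag for prompt versions can be as simple as the sketch below; the prompt texts, rollout store, and bucketing scheme are illustrative.

```python
# Sketch of a feature flag that routes a share of users to a new prompt
# version, with deterministic bucketing so each user sees a stable variant.
import hashlib

PROMPT_VERSIONS = {
    "v1": "You are a support agent. Answer using the provided tools.",
    "v2": "You are a concise support agent. Prefer tool results over recall.",
}
ROLLOUT_PERCENT_V2 = 10  # set to 0 for an instant rollback, no redeploy

def prompt_for(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    version = "v2" if bucket < ROLLOUT_PERCENT_V2 else "v1"
    return PROMPT_VERSIONS[version]
```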
Production checklist: pre-launch requirements
Infrastructure
- Containerized deployment with health checks
- Auto-scaling configured (horizontal for agents, vertical for databases)
- Load balancing with session affinity if needed
- Backup and disaster recovery plan tested
Observability
- Distributed tracing implemented (OpenTelemetry)
- Dashboards for key metrics (latency, error rate, cost, task completion)
- Alerting configured with on-call rotation
- Structured logging with correlation IDs
Performance
- Load tested to 2x expected peak traffic
- P95 latency < 5 seconds under load
- Caching implemented (semantic + deterministic)
- Token budgets enforced per task and per user
Security
- Authentication and authorization enforced
- Input sanitization and validation
- Secrets stored in dedicated management system
- Audit logging for all data access and tool use
Cost management
- Cost per task measured and within budget
- Rate limits configured by user tier
- Cost alerts configured for anomalous spend
- Optimization strategy documented (caching, hybrid models, compression)
Compliance
- Data residency requirements met
- Audit logs immutable with required retention
- Model versions tracked for regulatory compliance
- Privacy policy and terms of service reviewed
Key takeaways
- Production readiness requires infrastructure, observability, security, cost management. Demo agents work with clean inputs. Production agents handle adversarial inputs, flaky tools, scale challenges.
- AI observability is different: traditional monitoring tracks CPU/memory. AI observability adds hallucination detection, reasoning evaluation, multi-step workflow tracing, governance. OpenTelemetry is industry standard.
- Leading platforms: Langfuse (open-source, LangChain integration), Arize ($70M Series C, enterprise focus), LangSmith (official LangChain platform), Maxim AI and Galileo (evaluation and safety).
- Scaling strategies: horizontal scaling requires stateless design, semantic caching reduces LLM calls 30-60%, async processing for tasks > 30s, rate limiting prevents abuse.
- Security hardening: input sanitization (prevent injection), tool access controls (RBAC + data-level permissions), secrets management (rotate every 90 days), audit logging (immutable, retained per regulations).
- Cost optimization: compress prompts (20-30% savings), compress history (50-70% savings), hybrid models (60-80% savings), semantic caching (30-60% savings), prompt caching (50-90% on cached portions).
- Safe deployments: blue-green for instant rollback, canary with metrics gates (error rate, latency, cost), feature flags for prompt versioning, automated rollback on threshold violations.
Learn more: How it works · Why bundles beat raw thread history