Thread Transfer
Batch Processing vs Real-Time AI Inference
Batch saves 50-70% but requires latency tolerance. Real-time guarantees responsiveness at premium pricing. Most systems need both—here's the decision framework.
Jorgo Bardho
Founder, Thread Transfer
Batch processing costs 50-70% less than real-time inference across major providers—but only if you architect for it correctly. Understanding when to use batch vs. real-time determines whether you're paying $1,000/month or $200/month for the same throughput. This guide breaks down the economics, latency tradeoffs, and decision framework for 2025.
The fundamental cost difference
Real-time endpoints run 24/7, whether you're sending requests or not. You pay for idle capacity. Batch processing provisions compute only when processing jobs, then releases resources. You pay only for active processing time.
Example: A real-time SageMaker endpoint on ml.g5.xlarge costs $1.41/hour (about $1,030/month) running 24/7. The same instance in batch mode, running 6 hours/day for nightly jobs, costs about $254/month (75% savings). For predictable low-volume workloads, serverless inference can drop that $1,000/month endpoint to $200/month—or spike it to $2,000/month if the traffic pattern doesn't fit the serverless model.
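A quick sanity check on those figures, as a minimal sketch (the $1.41/hour rate and 6-hour nightly window are the illustrative numbers from this example, not live provider quotes):

```python
# Minimal cost sketch for the example above. The $1.41/hour rate and the
# 6-hour nightly window are illustrative figures, not live provider quotes.
HOURLY_RATE = 1.41        # ml.g5.xlarge on-demand, $/hour
HOURS_PER_MONTH = 730     # average hours in a month

always_on = HOURLY_RATE * HOURS_PER_MONTH   # endpoint runs 24/7
nightly_batch = HOURLY_RATE * 6 * 30        # 6 hours/night, 30 nights

print(f"Always-on endpoint: ${always_on:,.0f}/month")              # ~$1,029
print(f"Nightly batch:      ${nightly_batch:,.0f}/month")          # ~$254
print(f"Savings:            {1 - nightly_batch / always_on:.0%}")  # ~75%
```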
Provider pricing comparison: batch vs. real-time
Major cloud providers
| Provider | Batch Discount | Real-Time Pricing | Batch Pricing | Notes |
|---|---|---|---|---|
| AWS Bedrock | 50% | Standard on-demand | 50% off on-demand | Claude 3.5 Sonnet, Llama models |
| Google Cloud | Up to 70% | Standard compute rates | 30-70% discount | Off-peak scheduling eligible |
| Azure AI | 30-70% | Standard inference | Tiered by volume | Higher discounts for commitments |
| Together.ai | 50% | Real-time API | Batch API | Most serverless models |
LLM API providers (2025 pricing)
| Provider/Model | Real-Time Input | Real-Time Output | Batch Input | Batch Output | Savings |
|---|---|---|---|---|---|
| OpenAI GPT-4o | $2.50/M | $10.00/M | $1.25/M | $5.00/M | 50% |
| Anthropic Claude 3.5 Sonnet | $3.00/M | $15.00/M | $1.50/M | $7.50/M | 50% |
| Anthropic Claude 3 Haiku | $0.25/M | $1.25/M | $0.125/M | $0.625/M | 50% |
| Google Gemini 1.5 Pro | $1.25/M | $5.00/M | $0.625/M | $2.50/M | 50% |
Why batch processing is cheaper
1. No idle capacity costs
Real-time endpoints maintain warm instances for sub-second response times. A chatbot endpoint serving 1,000 requests/day still pays for 23.5 hours of idle time. Batch jobs spin up compute on-demand, process queued requests, and terminate. You pay for 30 minutes of processing, not 24 hours of standby.
2. Off-peak resource scheduling
Cloud providers charge 30-60% less for compute during off-peak hours (2am-6am in most regions). Batch jobs can schedule processing when rates are lowest. Real-time endpoints pay peak rates 24/7 to guarantee availability.
3. Higher hardware utilization
Batch processing groups requests together, maximizing GPU/TPU utilization (80-95% vs. 20-40% for sporadic real-time traffic). Providers pass savings through as batch discounts. Batching 100 requests into a single job pays the fixed per-request overhead (connection setup, model loading) once instead of 100 times.
4. Spot instance eligibility
Batch workloads tolerate interruptions—they can resume after spot instance preemption. Spot GPUs cost 60-90% less than on-demand. Real-time endpoints require guaranteed availability, disqualifying them from spot pricing.
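These effects compound. Here's a rough back-of-the-envelope model (all figures are illustrative assumptions, not provider quotes) showing how utilization and spot discounts multiply:

```python
# Rough model of effective compute cost per request. All figures are
# illustrative assumptions, not provider quotes.
def cost_per_request(hourly_rate: float, max_requests_per_hour: float,
                     utilization: float, spot_discount: float = 0.0) -> float:
    """Cost of one request when the instance is only `utilization` busy."""
    effective_rate = hourly_rate * (1 - spot_discount)
    served_per_hour = max_requests_per_hour * utilization
    return effective_rate / served_per_hour

GPU_RATE = 1.41      # $/hour on-demand (example figure)
CAPACITY = 3_600     # requests/hour the instance could serve if saturated

realtime = cost_per_request(GPU_RATE, CAPACITY, utilization=0.30)
batch = cost_per_request(GPU_RATE, CAPACITY, utilization=0.90, spot_discount=0.70)

print(f"Real-time, 30% utilization:         ${realtime:.5f}/request")
print(f"Batch, 90% utilization + 70% spot:  ${batch:.5f}/request")
print(f"Ratio: {realtime / batch:.0f}x cheaper per request in batch")
```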
Real-world cost scenarios
Scenario 1: Nightly data processing (clear win for batch)
Use case: Summarizing 10,000 support tickets nightly. Each ticket: 800 tokens input, 200 tokens output. Total daily volume: 8M input + 2M output tokens.
Real-time (Claude 3.5 Sonnet): (8M input × $3.00/M) + (2M output × $15.00/M) = $24 + $30 = $54/day = $1,620/month. Plus endpoint infrastructure: $1,030/month (ml.g5.xlarge). Total: $2,650/month.
Batch (Claude 3.5 Sonnet): (8M input × $1.50/M) + (2M output × $7.50/M) = $12 + $15 = $27/day = $810/month. No persistent endpoint cost. Total: $810/month.
Savings: 69% ($1,840/month).
Scenario 2: Real-time chatbot (batch not viable)
Use case: Customer support chatbot, 500 conversations/day, unpredictable timing. Average: 3,000 tokens input, 800 tokens output per conversation.
Real-time required: Users expect <2 second responses, and batch latency (minutes to hours) is unacceptable. Daily volume: 1.5M input + 400k output tokens.
Real-time (Claude 3 Haiku for cost): (1.5M input × $0.25/M) + (400k output × $1.25/M) = $0.375 + $0.50 = $0.875/day = $26/month. Endpoint: $200/month (serverless, optimized for bursty traffic). Total: $226/month.
Batch: Not applicable. Latency requirements mandate real-time.
Scenario 3: Hybrid approach (batch + real-time)
Use case: E-commerce product recommendations. Urgent: 2,000 real-time requests/day (user browsing). Deferrable: 50,000 batch requests/night (catalog enrichment, personalization model updates).
Real-time portion (GPT-4o): 2,000 requests × 1,500 tokens avg = 3M tokens/day. (2M input × $2.50/M) + (1M output × $10.00/M) = $5 + $10 = $15/day = $450/month.
Batch portion (GPT-4o batch): 50,000 requests × 2,000 tokens avg = 100M tokens/day. (60M input × $1.25/M) + (40M output × $5.00/M) = $75 + $200 = $275/day = $8,250/month.
Total: $8,700/month. If everything ran real-time, the batch portion alone would cost $550/day ($16,500/month), bringing the total to $16,950/month. The hybrid approach saves roughly 49% ($8,250/month).
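All three scenarios follow the same arithmetic. Here's a minimal calculator using the per-million-token rates from the table above; it reproduces the Scenario 3 numbers (30 billing days assumed):

```python
# Monthly token-cost calculator using the per-million-token rates from the
# pricing table above. Assumes 30 billing days per month.
def monthly_cost(input_tokens_per_day: float, output_tokens_per_day: float,
                 input_rate_per_m: float, output_rate_per_m: float,
                 days: int = 30) -> float:
    daily = (input_tokens_per_day / 1e6) * input_rate_per_m \
          + (output_tokens_per_day / 1e6) * output_rate_per_m
    return daily * days

# Scenario 3: GPT-4o hybrid ($2.50/$10.00 real-time, $1.25/$5.00 batch)
realtime_portion = monthly_cost(2e6, 1e6, 2.50, 10.00)                   # $450
batch_portion = monthly_cost(60e6, 40e6, 1.25, 5.00)                     # $8,250
all_realtime = realtime_portion + monthly_cost(60e6, 40e6, 2.50, 10.00)  # $16,950

hybrid = realtime_portion + batch_portion                                # $8,700
print(f"Hybrid:        ${hybrid:,.0f}/month")
print(f"All real-time: ${all_realtime:,.0f}/month")
print(f"Savings:       {1 - hybrid / all_realtime:.0%}")                 # ~49%
```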
Latency tradeoffs: when batch is viable
Batch latency characteristics
- Queue time: Requests accumulate until batch job triggers (5 minutes to 24 hours, configurable)
- Provisioning time: Spinning up compute resources (30 seconds to 5 minutes)
- Processing time: Actual inference (milliseconds per request, but batched)
- Total latency: Typically 5 minutes to 12 hours, depending on schedule
Use cases well-suited for batch
- Nightly ETL/processing: Daily summaries, reports, data enrichment
- Bulk content generation: SEO descriptions, product copy, email campaigns
- Offline analysis: Sentiment analysis on archives, document classification
- Model training data prep: Generating synthetic data, labeling, augmentation
- Scheduled personalization: Weekly recommendation updates, monthly insights
Use cases requiring real-time
- Interactive chat: Customer support, conversational AI
- Real-time recommendations: Product suggestions during active browsing
- Content moderation: Immediate filtering of user-generated content
- Live translation: Real-time language translation for calls/video
- Dynamic pricing: Per-request price calculations based on current context
Optimization strategies for each mode
Batch optimization
- Group by priority: High-priority batches (1-hour SLA) vs. low-priority (24-hour SLA). Pay premium for faster turnaround only when needed.
- Off-peak scheduling: Run large jobs 2am-6am for 30-60% compute discounts.
- Spot instances: Use preemptible VMs for non-urgent jobs (60-90% savings). Implement checkpointing for resilience.
- Batching granularity: Larger batches (10k+ requests) amortize overhead better but increase latency. Find the sweet spot for your SLAs.
- Format optimization: Pre-tokenize inputs, compress payloads, use JSONL for bulk uploads (reduces transfer costs).
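To make the JSONL point concrete, here's roughly what a bulk upload looks like with OpenAI's Batch API. Field names follow OpenAI's published batch format at the time of writing; check the current docs before relying on them, and swap in your own inputs.

```python
# Sketch: build a JSONL batch file and submit it to OpenAI's Batch API.
# Field names follow OpenAI's documented format at the time of writing.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tickets = ["Ticket text 1...", "Ticket text 2..."]  # placeholder inputs

with open("nightly_batch.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        f.write(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": f"Summarize: {ticket}"}],
                "max_tokens": 200,
            },
        }) + "\n")

batch_file = client.files.create(file=open("nightly_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batch pricing; results promised within 24 hours
)
print(batch.id, batch.status)
```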
Real-time optimization
- Serverless endpoints: For variable traffic (<50% utilization), serverless auto-scaling eliminates idle costs.
- Cache aggressively: Cache prompt prefixes (50-90% savings), cache responses for FAQ-style queries (near-zero marginal cost).
- Model routing: Route simple queries to cheaper models (Haiku, GPT-3.5) and complex ones to premium models (Sonnet, GPT-4) for a 40-60% blended cost reduction (see the sketch after this list).
- Prompt compression: Reduce token counts via bundling (Thread Transfer: 40-80% compression). Lowers per-request cost linearly.
- Connection pooling: Reuse HTTP connections, maintain persistent sessions to reduce overhead.
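Here is a minimal model-routing sketch. The length/keyword heuristic and model IDs are placeholders; a production router would use a cheap classifier or heuristics tuned on real traffic.

```python
# Minimal model-routing sketch. The complexity heuristic and model IDs are
# placeholders; tune both against your real traffic.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP_MODEL = "claude-3-haiku-20240307"
PREMIUM_MODEL = "claude-3-5-sonnet-20241022"

def pick_model(prompt: str) -> str:
    """Send short, simple prompts to the cheap model; everything else to premium."""
    looks_complex = len(prompt) > 2_000 or any(
        kw in prompt.lower() for kw in ("analyze", "step by step", "compare")
    )
    return PREMIUM_MODEL if looks_complex else CHEAP_MODEL

def answer(prompt: str) -> str:
    response = client.messages.create(
        model=pick_model(prompt),
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(answer("What are your support hours?"))  # routed to the cheap model
```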
Hybrid architectures: best of both worlds
Most production systems benefit from combining batch and real-time. Route latency-sensitive requests to real-time endpoints, defer everything else to batch. This maximizes cost efficiency without compromising UX.
Implementation pattern
Request classification: At ingestion, tag requests with priority (urgent/standard/low). Urgent → real-time API. Standard → 15-minute batch queue. Low → nightly batch.
Queue management: Use message queues (SQS, Pub/Sub) to accumulate batch requests. Trigger jobs when the queue reaches a threshold (e.g., 1,000 requests) or a timeout fires (e.g., 15 minutes), whichever comes first (sketched below).
Result delivery: Real-time: synchronous response. Batch: webhook callback or polling endpoint for result retrieval.
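A sketch of the size-or-timeout trigger from the queue management step. It's in-memory for clarity; in production the pending requests would live in SQS or Pub/Sub as described above.

```python
# Sketch of the "threshold or timeout, whichever comes first" batch trigger.
# In-memory for clarity; production systems would back this with SQS/Pub/Sub.
import time

class BatchAccumulator:
    def __init__(self, max_size: int = 1000, max_wait_seconds: float = 900):
        self.max_size = max_size
        self.max_wait = max_wait_seconds
        self.pending: list[dict] = []
        self.oldest_at: float | None = None

    def add(self, request: dict) -> list[dict] | None:
        """Enqueue a request; return a batch to dispatch if a trigger fired."""
        if not self.pending:
            self.oldest_at = time.monotonic()
        self.pending.append(request)
        return self.flush_if_ready()

    def flush_if_ready(self) -> list[dict] | None:
        size_hit = len(self.pending) >= self.max_size
        timeout_hit = (self.oldest_at is not None
                       and time.monotonic() - self.oldest_at >= self.max_wait)
        if size_hit or timeout_hit:
            batch, self.pending, self.oldest_at = self.pending, [], None
            return batch
        return None
```

A periodic timer should also call flush_if_ready() so the timeout fires even when no new requests arrive.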
Example: Thread Transfer's architecture
Thread Transfer uses batch processing for bundle compilation (compressing conversation histories). Users upload threads asynchronously; the backend batches 50-100 threads per job, processes them overnight, and delivers results in the morning via email/API. Cost: 50% of the real-time alternative. Latency: 8-12 hours. Acceptable for non-urgent context management.
For urgent bundles (sales calls, live support escalations), we offer a real-time tier at 2x cost with a <5 minute SLA. 90% of volume uses batch; 10% pays a premium for urgency.
Cost modeling: batch vs. real-time decision tree
Choose batch if:
- Latency SLA > 5 minutes
- Request volume predictable (can schedule efficiently)
- Traffic pattern allows batching (>100 requests/batch)
- Cost sensitivity > latency sensitivity
- Workload is deferrable (no user waiting)
Choose real-time if:
- Latency SLA < 5 seconds
- Requests arrive unpredictably (can't batch effectively)
- Volume too low to justify batch overhead (<100 requests/day)
- User experience depends on immediate response
- Revenue-critical path (e.g., checkout flow)
Choose hybrid if:
- Mixed latency requirements (some urgent, some deferrable)
- High total volume (>10k requests/day) with mixed priorities
- Cost optimization important but UX can't suffer
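The same decision tree as a function. Thresholds mirror the lists above; treat them as starting points rather than hard rules.

```python
# The decision tree above, expressed as a function. Thresholds mirror the
# lists above; treat them as starting points rather than hard rules.
def choose_mode(latency_sla_seconds: float, requests_per_day: int,
                deferrable_fraction: float) -> str:
    """Return 'real-time', 'batch', or 'hybrid' for a workload profile."""
    if latency_sla_seconds < 5:
        return "real-time"        # a user is actively waiting
    if requests_per_day < 100:
        return "real-time"        # too little volume to justify batch overhead
    if latency_sla_seconds > 300 and deferrable_fraction >= 0.9:
        return "batch"            # almost nothing is urgent
    return "hybrid"               # mixed priorities: split by request class

print(choose_mode(2, 500, 0.0))          # real-time (chatbot, Scenario 2)
print(choose_mode(43_200, 10_000, 1.0))  # batch (nightly processing, Scenario 1)
print(choose_mode(60, 52_000, 0.96))     # hybrid (e-commerce, Scenario 3)
```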
Future trends: serverless blurring the lines
Serverless inference (AWS Lambda, GCP Cloud Functions, Azure Container Instances) is eroding the batch vs. real-time distinction. Serverless scales to zero—no idle costs—while keeping latency acceptable for many interactive workloads once instances are warm.
For workloads with unpredictable traffic (<50% utilization), serverless can match batch economics while offering real-time responsiveness. Limitations: Cold starts (1-3 seconds), less control over hardware, higher per-request cost than dedicated batch.
Expect continued innovation here—providers competing on serverless performance, reducing cold-start penalties, and offering hybrid pricing (reserved capacity + burst serverless).
Monitoring and optimization checklist
- Track utilization: Real-time endpoint utilization <60%? Migrate deferrable workloads to batch or serverless.
- Measure latency distribution: P50, P95, P99. If requests can tolerate P95 latency above 5 minutes, batch is viable for those requests.
- Cost per request: Segment by endpoint type (real-time vs. batch). Identify high-cost outliers.
- SLA adherence: Batch jobs missing SLAs? Increase frequency or provision more compute. Real-time p99 > target? Add capacity.
- Request classification accuracy: Are "urgent" requests truly urgent? Misclassified requests inflate real-time costs unnecessarily.
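A small sketch covering the utilization and cost-per-request items from this checklist. The record fields are hypothetical; substitute whatever your observability stack actually emits.

```python
# Sketch of two checklist items: endpoint utilization and per-mode cost/latency.
# The record fields ('mode', 'cost', 'latency_s') are hypothetical.
from statistics import quantiles

def utilization(busy_hours: float, provisioned_hours: float) -> float:
    return busy_hours / provisioned_hours

def report(requests: list[dict]) -> None:
    for mode in ("real-time", "batch"):
        rows = [r for r in requests if r["mode"] == mode]
        if len(rows) < 2:
            continue
        cuts = quantiles([r["latency_s"] for r in rows], n=100)  # 99 cut points
        avg_cost = sum(r["cost"] for r in rows) / len(rows)
        print(f"{mode}: {len(rows)} requests, ${avg_cost:.4f}/request, "
              f"p50={cuts[49]:.1f}s, p95={cuts[94]:.1f}s, p99={cuts[98]:.1f}s")

if utilization(busy_hours=310, provisioned_hours=730) < 0.60:
    print("Endpoint below 60% utilization: move deferrable work to batch or serverless.")
```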
Closing thoughts
Batch processing offers 50-70% cost savings but requires latency tolerance. Real-time guarantees responsiveness but at premium pricing. Most production systems benefit from hybrid architectures—routing intelligently based on actual latency requirements.
Start by profiling your workload: latency SLAs, request volume, traffic patterns. Default to batch for deferrable work, real-time for user-facing paths. Monitor cost vs. performance tradeoffs and iterate. Combined with caching and bundling (Thread Transfer: 40-80% token reduction), teams routinely achieve 70-85% total infrastructure cost reduction.
Need help architecting batch/real-time routing for your LLM workload? Reach out.
Learn more: How it works · Why bundles beat raw thread history