Thread Transfer
Batch Processing vs Real-Time AI Inference
Batch saves 50-70% but requires latency tolerance. Real-time guarantees responsiveness at premium pricing. Most systems need both—here's the decision framework.
Jorgo Bardho
Founder, Thread Transfer
Batch processing costs 50-70% less than real-time inference across major providers—but only if you architect for it correctly. Understanding when to use batch vs. real-time determines whether you're paying $1,000/month or $200/month for the same throughput. This guide breaks down the economics, latency tradeoffs, and decision framework for 2025.
The fundamental cost difference
Real-time endpoints run 24/7, whether you're sending requests or not. You pay for idle capacity. Batch processing provisions compute only when processing jobs, then releases resources. You pay only for active processing time.
Example: A real-time SageMaker endpoint on ml.g5.xlarge costs $1.41/hour (about $1,030/month) running 24/7. The same instance in batch mode, running 6 hours/day for nightly jobs, costs about $254/month (75% savings). For predictable low-volume workloads, serverless inference can drop that $1,000/month endpoint to $200/month—or spike it to $2,000/month if the traffic pattern doesn't fit the serverless model.
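A quick sanity check on those figures, as a minimal sketch (the $1.41/hour rate and 6-hour nightly window are the illustrative numbers from this example, not live provider quotes):

```python
# Minimal cost sketch for the example above. The $1.41/hour rate and the
# 6-hour nightly window are illustrative figures, not live provider quotes.
HOURLY_RATE = 1.41        # ml.g5.xlarge on-demand, $/hour
HOURS_PER_MONTH = 730     # average hours in a month

always_on = HOURLY_RATE * HOURS_PER_MONTH   # endpoint runs 24/7
nightly_batch = HOURLY_RATE * 6 * 30        # 6 hours/night, 30 nights

print(f"Always-on endpoint: ${always_on:,.0f}/month")              # ~$1,029
print(f"Nightly batch:      ${nightly_batch:,.0f}/month")          # ~$254
print(f"Savings:            {1 - nightly_batch / always_on:.0%}")  # ~75%
```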
Provider pricing comparison: batch vs. real-time
Major cloud providers
| Provider | Batch Discount | Real-Time Pricing | Batch Pricing | Notes |
|---|---|---|---|---|
| AWS Bedrock | 50% | Standard on-demand | 50% off on-demand | Claude 3.5 Sonnet, Llama models |
| Google Cloud | Up to 70% | Standard compute rates | 30-70% discount | Off-peak scheduling eligible |
| Azure AI | 30-70% | Standard inference | Tiered by volume | Higher discounts for commitments |
| Together.ai | 50% | Real-time API | Batch API | Most serverless models |
LLM API providers (2025 pricing)
| Provider/Model | Real-Time Input | Real-Time Output | Batch Input | Batch Output | Savings |
|---|---|---|---|---|---|
| OpenAI GPT-4o | $2.50/M | $10.00/M | $1.25/M | $5.00/M | 50% |
| Anthropic Claude 3.5 Sonnet | $3.00/M | $15.00/M | $1.50/M | $7.50/M | 50% |
| Anthropic Claude 3 Haiku | $0.25/M | $1.25/M | $0.125/M | $0.625/M | 50% |
| Google Gemini 1.5 Pro | $1.25/M | $5.00/M | $0.625/M | $2.50/M | 50% |
Why batch processing is cheaper
1. No idle capacity costs
Real-time endpoints maintain warm instances for sub-second response times. A chatbot endpoint serving 1,000 requests/day still pays for 23.5 hours of idle time. Batch jobs spin up compute on-demand, process queued requests, and terminate. You pay for 30 minutes of processing, not 24 hours of standby.
2. Off-peak resource scheduling
Cloud providers charge 30-60% less for compute during off-peak hours (2am-6am in most regions). Batch jobs can schedule processing when rates are lowest. Real-time endpoints pay peak rates 24/7 to guarantee availability.
3. Higher hardware utilization
Batch processing groups requests together, maximizing GPU/TPU utilization (80-95% vs. 20-40% for sporadic real-time traffic). Providers pass savings through as batch discounts. Batching 100 requests into a single job pays the fixed per-request overhead (connection setup, model loading) once instead of 100 times.
4. Spot instance eligibility
Batch workloads tolerate interruptions—they can resume after spot instance preemption. Spot GPUs cost 60-90% less than on-demand. Real-time endpoints require guaranteed availability, disqualifying them from spot pricing.
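These effects compound. Here's a rough back-of-the-envelope model (all figures are illustrative assumptions, not provider quotes) showing how utilization and spot discounts multiply:

```python
# Rough model of effective compute cost per request. All figures are
# illustrative assumptions, not provider quotes.
def cost_per_request(hourly_rate: float, max_requests_per_hour: float,
                     utilization: float, spot_discount: float = 0.0) -> float:
    """Cost of one request when the instance is only `utilization` busy."""
    effective_rate = hourly_rate * (1 - spot_discount)
    served_per_hour = max_requests_per_hour * utilization
    return effective_rate / served_per_hour

GPU_RATE = 1.41      # $/hour on-demand (example figure)
CAPACITY = 3_600     # requests/hour the instance could serve if saturated

realtime = cost_per_request(GPU_RATE, CAPACITY, utilization=0.30)
batch = cost_per_request(GPU_RATE, CAPACITY, utilization=0.90, spot_discount=0.70)

print(f"Real-time, 30% utilization:         ${realtime:.5f}/request")
print(f"Batch, 90% utilization + 70% spot:  ${batch:.5f}/request")
print(f"Ratio: {realtime / batch:.0f}x cheaper per request in batch")
```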
Real-world cost scenarios
Scenario 1: Nightly data processing (clear win for batch)
Use case: Summarizing 10,000 support tickets nightly. Each ticket: 800 tokens input, 200 tokens output. Total daily volume: 8M input + 2M output tokens.
Real-time (Claude 3.5 Sonnet): (8M input × $3.00/M) + (2M output × $15.00/M) = $24 + $30 = $54/day = $1,620/month. Plus endpoint infrastructure: $1,030/month (ml.g5.xlarge). Total: $2,650/month.
Batch (Claude 3.5 Sonnet): (8M input × $1.50/M) + (2M output × $7.50/M) = $12 + $15 = $27/day = $810/month. No persistent endpoint cost. Total: $810/month.
Savings: 69% ($1,840/month).
Scenario 2: Real-time chatbot (batch not viable)
Use case: Customer support chatbot, 500 conversations/day, unpredictable timing. Average: 3,000 tokens input, 800 tokens output per conversation.
Real-time required: Users expect <2 second responses, and batch latency (minutes to hours) is unacceptable. Daily volume: 1.5M input + 400k output tokens.
Real-time (Claude 3 Haiku for cost): (1.5M input × $0.25/M) + (400k output × $1.25/M) = $0.375 + $0.50 = $0.875/day = $26/month. Endpoint: $200/month (serverless, optimized for bursty traffic). Total: $226/month.
Batch: Not applicable. Latency requirements mandate real-time.
Scenario 3: Hybrid approach (batch + real-time)
Use case: E-commerce product recommendations. Urgent: 2,000 real-time requests/day (user browsing). Deferrable: 50,000 batch requests/night (catalog enrichment, personalization model updates).
Real-time portion (GPT-4o): 2,000 requests × 1,500 tokens avg = 3M tokens/day. (2M input × $2.50/M) + (1M output × $10.00/M) = $5 + $10 = $15/day = $450/month.
Batch portion (GPT-4o batch): 50,000 requests × 2,000 tokens avg = 100M tokens/day. (60M input × $1.25/M) + (40M output × $5.00/M) = $75 + $200 = $275/day = $8,250/month.
Total: $8,700/month. If everything ran real-time, the batch portion alone would cost $550/day ($16,500/month), bringing the total to $16,950/month. The hybrid approach saves roughly 49% ($8,250/month).
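All three scenarios follow the same arithmetic. Here's a minimal calculator using the per-million-token rates from the table above; it reproduces the Scenario 3 numbers (30 billing days assumed):

```python
# Monthly token-cost calculator using the per-million-token rates from the
# pricing table above. Assumes 30 billing days per month.
def monthly_cost(input_tokens_per_day: float, output_tokens_per_day: float,
                 input_rate_per_m: float, output_rate_per_m: float,
                 days: int = 30) -> float:
    daily = (input_tokens_per_day / 1e6) * input_rate_per_m \
          + (output_tokens_per_day / 1e6) * output_rate_per_m
    return daily * days

# Scenario 3: GPT-4o hybrid ($2.50/$10.00 real-time, $1.25/$5.00 batch)
realtime_portion = monthly_cost(2e6, 1e6, 2.50, 10.00)                   # $450
batch_portion = monthly_cost(60e6, 40e6, 1.25, 5.00)                     # $8,250
all_realtime = realtime_portion + monthly_cost(60e6, 40e6, 2.50, 10.00)  # $16,950

hybrid = realtime_portion + batch_portion                                # $8,700
print(f"Hybrid:        ${hybrid:,.0f}/month")
print(f"All real-time: ${all_realtime:,.0f}/month")
print(f"Savings:       {1 - hybrid / all_realtime:.0%}")                 # ~49%
```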
Latency tradeoffs: when batch is viable
Batch latency characteristics
- Queue time: Requests accumulate until batch job triggers (5 minutes to 24 hours, configurable)
- Provisioning time: Spinning up compute resources (30 seconds to 5 minutes)
- Processing time: Actual inference (milliseconds per request, but batched)
- Total latency: Typically 5 minutes to 12 hours, depending on schedule
Use cases well-suited for batch
- Nightly ETL/processing: Daily summaries, reports, data enrichment
- Bulk content generation: SEO descriptions, product copy, email campaigns
- Offline analysis: Sentiment analysis on archives, document classification
- Model training data prep: Generating synthetic data, labeling, augmentation
- Scheduled personalization: Weekly recommendation updates, monthly insights
Use cases requiring real-time
- Interactive chat: Customer support, conversational AI
- Real-time recommendations: Product suggestions during active browsing
- Content moderation: Immediate filtering of user-generated content
- Live translation: Real-time language translation for calls/video
- Dynamic pricing: Per-request price calculations based on current context
Optimization strategies for each mode
Batch optimization
- Group by priority: High-priority batches (1-hour SLA) vs. low-priority (24-hour SLA). Pay premium for faster turnaround only when needed.
- Off-peak scheduling: Run large jobs 2am-6am for 30-60% compute discounts.
- Spot instances: Use preemptible VMs for non-urgent jobs (60-90% savings). Implement checkpointing for resilience.
- Batching granularity: Larger batches (10k+ requests) amortize overhead better but increase latency. Find the sweet spot for your SLAs.
- Format optimization: Pre-tokenize inputs, compress payloads, use JSONL for bulk uploads (reduces transfer costs).
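To make the JSONL point concrete, here's roughly what a bulk upload looks like with OpenAI's Batch API. Field names follow OpenAI's published batch format at the time of writing; check the current docs before relying on them, and swap in your own inputs.

```python
# Sketch: build a JSONL batch file and submit it to OpenAI's Batch API.
# Field names follow OpenAI's documented format at the time of writing.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tickets = ["Ticket text 1...", "Ticket text 2..."]  # placeholder inputs

with open("nightly_batch.jsonl", "w") as f:
    for i, ticket in enumerate(tickets):
        f.write(json.dumps({
            "custom_id": f"ticket-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "messages": [{"role": "user", "content": f"Summarize: {ticket}"}],
                "max_tokens": 200,
            },
        }) + "\n")

batch_file = client.files.create(file=open("nightly_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",  # batch pricing; results promised within 24 hours
)
print(batch.id, batch.status)
```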
Real-time optimization
- Serverless endpoints: For variable traffic (<50% utilization), serverless auto-scaling eliminates idle costs.
- Cache aggressively: Cache prompt prefixes (50-90% savings), cache responses for FAQ-style queries (near-zero marginal cost).
- Model routing: Route simple queries to cheaper models (Haiku, GPT-3.5) and complex ones to premium models (Sonnet, GPT-4) for a 40-60% blended cost reduction (see the sketch after this list).
- Prompt compression: Reduce token counts via bundling (Thread Transfer: 40-80% compression). Lowers per-request cost linearly.
- Connection pooling: Reuse HTTP connections, maintain persistent sessions to reduce overhead.
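Here is a minimal model-routing sketch. The length/keyword heuristic and model IDs are placeholders; a production router would use a cheap classifier or heuristics tuned on real traffic.

```python
# Minimal model-routing sketch. The complexity heuristic and model IDs are
# placeholders; tune both against your real traffic.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CHEAP_MODEL = "claude-3-haiku-20240307"
PREMIUM_MODEL = "claude-3-5-sonnet-20241022"

def pick_model(prompt: str) -> str:
    """Send short, simple prompts to the cheap model; everything else to premium."""
    looks_complex = len(prompt) > 2_000 or any(
        kw in prompt.lower() for kw in ("analyze", "step by step", "compare")
    )
    return PREMIUM_MODEL if looks_complex else CHEAP_MODEL

def answer(prompt: str) -> str:
    response = client.messages.create(
        model=pick_model(prompt),
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

print(answer("What are your support hours?"))  # routed to the cheap model
```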
Hybrid architectures: best of both worlds
Most production systems benefit from combining batch and real-time. Route latency-sensitive requests to real-time endpoints, defer everything else to batch. This maximizes cost efficiency without compromising UX.
Implementation pattern
Request classification: At ingestion, tag requests with priority (urgent/standard/low). Urgent → real-time API. Standard → 15-minute batch queue. Low → nightly batch.
Queue management: Use message queues (SQS, Pub/Sub) to accumulate batch requests. Trigger jobs when the queue reaches a threshold (e.g., 1,000 requests) or a timeout fires (e.g., 15 minutes), whichever comes first (sketched below).
Result delivery: Real-time: synchronous response. Batch: webhook callback or polling endpoint for result retrieval.
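A sketch of the size-or-timeout trigger from the queue management step. It's in-memory for clarity; in production the pending requests would live in SQS or Pub/Sub as described above.

```python
# Sketch of the "threshold or timeout, whichever comes first" batch trigger.
# In-memory for clarity; production systems would back this with SQS/Pub/Sub.
import time

class BatchAccumulator:
    def __init__(self, max_size: int = 1000, max_wait_seconds: float = 900):
        self.max_size = max_size
        self.max_wait = max_wait_seconds
        self.pending: list[dict] = []
        self.oldest_at: float | None = None

    def add(self, request: dict) -> list[dict] | None:
        """Enqueue a request; return a batch to dispatch if a trigger fired."""
        if not self.pending:
            self.oldest_at = time.monotonic()
        self.pending.append(request)
        return self.flush_if_ready()

    def flush_if_ready(self) -> list[dict] | None:
        size_hit = len(self.pending) >= self.max_size
        timeout_hit = (self.oldest_at is not None
                       and time.monotonic() - self.oldest_at >= self.max_wait)
        if size_hit or timeout_hit:
            batch, self.pending, self.oldest_at = self.pending, [], None
            return batch
        return None
```

A periodic timer should also call flush_if_ready() so the timeout fires even when no new requests arrive.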
Example: Thread Transfer's architecture
Thread Transfer uses batch processing for bundle compilation (compressing conversation histories). Users upload threads asynchronously; the backend batches 50-100 threads per job, processes them overnight, and delivers results in the morning via email/API. Cost: 50% of the real-time alternative. Latency: 8-12 hours. Acceptable for non-urgent context management.
For urgent bundles (sales calls, live support escalations), we offer a real-time tier at 2x cost with a <5 minute SLA. 90% of volume uses batch; 10% pays a premium for urgency.
Cost modeling: batch vs. real-time decision tree
Choose batch if:
- Latency SLA > 5 minutes
- Request volume predictable (can schedule efficiently)
- Traffic pattern allows batching (>100 requests/batch)
- Cost sensitivity > latency sensitivity
- Workload is deferrable (no user waiting)
Choose real-time if:
- Latency SLA < 5 seconds
- Requests arrive unpredictably (can't batch effectively)
- Volume too low to justify batch overhead (<100 requests/day)
- User experience depends on immediate response
- Revenue-critical path (e.g., checkout flow)
Choose hybrid if:
- Mixed latency requirements (some urgent, some deferrable)
- High total volume (>10k requests/day) with mixed priorities
- Cost optimization important but UX can't suffer
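The same decision tree as a function. Thresholds mirror the lists above; treat them as starting points rather than hard rules.

```python
# The decision tree above, expressed as a function. Thresholds mirror the
# lists above; treat them as starting points rather than hard rules.
def choose_mode(latency_sla_seconds: float, requests_per_day: int,
                deferrable_fraction: float) -> str:
    """Return 'real-time', 'batch', or 'hybrid' for a workload profile."""
    if latency_sla_seconds < 5:
        return "real-time"        # a user is actively waiting
    if requests_per_day < 100:
        return "real-time"        # too little volume to justify batch overhead
    if latency_sla_seconds > 300 and deferrable_fraction >= 0.9:
        return "batch"            # almost nothing is urgent
    return "hybrid"               # mixed priorities: split by request class

print(choose_mode(2, 500, 0.0))          # real-time (chatbot, Scenario 2)
print(choose_mode(43_200, 10_000, 1.0))  # batch (nightly processing, Scenario 1)
print(choose_mode(60, 52_000, 0.96))     # hybrid (e-commerce, Scenario 3)
```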
Future trends: serverless blurring the lines
Serverless inference (AWS Lambda, GCP Cloud Functions, Azure Container Instances) is eroding the batch vs. real-time distinction. Serverless scales to zero—no idle costs—while keeping latency acceptable for many interactive workloads once instances are warm.
For workloads with unpredictable traffic (<50% utilization), serverless can match batch economics while offering real-time responsiveness. Limitations: Cold starts (1-3 seconds), less control over hardware, higher per-request cost than dedicated batch.
Expect continued innovation here—providers competing on serverless performance, reducing cold-start penalties, and offering hybrid pricing (reserved capacity + burst serverless).
Monitoring and optimization checklist
- Track utilization: Real-time endpoint utilization <60%? Migrate deferrable workloads to batch or serverless.
- Measure latency distribution: P50, P95, P99. If requests can tolerate P95 latency above 5 minutes, batch is viable for those requests.
- Cost per request: Segment by endpoint type (real-time vs. batch). Identify high-cost outliers.
- SLA adherence: Batch jobs missing SLAs? Increase frequency or provision more compute. Real-time p99 > target? Add capacity.
- Request classification accuracy: Are "urgent" requests truly urgent? Misclassified requests inflate real-time costs unnecessarily.
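A small sketch covering the utilization and cost-per-request items from this checklist. The record fields are hypothetical; substitute whatever your observability stack actually emits.

```python
# Sketch of two checklist items: endpoint utilization and per-mode cost/latency.
# The record fields ('mode', 'cost', 'latency_s') are hypothetical.
from statistics import quantiles

def utilization(busy_hours: float, provisioned_hours: float) -> float:
    return busy_hours / provisioned_hours

def report(requests: list[dict]) -> None:
    for mode in ("real-time", "batch"):
        rows = [r for r in requests if r["mode"] == mode]
        if len(rows) < 2:
            continue
        cuts = quantiles([r["latency_s"] for r in rows], n=100)  # 99 cut points
        avg_cost = sum(r["cost"] for r in rows) / len(rows)
        print(f"{mode}: {len(rows)} requests, ${avg_cost:.4f}/request, "
              f"p50={cuts[49]:.1f}s, p95={cuts[94]:.1f}s, p99={cuts[98]:.1f}s")

if utilization(busy_hours=310, provisioned_hours=730) < 0.60:
    print("Endpoint below 60% utilization: move deferrable work to batch or serverless.")
```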
Closing thoughts
Batch processing offers 50-70% cost savings but requires latency tolerance. Real-time guarantees responsiveness but at premium pricing. Most production systems benefit from hybrid architectures—routing intelligently based on actual latency requirements.
Start by profiling your workload: latency SLAs, request volume, traffic patterns. Default to batch for deferrable work, real-time for user-facing paths. Monitor cost vs. performance tradeoffs and iterate. Combined with caching and bundling (Thread Transfer: 40-80% token reduction), teams routinely achieve 70-85% total infrastructure cost reduction.
Need help architecting batch/real-time routing for your LLM workload? Reach out.
Learn more: How it works · Why bundles beat raw thread history