Thread Transfer
Context caching explained: Reduce token costs by 90%
Cached tokens cost 50-90% less. We explain how caching works and share patterns that maximize your hit rate.
Jorgo Bardho
Founder, Thread Transfer
If you're sending the same system prompt, knowledge base excerpt, or conversation history to an LLM repeatedly, you're burning money. Context caching lets providers reuse prompt prefixes across requests, slashing costs by 50-90% on cached portions. OpenAI and Anthropic both support it—here's how to architect your prompts for maximum cache hits and why it matters.
How context caching works
LLMs process tokens sequentially. When you send a prompt, the model computes intermediate representations (key-value pairs in the attention mechanism) for every token. These computations are expensive—they account for most of the per-token cost.
Context caching stores those intermediate representations on the provider's side. If your next request starts with the same token sequence, the provider reuses the cached computation instead of reprocessing from scratch. You pay a discounted cache rate (50-90% cheaper than full processing, depending on the provider) and avoid recomputing thousands of tokens.
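To make the mechanics concrete, here's a toy sketch, purely illustrative and not how any provider actually implements it: the expensive prefill work is keyed on the exact prefix, so a second request with the same prefix reuses it and only the new tail is computed.

```python
# Toy illustration only -- not a real provider implementation.
# The expensive prefill is keyed on the exact prefix, so a request that starts
# with the same token sequence reuses it and only pays for the new tail.
import hashlib

_prefix_cache: dict[str, str] = {}

def _expensive_prefill(prefix: str) -> str:
    # Stand-in for computing attention key/value states for every prefix token.
    return f"<kv-state for {len(prefix)} chars>"

def process(prompt: str, static_prefix_len: int) -> str:
    prefix, tail = prompt[:static_prefix_len], prompt[static_prefix_len:]
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key not in _prefix_cache:                 # cache miss: pay full price once
        _prefix_cache[key] = _expensive_prefill(prefix)
    state = _prefix_cache[key]                   # cache hit: cheap read
    return f"generate(from={state}, new_input={tail!r})"
```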
Provider support: OpenAI vs Anthropic
OpenAI (GPT-4o, GPT-4 Turbo). Cached input tokens cost 50% of standard rates: $1.25/M for GPT-4o instead of $2.50/M. Caching is automatic for prompts of 1,024+ tokens; cached prefixes typically persist through 5-10 minutes of inactivity, up to about an hour. No special API calls required—just structure your prompts consistently.
Anthropic (Claude 3.5+). A steeper discount, with explicit control. Cache writes cost $3.75/M (a 25% premium over the standard $3/M input rate). Cache reads cost $0.30/M—a 90% discount. Caches expire after 5 minutes of inactivity, and each hit refreshes the TTL. You explicitly mark cacheable blocks via API parameters, giving you fine-grained control.
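OpenAI's caching needs no code changes, but Anthropic's explicit marking looks roughly like this. A minimal sketch using the anthropic Python SDK; the model name and knowledge-base placeholder are illustrative, and depending on your SDK version you may still need the prompt-caching beta flag, so check the current docs.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

KNOWLEDGE_BASE = "...policies, product docs, FAQ -- 1024+ tokens of static text..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": KNOWLEDGE_BASE,
            "cache_control": {"type": "ephemeral"},  # mark this block as cacheable
        }
    ],
    messages=[{"role": "user", "content": "What is our refund policy?"}],
)
# usage reports cache_creation_input_tokens (writes) and cache_read_input_tokens (reads)
print(response.usage)
```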
Architecting prompts for cache hits
Rule 1: Prefix, don't suffix. Caching only works on the beginning of the prompt. Structure prompts as: [static system instructions] + [static knowledge/context] + [dynamic user query]. The static portion gets cached; the dynamic tail is processed fresh each time.
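A sketch of that ordering (the names are illustrative): everything stable goes first, byte-for-byte identical on every call, and the per-request query is appended last.

```python
# Illustrative prompt assembly: stable content first, per-request content last.
SYSTEM_INSTRUCTIONS = "You are a support agent for Acme. Follow the policies below."
KNOWLEDGE_BASE = "...policies, product docs, FAQ (identical on every request)..."

def build_messages(user_query: str) -> list[dict]:
    return [
        # Cacheable prefix: byte-for-byte identical across requests.
        {"role": "system", "content": SYSTEM_INSTRUCTIONS + "\n\n" + KNOWLEDGE_BASE},
        # Dynamic tail: processed fresh each time.
        {"role": "user", "content": user_query},
    ]
```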
Rule 2: Stabilize your prefix. Every time you change the cached portion—even a single token—you bust the cache. Lock down system prompts, knowledge base versions, and document templates. Versioning helps: append a version hash so you can track when prompts drift.
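One lightweight way to do that versioning is to fingerprint the assembled prefix and log it with every request (a sketch; where you record it is up to you):

```python
import hashlib

def prefix_version(static_prefix: str) -> str:
    """Short fingerprint of the cacheable prefix; log it alongside every request."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:12]

# If this value changes between deploys, the cache goes cold until the new prefix
# is written: expected after an intentional update, a sign of prompt drift otherwise.
print(prefix_version("You are a support agent...\n\n...knowledge base v7..."))
```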
Rule 3: Batch similar requests. If you're processing 1000 queries against the same knowledge base, ensure all requests use identical prefixes. Randomized preambles or timestamps embedded in prompts kill cache efficiency.
Rule 4: Exceed minimum thresholds. OpenAI requires 1024+ tokens to cache. Anthropic enforces per-model minimums in the same range (1,024 tokens on Claude 3.5 Sonnet, higher on smaller models). If your static context is under that, consider batching multiple knowledge snippets into a single cached prefix.
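Before relying on caching, it's worth checking locally that your static context clears the threshold. A sketch using the tiktoken tokenizer (exact counts differ for non-OpenAI models, so treat the number as an estimate):

```python
import tiktoken

MIN_CACHEABLE_TOKENS = 1024  # OpenAI's floor; Anthropic's per-model minimums are similar

def is_cacheable(static_prefix: str) -> bool:
    enc = tiktoken.get_encoding("o200k_base")  # GPT-4o tokenizer; approximate for other models
    n = len(enc.encode(static_prefix))
    print(f"static prefix: {n} tokens")
    return n >= MIN_CACHEABLE_TOKENS
```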
Common use cases and savings math
RAG (Retrieval-Augmented Generation). You retrieve the same 10 documents (8,000 tokens) for 500 queries, each adding ~200 dynamic tokens. Without caching (Claude 3.5 Sonnet at $3/M input): 500 × 8,200 tokens ≈ 4.1M tokens ≈ $12.30. With caching: one cache write (8,000 × $3.75/M ≈ $0.03) + 499 cache reads (≈4M × $0.30/M ≈ $1.20) + dynamic tokens (100k × $3/M = $0.30) ≈ $1.53. About 88% savings.
Customer support agents. The same knowledge base prefix (policies, product docs) is sent with every ticket. Prefix: 12,000 tokens. Daily volume: 2,000 tickets. Monthly prefix cost without caching: ~720M tokens × $3/M ≈ $2,160. With caching, this volume keeps the cache warm, so nearly every request is a read: ~720M × $0.30/M ≈ $216, plus a handful of daily cache writes (≈$25), for roughly $240 per month. About 89% savings.
Code analysis. Repository context (file tree, dependencies, common modules) stays fixed while you analyze different functions. 50k-token repo context, 100 queries. Standard cost: 5M tokens × $3/M = $15. Cached: one write (≈$0.19) + 99 reads (≈$1.49) ≈ $1.68. Roughly 89% savings.
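All three examples follow the same simple cost model. Here's a sketch you can adapt to your own volumes; the rates are Claude 3.5 Sonnet's published per-million-token prices, and output tokens are ignored.

```python
# Rough input-token cost model for Anthropic-style caching (USD per million tokens).
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30

def caching_cost(prefix_tokens: int, dynamic_tokens: int, requests: int,
                 cache_writes: int = 1) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in dollars, input tokens only."""
    uncached = requests * (prefix_tokens + dynamic_tokens) * INPUT / 1e6
    cached = (cache_writes * prefix_tokens * CACHE_WRITE
              + (requests - cache_writes) * prefix_tokens * CACHE_READ
              + requests * dynamic_tokens * INPUT) / 1e6
    return uncached, cached

print(caching_cost(prefix_tokens=8_000, dynamic_tokens=200, requests=500))  # ~(12.3, 1.53) -- RAG
print(caching_cost(prefix_tokens=50_000, dynamic_tokens=0, requests=100))   # ~(15.0, 1.67) -- code analysis
```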
Common pitfalls and how to avoid them
- Timestamp injection. Adding "Current time: 2025-03-10 14:32:15" to every prompt busts the cache. Move timestamps to the dynamic suffix, or round to the nearest hour/day if precision isn't critical (see the sketch after this list).
- Dynamic instructions. Tweaking prompt instructions per request ("be creative" vs. "be concise") invalidates caching. Standardize instructions in the prefix and control behavior via separate parameters.
- Ignoring TTL. Anthropic caches expire after 5 minutes of inactivity. For low-volume workloads, you might miss the window. Batch requests when possible or accept cache misses for sporadic queries.
- Over-caching. If your "static" context changes every 10 requests, you're paying cache write fees without reaping read benefits. Profile hit rates: aim for 10:1 reads-to-writes or better.
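For the timestamp pitfall above, the fix is mechanical: keep the timestamp out of the cached prefix and coarsen it where you can. A sketch (the names are illustrative):

```python
from datetime import datetime, timezone

STATIC_PREFIX = "...system instructions + knowledge base (never changes)..."

def build_prompt(user_query: str) -> list[dict]:
    # Round the timestamp to the hour so even the dynamic tail varies as little as possible.
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:00 UTC")
    return [
        {"role": "system", "content": STATIC_PREFIX},  # cacheable prefix, timestamp-free
        {"role": "user", "content": f"Current time: {now}\n\n{user_query}"},  # dynamic tail
    ]
```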
Combining caching with bundling
Thread-Transfer bundles compress long conversations into structured, deterministic context. When you pair bundling with caching, you get compounding savings: bundles reduce the overall token count (40-80%), and caching eliminates redundant processing on the remaining tokens (50-90%).
Example: a 20k-token conversation compresses to a 4k-token bundle (80% reduction), which is then cached as the prefix. Over 100 queries, you process those 4k tokens at full price once and reuse them 99 times. Ignoring the small cache read fee, that's an effective ~40 prefix tokens per query instead of 20,000—a 500x improvement.
Monitoring cache efficiency
Track these metrics: cache hit rate (requests served from cache ÷ total requests that should have hit it), cost per request (segmented by hit vs. miss), and cache write frequency (how often prefixes change). Set alerts for hit rates dropping below 70%; a sustained drop usually signals prompt drift or versioning issues.
Log cache metadata (cache IDs, TTLs) alongside request traces so you can correlate performance and cost anomalies with cache behavior.
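A sketch of that bookkeeping, assuming you log the cache token counts each provider returns in its usage metadata (Anthropic's cache_read_input_tokens and cache_creation_input_tokens; OpenAI reports cached_tokens). The RequestLog shape is illustrative.

```python
from dataclasses import dataclass

@dataclass
class RequestLog:
    cache_read_tokens: int    # e.g. Anthropic usage.cache_read_input_tokens
    cache_write_tokens: int   # e.g. Anthropic usage.cache_creation_input_tokens
    cost_usd: float

def cache_metrics(logs: list[RequestLog]) -> dict:
    hits = [r for r in logs if r.cache_read_tokens > 0]
    misses = [r for r in logs if r.cache_read_tokens == 0]
    writes = sum(1 for r in logs if r.cache_write_tokens > 0)
    return {
        "hit_rate": len(hits) / len(logs) if logs else 0.0,
        "reads_to_writes": len(hits) / writes if writes else float("inf"),
        "avg_cost_per_hit": sum(r.cost_usd for r in hits) / len(hits) if hits else 0.0,
        "avg_cost_per_miss": sum(r.cost_usd for r in misses) / len(misses) if misses else 0.0,
    }
```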
When caching isn't worth it
If every request has unique context (personalized user histories, one-off document analysis), caching adds complexity without savings. Focus on prompt compression and model routing instead. Caching shines when you have shared, stable context reused across many requests.
Closing thoughts
Context caching is the lowest-hanging fruit for LLM cost reduction—no changes to your application logic, just prompt structure. Start by identifying high-volume endpoints with repeated context. Restructure prompts to move stable content to the prefix. Monitor hit rates and iterate. Combined with bundling and routing, you'll cut costs by 70-90% while improving latency.
Need help auditing your prompts for cache efficiency? Reach out.
Learn more: How it works · Why bundles beat raw thread history