
Prompt Caching Strategies for AI Cost Reduction

Proper caching architecture achieves 87% hit rates vs. 4% with poor implementation. Learn prefix-first structures, provider tradeoffs, and real-world cost calculations.

Jorgo Bardho

Founder, Thread Transfer

July 16, 2025 · 15 min read

ai costs · prompt caching · llm optimization · token economics
[Figure: prompt caching architecture and cost savings]

If you're sending the same system prompt, knowledge base, or conversation context to an LLM repeatedly, you're burning cash on redundant processing. Prompt caching lets providers reuse stable prefixes across requests, slashing costs by 50-90% and latency by 80-85%. OpenAI, Anthropic, and Google all support it—but implementation strategy determines whether you hit 87% cache rates or 4%.

How prompt caching works under the hood

LLMs process tokens sequentially, computing intermediate representations (key-value pairs in the attention mechanism) for each token. Recomputing those key-value representations for a long, unchanged prompt accounts for much of the prompt-processing cost and the time-to-first-token latency.

Prompt caching stores those intermediate representations on the provider's servers. When your next request starts with an identical token sequence, the provider reuses the cached computation instead of reprocessing from scratch. You pay a reduced cache read fee and skip thousands of token computations. Cache creation typically takes 2-4 seconds for large documents, but subsequent reads complete in milliseconds.

Provider comparison: OpenAI vs Anthropic vs Google

Each major provider implements caching differently. Understanding the tradeoffs determines which best fits your workload.

OpenAI (GPT-4o, GPT-4 Turbo)

Automatic caching. No code changes required. OpenAI caches prefixes automatically for prompts ≥1,024 tokens, with cache hits in 128-token increments. Cached input tokens cost 50% of standard rates: $1.25/M for GPT-4o vs. $2.50/M uncached.

TTL: 5-10 minutes of inactivity, maximum 1 hour. Best for high-volume workloads where requests arrive frequently enough to keep caches warm.
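Because OpenAI's caching is automatic, the main integration work is verifying it actually fires. A minimal sketch, assuming the official openai Python SDK and a stable system prompt; the STATIC_PREFIX content and the user message are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: in practice this is your >=1,024-token block of stable instructions/context.
STATIC_PREFIX = "You are a support agent for Acme (hypothetical). <long stable context here>"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": STATIC_PREFIX},                    # identical across requests
        {"role": "user", "content": "Summarize the customer's issue."},  # dynamic suffix
    ],
)

# The usage breakdown reports how many prompt tokens were served from cache.
details = response.usage.prompt_tokens_details
print(f"prompt tokens: {response.usage.prompt_tokens}, cached: {details.cached_tokens}")
```

If cached tokens stay at zero across repeated calls, the prefix is probably changing between requests or falling below the 1,024-token minimum.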

Anthropic (Claude 3.5+)

Explicit cache control. Requires code changes—you mark cacheable blocks via cache_control parameters. Supports up to 4 cache breakpoints per prompt for granular control.

Pricing (Claude 3.5 Sonnet): Cache writes cost $3.75/M (25% premium over standard $3/M input). Cache reads cost $0.30/M—90% discount. For the extended 1-hour TTL, cache writes jump to $6/M (2x the base input rate, vs. 1.25x for the 5-minute default).

Best for: Applications needing fine-grained control over what gets cached, or workloads with intermittent bursts where extended TTL justifies the 2x write cost.
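A minimal sketch of explicit cache control with the anthropic Python SDK: the large static system block is marked with cache_control, and the per-response usage object reports writes vs. reads. The knowledge base and ticket contents are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

KNOWLEDGE_BASE = "<~10k tokens of product docs and policies>"  # placeholder static content
ticket = "Customer reports login failures after updating to version 2.3."  # dynamic input

response = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a support agent. Answer strictly from the docs."},
        # Everything up to and including this block is cached (default TTL ~5 minutes).
        {"type": "text", "text": KNOWLEDGE_BASE, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": ticket}],  # dynamic suffix, never cached
)

# First call: cache_creation_input_tokens > 0; warm calls: cache_read_input_tokens > 0.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```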

Google (Gemini 2.5 Pro/Flash)

Context caching. Gemini 2.5 Flash/Pro support implicit caching (automatic, like OpenAI) for prompts ≥1,028 tokens (Flash) or ≥2,048 tokens (Pro). Explicit caching via the API is also available when you need guaranteed hits, and it is the only option on earlier models.

Pricing: Implicit caching adds no cache write or storage fees; cached tokens cost 0.25x standard input rates. Explicit caches are additionally billed for storage per token-hour. Default TTL: 1 hour (customizable for explicit caches).

Best for: Applications prioritizing simplicity and longer cache lifetimes without write premiums.

Quick comparison table

| Provider | Mode | Min Tokens | Cache Write | Cache Read | TTL |
| --- | --- | --- | --- | --- | --- |
| OpenAI GPT-4o | Automatic | 1,024 | $2.50/M (base rate, no premium) | $1.25/M (50% off) | 5-10 min |
| Anthropic Claude 3.5 Sonnet | Explicit | 1,024 | $3.75/M (25% premium) | $0.30/M (90% off) | 5 min (default) |
| Anthropic (extended TTL) | Explicit | 1,024 | $6.00/M (2x premium) | $0.30/M (90% off) | 1 hour |
| Google Gemini 2.5 Flash | Implicit | 1,028 | No write fee (base rate) | 0.25x base rate | 1 hour |

Architecture: prefix-first prompt structure

Caching only works on prompt prefixes: the provider matches tokens from the start of the prompt up to the first difference, so stable content must come first and everything after a changed token is reprocessed at full price. Structure prompts as:

  1. Static system instructions (role definition, output format, guidelines)
  2. Static knowledge/context (documentation, knowledge base, code repositories)
  3. Dynamic user input (queries, tickets, function arguments)

Example: A customer support agent prompt might start with 10k tokens of product documentation and policies (static), followed by 500 tokens of current ticket details (dynamic). Cache the 10k prefix and reuse it across 1,000 tickets: the static portion still totals 10M tokens, but after one cache write the remaining 999 requests read it at roughly 10% of the standard input rate (Anthropic), cutting the static-prefix cost by about 90%.
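A minimal, provider-agnostic sketch of prefix-first assembly; the instructions, the placeholder knowledge string, and the ticket text are illustrative:

```python
# Illustrative prompt assembly: stable content first, dynamic content last.
STATIC_INSTRUCTIONS = (
    "You are a support agent for Acme (hypothetical product). "
    "Answer only from the documentation below and respond in JSON."
)
STATIC_KNOWLEDGE = "<~10k tokens of product docs and policies go here>"  # placeholder

# Built once and kept byte-identical across requests so the provider can cache it.
CACHEABLE_PREFIX = STATIC_INSTRUCTIONS + "\n\n" + STATIC_KNOWLEDGE

def build_prompt(ticket_text: str) -> str:
    # Dynamic content goes strictly after the cached prefix.
    return CACHEABLE_PREFIX + "\n\n--- CURRENT TICKET ---\n" + ticket_text

print(build_prompt("Customer cannot reset password after the 2.3 update."))
```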

Common pitfalls that kill cache hit rates

1. Timestamp injection

Adding "Current time: 2025-07-16 14:32:15" to every prompt busts the cache—the prefix changes every second. Fix: Move timestamps to the dynamic suffix or round to nearest hour/day if sub-hour precision isn't critical.

2. Dynamic instructions in prefix

Tweaking prompt tone per request ("be creative" vs. "be concise") invalidates caching. Fix: Standardize instructions in the prefix and control behavior via temperature/top_p parameters instead.

3. Cache warming failures

Real-world testing shows cache hit rates as low as 4.2% without proper warm-up. When parallel requests fire simultaneously, none benefit from sibling caches because creation takes 2-4 seconds. Fix: Issue a minimal "warm-up" call (e.g., "Ready.") with the full prefix before launching parallel requests. Wait for creation to complete.
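A minimal warm-up sketch with the anthropic SDK, reusing the cache_control structure from earlier; the knowledge base content, question list, and pool size are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()
KNOWLEDGE_BASE = "<large static prefix, >=1,024 tokens>"  # placeholder

def ask(question: str, max_tokens: int = 512):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=max_tokens,
        system=[{"type": "text", "text": KNOWLEDGE_BASE,
                 "cache_control": {"type": "ephemeral"}}],
        messages=[{"role": "user", "content": question}],
    )

# 1) Warm-up: one cheap call writes the cache and blocks until creation completes.
ask("Ready.", max_tokens=1)

# 2) The parallel batch now reads the warm cache instead of racing to create it.
questions = [f"Question {i}" for i in range(20)]
with ThreadPoolExecutor(max_workers=10) as pool:
    answers = list(pool.map(ask, questions))
```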

4. Ignoring minimum thresholds

OpenAI requires ≥1,024 tokens to cache. Anthropic recommends 1,024+ for Sonnet, 2,048+ for Haiku. Prompts below these thresholds bypass caching entirely. Fix: Batch multiple knowledge snippets into a single cached prefix if your static content falls short.

5. Over-caching low-reuse content

If your "static" context changes every 10 requests, you pay cache write fees without read benefits. Target 10:1 read-to-write ratios minimum. Profile actual hit rates—aim for 70%+ to justify caching overhead.

Advanced: semantic caching for higher hit rates

Exact prefix matching only pays off when prefixes are byte-identical; in workloads where users make similar but differently worded requests, hit rates often land around 30%. Semantic caching converts queries to embeddings and matches on cosine similarity (0.85-0.95 threshold), boosting hit rates to 87%.

Implementation: Embed each prompt prefix on first request, store with cached response. For new requests, compute embedding and search cache for matches above similarity threshold. If found, reuse cached output. If not, process fresh and cache result with new embedding.
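A minimal semantic-cache sketch using OpenAI embeddings and a brute-force cosine-similarity lookup; in production you would swap the in-memory list for a vector index. The 0.9 threshold and the embedding model choice are assumptions:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str]] = []  # (normalized embedding, cached LLM response)

def _embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(resp.data[0].embedding)
    return vec / np.linalg.norm(vec)  # normalize so dot product == cosine similarity

def semantic_lookup(prompt: str, threshold: float = 0.9) -> str | None:
    query = _embed(prompt)
    best_score, best_answer = 0.0, None
    for emb, answer in _cache:
        score = float(query @ emb)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= threshold else None

def semantic_store(prompt: str, answer: str) -> None:
    _cache.append((_embed(prompt), answer))
```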

Tradeoff: Adds embedding computation cost (~$0.02/M tokens with text-embedding-3-small) but eliminates LLM calls for semantically similar requests. Best for Q&A systems where users phrase questions differently.

Real-world cost calculations

RAG system (document retrieval)

Scenario: 10 knowledge base documents (8,000 tokens total) retrieved for 1,000 queries. Query-specific input: 200 tokens each.

Without caching (Claude 3.5 Sonnet, $3/M input): 1,000 requests × 8,200 tokens × $3/M = $24.60

With caching: 1 cache write (8,000 tokens × $3.75/M) + 999 cache reads (999 × 8,000 tokens × $0.30/M) + dynamic input (1,000 × 200 tokens × $3/M) = $0.03 + $2.40 + $0.60 = $3.03

Savings: 88% ($21.57 saved)
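The same arithmetic as a small, reusable sketch; the rates and token counts are the scenario's assumptions, expressed per million tokens, and the helper name is ours:

```python
def cached_cost(static_tokens: int, dynamic_tokens: int, requests: int,
                input_rate: float, write_rate: float, read_rate: float) -> dict:
    """All rates are $ per million tokens; one cache write, then cache reads."""
    M = 1_000_000
    baseline = requests * (static_tokens + dynamic_tokens) * input_rate / M
    cached = (static_tokens * write_rate / M
              + (requests - 1) * static_tokens * read_rate / M
              + requests * dynamic_tokens * input_rate / M)
    return {"baseline": round(baseline, 2), "cached": round(cached, 2),
            "savings_pct": round(100 * (1 - cached / baseline), 1)}

# RAG scenario above: Claude 3.5 Sonnet rates ($3 input, $3.75 write, $0.30 read per M)
print(cached_cost(8_000, 200, 1_000, 3.00, 3.75, 0.30))
# -> roughly {'baseline': 24.6, 'cached': 3.03, 'savings_pct': 87.7}
```

The same function reproduces the two scenarios below by swapping in the relevant token counts and provider rates.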

Code analysis tool

Scenario: 50k token repository context, 200 function-specific queries. Dynamic input: 500 tokens each.

Without caching (GPT-4o, $2.50/M input): 200 requests × 50,500 tokens × $2.50/M = $25.25

With caching (OpenAI, no write premium): first request (50,000 tokens × $2.50/M) + 199 cache reads (199 × 50,000 tokens × $1.25/M) + dynamic input (200 × 500 tokens × $2.50/M) = $0.13 + $12.44 + $0.25 = $12.82

Savings: 49% ($12.43 saved)

Customer support automation (high volume)

Scenario: 12k token knowledge base, 5,000 tickets/day. Dynamic input: 300 tokens each.

Without caching: 5,000 requests × 12,300 tokens × $3/M = $184.50/day = $5,535/month

With caching (Anthropic extended TTL): assume 20 cache writes/day (the cache occasionally expires overnight and during lulls) + cache reads + dynamic input. Writes: 20 × 12,000 tokens × $6/M = $1.44. Reads: 4,980 × 12,000 tokens × $0.30/M = $17.93. Dynamic: 5,000 × 300 tokens × $3/M = $4.50. Total: $23.87/day = $716/month

Savings: 87% ($4,819/month saved)

Combining caching with Thread Transfer bundling

Thread Transfer compresses long conversations into structured bundles, reducing token counts by 40-80%. Pairing bundling with caching creates compounding savings.

Example: 20k token conversation → 4k token bundle (80% reduction) → cached as prefix for 100 follow-up queries. Without bundling or caching: 20k × 100 = 2M tokens. With bundling + caching: 4k (cache write) + 4k × 99 (cache reads at 10% cost) = 43.6k effective tokens. Effective reduction: 97.8%.

Monitoring and optimization checklist

  • Cache hit rate: Track reads / (reads + writes). Target 70%+ for cost-effectiveness. Rates below 50% suggest prompt drift or insufficient reuse (a minimal tracking sketch follows this checklist).
  • Cost per request: Segment by hit vs. miss. Identify which endpoints benefit most from caching.
  • Cache TTL alignment: For Anthropic, profile request intervals. If 90% arrive within 5 minutes, default TTL suffices. If 40% arrive 10-30 minutes apart, extended TTL justifies 2x write cost.
  • Prefix stability: Log cache IDs and version hashes. Alert on unexpected cache ID churn—signals unintended prompt changes.
  • Minimum threshold violations: Count requests below minimum cacheable size. Batch or restructure if >20% are sub-threshold.
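A minimal hit-rate tracking sketch for the first checklist item, assuming Anthropic's per-response usage fields (cache_read_input_tokens, cache_creation_input_tokens); the aggregation itself is provider-agnostic:

```python
from dataclasses import dataclass

@dataclass
class CacheStats:
    reads: int = 0   # tokens served from cache
    writes: int = 0  # tokens written to cache on a miss

    def record(self, usage) -> None:
        # Anthropic exposes these counters on response.usage.
        self.reads += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.writes += getattr(usage, "cache_creation_input_tokens", 0) or 0

    @property
    def hit_rate(self) -> float:
        total = self.reads + self.writes
        return self.reads / total if total else 0.0

stats = CacheStats()
# Call stats.record(response.usage) after each request; alert if stats.hit_rate < 0.7.
```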

When caching isn't worth it

Caching adds complexity. Skip it if:

  • Every request has unique context (personalized user histories, one-off analyses)
  • Request volume <100/day and static content changes frequently
  • Static prefix <1,024 tokens (below provider minimums)
  • Cache hit rates fall below 30% despite optimization

For these scenarios, focus on prompt compression, model routing (smaller models for simple tasks), or batch processing instead.

Implementation roadmap

  1. Audit prompts: Identify high-volume endpoints. Measure static vs. dynamic token ratios.
  2. Restructure prefixes: Move all static content (instructions, knowledge) to the beginning. Defer dynamic content to suffix.
  3. Choose provider strategy: Automatic (OpenAI/Gemini) for simplicity, explicit (Anthropic) for control.
  4. Implement cache warming: For parallel workloads, issue warm-up calls before main batch.
  5. Deploy and monitor: Track hit rates, cost per request, and TTL alignment. Iterate on prefix stability and batching.
  6. Combine with bundling: For conversation-heavy workloads, compress with Thread Transfer first, then cache bundles.

Closing thoughts

Prompt caching is the lowest-effort, highest-ROI optimization for LLM costs—no model changes, minimal code. Start with high-volume endpoints, restructure prompts for stable prefixes, and monitor hit rates. Combined with context compression (bundling) and smart routing, teams routinely achieve 70-90% total cost reductions while improving latency.

Need help architecting caching strategy for your workload? Reach out.