Thread Transfer
How context compression saves 60-80% on LLM costs
800 tokens down to 40 without losing meaning. We break down compression techniques that actually work in production.
Jorgo Bardho
Founder, Thread Transfer
LLM costs scale with tokens. Send 10,000 tokens per request instead of 50,000 and you cut your bill by 80%. Context compression makes this possible. Tools like LLMLingua can shrink prompts by 20x without losing meaning, and production teams are reporting 60-80% cost savings.
Types of compression
Not all compression is the same. Here are the main techniques, ordered from simplest to most sophisticated:
1. Manual summarization
Use a smaller, cheaper model (GPT-3.5, Claude Haiku) to summarize long documents before passing them to a larger model. Common pattern (sketched in code below):
- User uploads a 50-page PDF.
- Chunk it into sections, summarize each with a small model (~$0.0001/section).
- Concatenate summaries and pass to the large model for final analysis.
Pros: Simple, controllable, works for any content type.
Cons: Lossy. Summaries miss nuance. Multi-pass increases latency.
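A minimal sketch of this pattern, assuming the openai Python SDK; the chunk size, model names, and prompts are illustrative choices, not a prescribed setup:

```python
# Sketch: summarize chunks with a cheap model, then analyze with a large one.
# Assumes the `openai` Python SDK; model names and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def chunk(text: str, max_chars: int = 8000) -> list[str]:
    # Naive fixed-size chunking; swap in section-aware splitting for real PDFs.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def summarize(section: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # small, cheap model for the map step
        messages=[{"role": "user",
                   "content": f"Summarize the key facts and decisions:\n\n{section}"}],
    )
    return resp.choices[0].message.content

def analyze(document: str, question: str) -> str:
    summaries = "\n\n".join(summarize(s) for s in chunk(document))
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # large model only sees the compressed summaries
        messages=[{"role": "user",
                   "content": f"{question}\n\nContext:\n{summaries}"}],
    )
    return resp.choices[0].message.content
```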
2. Semantic compression (LLMLingua, LongLLMLingua)
LLMLingua uses a small language model to identify and remove low-importance tokens from a prompt while preserving semantic meaning. It compresses prompts by up to 20x (e.g., 800 tokens → 40 tokens) with minimal performance degradation.
How it works:
- Train or fine-tune a small LM to predict which tokens are essential for downstream task performance.
- Run the prompt through the compression model. It assigns importance scores to each token.
- Drop low-scoring tokens, keeping only the high-signal parts.
- Pass the compressed prompt to the target LLM (GPT-4, Claude, etc.).
Microsoft Research's LLMLingua achieves 2-5% accuracy loss at 10x compression, and often zero loss at 2-3x.
Pros: Massive token reduction, minimal accuracy loss, works on any text.
Cons: Requires running a separate model. Adds ~50-200ms latency.
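A minimal sketch of the compression step, assuming the llmlingua package's documented PromptCompressor interface; the instruction, question, and token target are illustrative:

```python
# Sketch: token-level compression with LLMLingua before calling the target LLM.
# Assumes the `llmlingua` package; arguments follow its documented API.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # loads a small LM used for importance scoring

long_context = "<the long document or transcript you want to shrink>"

result = compressor.compress_prompt(
    long_context,                 # the bulky part of the prompt
    instruction="Answer the question from the context.",
    question="What did the customer decide?",
    target_token=400,             # aim for roughly 400 tokens of compressed context
)

compressed_prompt = result["compressed_prompt"]
print(result["origin_tokens"], "->", result["compressed_tokens"], "tokens")
# Pass `compressed_prompt` to GPT-4, Claude, etc. as usual.
```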
3. Prompt optimization (DSPy, PromptPerfect)
Tools like DSPy automatically rewrite prompts to be more concise without changing intent. They strip filler words, consolidate repetitive instructions, and structure output constraints more efficiently.
Example: A 300-token prompt with verbose instructions can often be rewritten to 120 tokens by removing redundancy and tightening language.
Pros: One-time optimization. No runtime overhead.
Cons: Requires manual review. Effectiveness varies by domain.
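One way to quantify the win is to count tokens before and after tightening. A minimal sketch assuming the tiktoken package; both prompts are made-up examples:

```python
# Sketch: measure how much a hand-tightened (or tool-optimized) prompt saves.
# Assumes the `tiktoken` package; the prompts are illustrative.
import tiktoken

verbose = (
    "You are a helpful assistant. Please make sure that you always carefully "
    "read the user's input, and then, after reading it, provide a response that "
    "is accurate, concise, and formatted as valid JSON with the keys 'summary' "
    "and 'sentiment'. It is very important that you do not add any extra keys."
)
tight = "Return valid JSON with exactly two keys: 'summary' and 'sentiment'."

enc = tiktoken.encoding_for_model("gpt-4")
print(len(enc.encode(verbose)), "->", len(enc.encode(tight)), "tokens")
```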
4. Structured distillation (bundles)
Instead of compressing raw text, distill conversations into structured bundles that capture decisions, outcomes, and key facts while discarding filler.
Thread-Transfer bundles are a form of semantic compression: they take a 50-message thread (10k tokens) and produce a structured summary (2k tokens) that preserves the signal. Unlike LLMLingua, which operates on tokens, bundles operate on meaning—they extract decisions, timelines, and outcomes into reusable blocks.
Pros: Human-readable, portable, auditable. Can be versioned and reused.
Cons: Requires upfront distillation. Not suitable for real-time compression.
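A hypothetical shape for such a bundle (not the actual Thread-Transfer format), shown as a small dataclass that renders back into a compact context block:

```python
# Sketch: a structured bundle distilled from a long thread.
# This shape is hypothetical, not the actual Thread-Transfer format.
from dataclasses import dataclass, field

@dataclass
class Bundle:
    thread_id: str
    decisions: list[str] = field(default_factory=list)   # what was agreed
    timeline: list[str] = field(default_factory=list)    # key events, in order
    outcomes: list[str] = field(default_factory=list)    # results and open items
    facts: dict[str, str] = field(default_factory=dict)  # stable key facts

    def to_prompt(self) -> str:
        # Render the bundle as a compact context block for the next model call.
        return "\n".join([
            "Decisions: " + "; ".join(self.decisions),
            "Timeline: " + "; ".join(self.timeline),
            "Outcomes: " + "; ".join(self.outcomes),
            "Facts: " + "; ".join(f"{k}={v}" for k, v in self.facts.items()),
        ])
```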
5. Retrieval-based reduction (RAG chunking)
Don't compress the entire context—just fetch the relevant parts. Use RAG to retrieve 3-5 chunks (1-2k tokens each) instead of dumping a 50k-token knowledge base into every prompt.
Pros: Scales to massive corpora. Only pay for what's retrieved.
Cons: Retrieval accuracy matters. Poor chunking or search degrades quality.
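A minimal retrieval sketch, assuming the openai Python SDK for embeddings; the embedding model and k are illustrative, and a real system would precompute and index the chunk vectors rather than embedding them per query:

```python
# Sketch: retrieve only the top-k relevant chunks instead of sending the
# whole knowledge base. Embedding model and k are illustrative choices.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(query: str, chunks: list[str], k: int = 4) -> list[str]:
    chunk_vecs = embed(chunks)          # in production, precompute and index these
    query_vec = embed([query])[0]
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[-k:][::-1]  # indices of the k most similar chunks
    return [chunks[i] for i in best]
```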
Implementation
Here's how to apply compression in a production pipeline:
Step 1: Measure baseline
Log every prompt's token count and cost. Calculate:
Cost per request = (input_tokens × input_price) + (output_tokens × output_price)
Identify your highest-cost requests. Those are your compression targets.
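A minimal sketch of that baseline calculation; the prices and log rows are illustrative:

```python
# Sketch: compute per-request cost from logged token counts.
# Prices are illustrative; substitute your model's current rates.
INPUT_PRICE = 0.01 / 1000    # $ per input token
OUTPUT_PRICE = 0.03 / 1000   # $ per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Example: rank logged requests by cost to find compression targets.
logs = [{"id": "a", "in": 48_000, "out": 900}, {"id": "b", "in": 6_000, "out": 400}]
for row in sorted(logs, key=lambda r: request_cost(r["in"], r["out"]), reverse=True):
    print(row["id"], round(request_cost(row["in"], row["out"]), 4))
```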
Step 2: Pick a compression strategy
- If you have long documents: Use LLMLingua or summarization.
- If you have repetitive prompts: Optimize them with DSPy.
- If you have conversation history: Use bundles (Thread-Transfer).
- If you have a knowledge base: Use RAG chunking + hybrid search.
Step 3: Test accuracy vs compression ratio
Run A/B tests:
- Control group: Full, uncompressed prompts.
- Test group: Compressed prompts at varying ratios (2x, 5x, 10x).
Measure task success rate, output quality (human eval or LLM-as-judge), and user satisfaction. Find the compression ratio where quality is acceptable and cost savings are maximized.
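A minimal harness sketch for that sweep; compress_at, run_task, and judge are hypothetical hooks standing in for your compressor, pipeline, and evaluator:

```python
# Sketch: sweep compression ratios and compare quality against the
# uncompressed baseline. `compress_at`, `run_task`, and `judge` are
# hypothetical hooks for your own compressor, pipeline, and evaluator.
def evaluate(prompts, ratios=(1, 2, 5, 10)):
    results = {}
    for ratio in ratios:
        scores = []
        for p in prompts:
            compressed = p if ratio == 1 else compress_at(p, ratio)
            output = run_task(compressed)
            scores.append(judge(p, output))   # human eval or LLM-as-judge, 0-1
        results[ratio] = sum(scores) / len(scores)
    return results  # e.g. {1: 0.94, 2: 0.94, 5: 0.91, 10: 0.83}
```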
Step 4: Deploy with monitoring
Roll out compression to production. Track:
- Token savings per request.
- Latency (compression adds overhead).
- Accuracy/quality metrics.
- Cost savings (in dollars).
Set alerts for quality drops. If compressed prompts start failing, roll back or adjust the compression threshold.
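A minimal sketch of what that monitoring might look like; metrics and disable_compression are hypothetical hooks into your own observability and feature-flag stack:

```python
# Sketch: per-request monitoring with a simple quality-drop rollback guard.
# `metrics` and `disable_compression` are hypothetical hooks into your stack.
QUALITY_FLOOR = 0.90   # rollback threshold chosen from the A/B tests above

def record(route, baseline_tokens, compressed_tokens, quality, latency_ms):
    metrics.gauge("compression.token_savings",
                  1 - compressed_tokens / baseline_tokens)
    metrics.gauge("compression.latency_ms", latency_ms)
    metrics.gauge("compression.quality", quality)
    if quality < QUALITY_FLOOR:
        disable_compression(route)   # fall back to full, uncompressed prompts
```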
Trade-offs
Compression isn't free. Here's what to watch:
- Latency: LLMLingua adds 50-200ms. Summarization adds a full LLM call. For latency-sensitive apps, measure carefully.
- Quality: Every compression technique loses some information. Test whether the loss is acceptable for your use case.
- Complexity: More moving parts = more things to break. Compression pipelines need monitoring and fallback logic.
ROI calculations
Let's say you're making 1M requests/month at 20k tokens/request on GPT-4 Turbo:
Input cost = 1M requests × 20k tokens × $0.01/1k = $200,000/month
Apply 5x compression (20k → 4k tokens):
New cost = 1M requests × 4k tokens × $0.01/1k = $40,000/month
Savings = $160,000/month (80%)
Even if compression adds $10k/month in infrastructure (LLMLingua servers, distillation overhead), you're still saving $150k/month.
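The same arithmetic as a small script, using the figures from the example above:

```python
# Sketch: the ROI math above as a reusable calculation.
# Figures match the example in the text; swap in your own volumes and prices.
requests_per_month = 1_000_000
tokens_before, tokens_after = 20_000, 4_000
price_per_1k = 0.01
infra_overhead = 10_000          # $/month for compression infrastructure

before = requests_per_month * tokens_before / 1000 * price_per_1k   # 200,000
after = requests_per_month * tokens_after / 1000 * price_per_1k     # 40,000
net_savings = before - after - infra_overhead                        # 150,000
print(f"${net_savings:,.0f}/month saved ({(before - after) / before:.0%} gross)")
```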
Best practices
- Compress early. The earlier in the pipeline you compress, the more you save downstream.
- Layer techniques. Use RAG to fetch relevant chunks, then compress those chunks with LLMLingua. Stacking works.
- Cache compressed prompts. If the same context is reused (system prompts, FAQs), compress once and cache the result (see the sketch after this list).
- Test before deploying. Never compress production prompts without validating quality on a test set.
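For the caching point above, a minimal sketch keyed on a content hash; the in-memory dict stands in for whatever cache your stack already uses:

```python
# Sketch: compress a reused context once and cache the result by content hash.
# The in-memory dict is illustrative; use Redis or similar in production.
import hashlib

_cache: dict[str, str] = {}

def compress_cached(context: str, compress) -> str:
    key = hashlib.sha256(context.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = compress(context)   # e.g. an LLMLingua or summarization call
    return _cache[key]
```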
Context compression is the fastest way to cut LLM costs without changing models or reducing usage. Teams using LLMLingua, bundles, or RAG chunking are seeing 60-80% savings with minimal quality loss. If you're burning tokens on long prompts, compression isn't optional—it's table stakes.
Learn more: How it works · Why bundles beat raw thread history