Thread Transfer
Context Window Management at Scale
Gemini 1.5 Pro offers up to 2M tokens. Claude 3 supports 200K. But attention degradation can set in long before the window is full, sometimes around 32K tokens. Here's how to manage context at scale without losing signal.
Jorgo Bardho
Founder, Thread Transfer
Context windows have exploded from 8K tokens to 100 million tokens in just two years. Llama 4 ships with a 10M token window. Magic.dev's LTM-2-Mini handles 100M tokens—equivalent to 750 novels or 10 million lines of code. Yet production systems still hit the wall. Why? Because context window management at scale isn't about size. It's about attention distribution, memory hierarchy, and cost control.
The Context Window Explosion
Since mid-2023, the longest LLM context windows have grown by about 30x per year. More importantly, models' ability to use that input effectively is improving even faster: on two long-context benchmarks, the input length where top models reach 80% accuracy has risen by over 250x in the past 9 months.
Today, most frontier models offer context windows of one to two million tokens. That amounts to a few thousand code files, which is still smaller than many enterprise production codebases. Any workflow that relies on simply adding everything to context eventually collides with a hard wall.
| Model | Context Window | Equivalent Capacity |
|---|---|---|
| GPT-3.5 (2022) | 4K tokens | ~3 pages |
| GPT-4 (2023) | 8K–32K tokens | ~24 pages |
| Claude 3.5 Sonnet | 200K tokens | ~150 pages |
| Gemini 1.5 Pro | 2M tokens | ~1,500 pages |
| Llama 4 | 10M tokens | ~7,500 pages |
| Magic.dev LTM-2-Mini | 100M tokens | ~75,000 pages |
The Context Rot Problem
Model attention is not uniform across long sequences of context. Chroma's research report on Context Rot (Hong et al., 2025) measured 18 LLMs and found that "models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows."
The self-attention computation scales quadratically with the number of tokens. Doubling the token count requires roughly fourfold the compute effort. That growth impacts inference latency, memory usage, and cost, especially when serving enterprise-scale workflows with tight response time requirements.
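To make the quadratic growth concrete, here is a back-of-the-envelope sketch; the head count is an arbitrary illustrative figure, not a property of any particular model.

```python
# Back-of-the-envelope illustration of quadratic attention scaling.
# The head count is arbitrary; the point is the n^2 growth, not absolute numbers.

def attention_score_entries(num_tokens: int, num_heads: int = 32) -> int:
    """Entries in one layer's attention score matrices (num_heads * n^2)."""
    return num_heads * num_tokens ** 2

for n in (32_000, 64_000, 128_000):
    print(f"{n:>7,} tokens -> {attention_score_entries(n):.3e} score entries per layer")

# Doubling the input quadruples the score matrices:
assert attention_score_entries(64_000) == 4 * attention_score_entries(32_000)
```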
Real-World Impact
- Attention dilution: In many cases, adding more documents degrades performance as the model's attention spreads too thin.
- Cost explosion: Token pricing turns naive "just stuff more code" strategies into untenable OpEx for organizations with large engineering teams.
- Latency degradation: In reported enterprise deployments, poorly chunked systems have shown 3-5x higher query latency during peak load.
Context Management Strategies at Scale
Effective agentic systems must treat context the way operating systems treat memory and CPU cycles: as finite resources to be budgeted, compacted, and intelligently paged. Here are the proven approaches.
1. Direct Context Injection (Baseline)
The baseline strategy is to put all required context directly in the LLM context window. Frontier LLMs are good at navigating large volumes of structured context, and this is the simplest engineering solution: use it until it stops working.
When context is clearly separated, for example with XML-style tags, the LLM can usually navigate whatever you provide. Direct injection works well for short-to-medium documents, but it eventually fills the context window and runs into the attention problems described above.
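A minimal sketch of direct injection with XML-style separators; the tag names and helper function are illustrative, not a required schema:

```python
# Minimal sketch of direct context injection with XML-style separators.
# The tag names and helper are illustrative, not a required schema.

def build_prompt(task: str, documents: dict[str, str]) -> str:
    """Wrap each document in named tags so the model can tell sources apart."""
    sections = [
        f"<document name={name!r}>\n{text}\n</document>"
        for name, text in documents.items()
    ]
    return (
        "<context>\n" + "\n\n".join(sections) + "\n</context>\n\n"
        "<task>\n" + task + "\n</task>"
    )

prompt = build_prompt(
    task="Summarize the open questions across these design docs.",
    documents={"auth-design.md": "...", "rate-limiting.md": "..."},
)
print(prompt)  # sent as-is to whichever chat/completions API you use
```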
2. Intelligent Scoping
Qodo, for example, does not try to stuff the entire repo into context for code review. It intelligently scopes context, ensuring reviews remain precise without hitting length or focus limits. This approach reduces token usage by 60-80% compared to naive full-context injection while maintaining or improving accuracy.
Key scoping techniques (a minimal sketch follows the list):
- Dependency graph analysis to identify relevant code paths
- Call hierarchy traversal to include only touched functions
- Test coverage mapping to prioritize critical code sections
- Historical change frequency analysis to focus on volatile areas
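Here is a hedged sketch of the dependency-graph technique: starting from the files touched by a change, walk the import graph to a fixed depth and stop at a token budget. The graph construction, per-file token estimate, and budget are simplifying assumptions, not any vendor's actual implementation.

```python
# Sketch of dependency-graph scoping under a token budget (simplified).

from collections import deque

def scope_files(dep_graph: dict[str, set[str]], changed: set[str],
                max_depth: int = 2, token_budget: int = 50_000,
                token_counts: dict[str, int] | None = None) -> list[str]:
    """Breadth-first walk from changed files, stopping at depth/budget limits."""
    token_counts = token_counts or {}
    selected, spent = [], 0
    queue = deque((f, 0) for f in changed)
    seen = set(changed)
    while queue:
        path, depth = queue.popleft()
        cost = token_counts.get(path, 1_000)  # rough default per-file estimate
        if spent + cost > token_budget:
            break
        selected.append(path)
        spent += cost
        if depth < max_depth:
            for dep in dep_graph.get(path, set()):
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, depth + 1))
    return selected

graph = {"billing.py": {"invoices.py", "tax.py"}, "invoices.py": {"db.py"}}
print(scope_files(graph, changed={"billing.py"}))
```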
3. Cache Augmented Generation (CAG)
CAG pre-loads a fixed set of documents and caches them as part of the prompt, so their content is available to the LLM every time it generates a completion. CAG can improve generation latency compared to RAG because there is no extra retrieval step to find relevant documents. Instead, all the documents are available in every prompt, as long as the cached documents and the user prompt together fit within the context window.
CAG is ideal for:
- Frequently accessed documentation (API specs, style guides)
- Static reference materials that rarely change
- User-specific context that persists across sessions
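Below is a minimal sketch of the CAG pattern, assuming your provider can cache or cheaply reuse an identical prompt prefix; the document contents and tag scheme are placeholders:

```python
# Minimal CAG-style sketch: a static document prefix is assembled once and
# reused verbatim on every call, so provider-side prompt caching (where
# offered) can avoid re-processing the identical prefix.
# Document contents here are placeholders.

STATIC_DOCS = {
    "api-spec.md": "...full API reference text...",
    "style-guide.md": "...team coding standards...",
}

CACHED_PREFIX = "\n\n".join(
    f"<document name={name!r}>\n{text}\n</document>"
    for name, text in STATIC_DOCS.items()
)

def build_prompt(question: str) -> str:
    # The prefix is byte-for-byte identical on every request; only the
    # trailing question changes, which is what makes prefix caching effective.
    return f"{CACHED_PREFIX}\n\n<question>\n{question}\n</question>"

print(build_prompt("Which endpoints require pagination?")[:200])
```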
4. Infinite Retrieval and Cascading KV Cache
Infinite Retrieval and Cascading KV Cache push LLMs closer to human-like context handling. Both sidestep the memory crunch without retraining, promising models that can answer a question buried in a 1M-token document or hold a coherent conversation for hours.
These techniques work by (simplified sketch after the list):
- Storing key-value pairs in a hierarchical cache structure
- Evicting less relevant context based on attention scores
- Retrieving evicted context on-demand when needed
- Compressing older context into summary representations
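The sketch below illustrates only the eviction idea in isolation: a fixed cache budget, score-based eviction, and summaries for what falls out. It is a simplification for intuition, not the actual Infinite Retrieval or Cascading KV Cache algorithms.

```python
# Simplified, illustrative score-based eviction with a fixed cache budget.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class CacheEntry:
    score: float                                  # accumulated attention mass
    span_id: int = field(compare=False)
    summary: str = field(compare=False, default="")

class BudgetedKVCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.entries: list[CacheEntry] = []       # min-heap ordered by score
        self.evicted: dict[int, str] = {}         # span_id -> summary fallback

    def add(self, span_id: int, score: float, summary: str) -> None:
        heapq.heappush(self.entries, CacheEntry(score, span_id, summary))
        while len(self.entries) > self.budget:
            victim = heapq.heappop(self.entries)  # lowest-score span leaves
            self.evicted[victim.span_id] = victim.summary

cache = BudgetedKVCache(budget=2)
for i, score in enumerate([0.9, 0.1, 0.6]):
    cache.add(i, score, summary=f"summary of span {i}")
print(sorted(e.span_id for e in cache.entries))   # -> [0, 2]; span 1 evicted
```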
5. Agentic Retrieval
Agentic Retrieval can handle much longer documents, as documents are only added to the context as required. The agent decides what context to retrieve based on the current task state and reasoning chain.
This approach is particularly effective for:
- Multi-step reasoning tasks requiring selective information gathering
- Code generation across large repositories (10K+ files)
- Long-running debugging sessions spanning multiple modules
- Research tasks requiring synthesis from diverse sources
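Here is a minimal agent-loop sketch of the idea; llm_decide and search_repo are placeholders standing in for a real model call with tool schemas and a real code-search index:

```python
# Hedged sketch of an agentic retrieval loop: the model decides which
# document to pull in next instead of receiving everything up front.

def llm_decide(task: str, context: list[str]) -> dict:
    """Placeholder: ask the model for its next action given current context."""
    # A real implementation would call a chat API with tool/function schemas.
    return {"action": "answer", "content": "..."}

def search_repo(query: str) -> str:
    """Placeholder for a code-search or embedding index lookup."""
    return f"(contents matching {query!r})"

def run_agent(task: str, max_steps: int = 8) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        decision = llm_decide(task, context)
        if decision["action"] == "retrieve":
            # Only now does the document enter the context window.
            context.append(search_repo(decision["query"]))
        else:
            return decision["content"]
    return "step budget exhausted"

print(run_agent("Why does the rate limiter double-count retries?"))
```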
Cost Economics of Long Context
Every additional token processed by an LLM incurs a direct cost. For large repositories or complex tasks, the difference between a curated, targeted prompt and a brute-force full-context approach can mean orders of magnitude in operational expenses.
| Strategy | Avg. Tokens/Query | Monthly Cost (1K queries) | Latency |
|---|---|---|---|
| Full repo injection | 500K | $7,500 | 8-12s |
| Intelligent scoping | 50K | $750 | 2-4s |
| CAG + scoping | 30K | $450 | 1-2s |
| Agentic retrieval | 25K | $375 | 3-5s |
Note: Costs estimated using Claude Opus 4 pricing ($15/1M input tokens), counting input tokens only. Actual costs vary based on model selection, output length, and caching strategies.
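The arithmetic behind the table is simple: tokens per query, times queries per month, times the per-token rate. The sketch below uses the same illustrative $15 per 1M input tokens; substitute your provider's actual pricing.

```python
# Reproduces the cost column above: cost = tokens_per_query * queries * rate.
# The rate is the illustrative $15 per 1M input tokens; output tokens ignored.

RATE_PER_TOKEN = 15 / 1_000_000  # USD per input token
QUERIES_PER_MONTH = 1_000

strategies = {
    "Full repo injection": 500_000,
    "Intelligent scoping": 50_000,
    "CAG + scoping": 30_000,
    "Agentic retrieval": 25_000,
}

for name, tokens_per_query in strategies.items():
    monthly = tokens_per_query * QUERIES_PER_MONTH * RATE_PER_TOKEN
    print(f"{name:<22} ${monthly:,.0f}/month")
```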
Architecture Best Practices
Research has shown that the context you provide influences the performance of the language model. The difference between a well-managed and poorly-managed context can be the difference between insightful, accurate responses and confused, inconsistent ones.
Tiered Context Strategy
- Hot tier (always in context): Critical system instructions, user preferences, current task state. Target: <5K tokens.
- Warm tier (cached, frequently accessed): API documentation, code standards, common utilities. Target: 20-50K tokens.
- Cold tier (retrieved on-demand): Historical conversations, full codebase, archived documentation. Retrieved selectively (a sketch combining the three tiers follows).
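Putting the three tiers together, here is a minimal sketch of a budgeted context assembler; the token heuristic, budgets, and retrieval hook are assumptions for illustration:

```python
# Minimal tiered-context assembler: hot tier always included, warm tier
# filled until the budget runs out, cold tier fetched on demand.

from typing import Callable

def rough_tokens(text: str) -> int:
    """Crude chars-to-tokens heuristic; swap in a real tokenizer if available."""
    return max(1, len(text) // 4)

def assemble_context(hot: str, warm_cache: list[str],
                     retrieve_cold: Callable[[str, int], str],
                     query: str, budget: int = 60_000) -> str:
    parts = [hot]                                # hot tier: always included
    spent = rough_tokens(hot)
    for block in warm_cache:                     # warm tier: until budget runs out
        cost = rough_tokens(block)
        if spent + cost > budget:
            break
        parts.append(block)
        spent += cost
    if spent < budget:                           # cold tier: fetched on demand
        parts.append(retrieve_cold(query, budget - spent))
    return "\n\n".join(parts)

context = assemble_context(
    hot="<system>Be concise. Current task: fix the flaky auth test.</system>",
    warm_cache=["<doc name='auth-api.md'>...</doc>"],
    retrieve_cold=lambda q, remaining: f"<retrieved query={q!r}>...</retrieved>",
    query="auth test flakiness",
)
print(context)
```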
Context Structuring
Use explicit separators and metadata to help models navigate large contexts:
```xml
<context_bundle>
  <metadata>
    <source>github/repo-name</source>
    <timestamp>2025-07-11T10:30:00Z</timestamp>
    <relevance_score>0.92</relevance_score>
  </metadata>
  <content>
    // Actual file content
  </content>
</context_bundle>
```
Monitoring and Telemetry
Track these metrics to optimize context window usage (a computation sketch follows the list):
- Token utilization ratio: Actual tokens used / context window size
- Attention entropy: How evenly attention is distributed (lower is better)
- Context hit rate: % of context actually referenced in output
- Retrieval precision: Relevance of retrieved context to final answer
- Cost per successful query: Total token cost / successful outputs
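A hedged sketch of computing two of these metrics from per-query telemetry; the log fields are illustrative, not a fixed schema:

```python
# Computes token utilization ratio and cost per successful query from
# per-query logs. Field names are placeholders; adapt to your telemetry.

from statistics import mean

query_log = [
    {"tokens_in": 42_000, "window": 200_000, "cost_usd": 0.63, "success": True},
    {"tokens_in": 180_000, "window": 200_000, "cost_usd": 2.70, "success": False},
]

utilization = mean(q["tokens_in"] / q["window"] for q in query_log)
successes = sum(q["success"] for q in query_log)
cost_per_success = sum(q["cost_usd"] for q in query_log) / max(successes, 1)

print(f"token utilization ratio: {utilization:.1%}")
print(f"cost per successful query: ${cost_per_success:.2f}")
```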
Thread Transfer's Context Bundle Approach
Thread Transfer bundles are designed for the warm tier: conversation history and project context that's too large for hot storage but too important for cold retrieval. Bundles compress 100-message Slack threads into 2-5K token summaries while preserving decision rationale, key stakeholders, and action items.
This approach delivers:
- 40-80% token savings compared to raw thread injection
- Deterministic context: Same bundle always produces the same context
- Portable across sessions: Load bundle once, use it everywhere
- Composable with other strategies: Combine with CAG, retrieval, or direct injection
The Future: Context as Infrastructure
The expansion of context windows to millions of tokens represents a significant advancement in LLM capabilities, but it's not a silver bullet. The most successful implementations will be those that thoughtfully consider when to leverage large contexts, when to rely on retrieval, and how to balance performance, cost, and quality considerations.
As Microsoft's deputy CTO Sam Schillace noted, "To be autonomous you have to carry context through a bunch of actions, but the models are very disconnected and don't have continuity the way we do." The answer isn't just bigger windows—it's smarter context orchestration.
Organizations that master context window management at scale will:
- Reduce AI infrastructure costs by 60-90%
- Improve response latency by 3-5x
- Increase output accuracy and reduce hallucinations
- Enable truly autonomous, long-running AI agents
Getting Started
Start with direct context injection. Measure token usage, latency, and cost per query. When you hit limits, add intelligent scoping. Once scoping is optimized, layer in caching for frequently accessed materials. Finally, implement agentic retrieval for edge cases and long-tail queries.
The goal isn't to use the largest context window available. The goal is to use the right amount of context, structured the right way, delivered at the right time.
Learn more: How it works · Why bundles beat raw thread history