Thread Transfer
Context Window Management at Scale
Gemini 1.5 Pro offers up to 2M tokens. Claude 3 supports 200K. But attention degradation can set in long before the window is full, sometimes around 32K tokens. Here's how to manage context at scale without losing signal.
Jorgo Bardho
Founder, Thread Transfer
Context windows have exploded from 8K tokens to 100 million tokens in just two years. Llama 4 ships with a 10M token window. Magic.dev's LTM-2-Mini handles 100M tokens—equivalent to 750 novels or 10 million lines of code. Yet production systems still hit the wall. Why? Because context window management at scale isn't about size. It's about attention distribution, memory hierarchy, and cost control.
The Context Window Explosion
Since mid-2023, the longest LLM context windows have grown by about 30x per year. More importantly, models' ability to use that input effectively is improving even faster: on two long-context benchmarks, the input length where top models reach 80% accuracy has risen by over 250x in the past 9 months.
Today, most frontier models offer context windows of one to two million tokens. That amounts to a few thousand code files, which is still smaller than many enterprise production codebases. Any workflow that relies on simply adding everything to context eventually collides with a hard wall.
| Model | Context Window | Equivalent Capacity |
|---|---|---|
| GPT-3.5 (2022) | 4K tokens | ~3 pages |
| GPT-4 (2023) | 8K–32K tokens | ~24 pages |
| Claude 3.5 Sonnet | 200K tokens | ~150 pages |
| Gemini 1.5 Pro | 2M tokens | ~1,500 pages |
| Llama 4 | 10M tokens | ~7,500 pages |
| Magic.dev LTM-2-Mini | 100M tokens | ~75,000 pages |
The Context Rot Problem
Model attention is not uniform across long sequences of context. Chroma's research report on Context Rot (Hong et al., 2025) measured 18 LLMs and found that "models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows."
The self-attention computation scales quadratically with the number of tokens. Doubling the token count requires roughly fourfold the compute effort. That growth impacts inference latency, memory usage, and cost, especially when serving enterprise-scale workflows with tight response time requirements.
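To make the quadratic growth concrete, here is a back-of-the-envelope sketch; the head count is an arbitrary illustrative figure, not a property of any particular model.

```python
# Back-of-the-envelope illustration of quadratic attention scaling.
# The head count is arbitrary; the point is the n^2 growth, not absolute numbers.

def attention_score_entries(num_tokens: int, num_heads: int = 32) -> int:
    """Entries in one layer's attention score matrices (num_heads * n^2)."""
    return num_heads * num_tokens ** 2

for n in (32_000, 64_000, 128_000):
    print(f"{n:>7,} tokens -> {attention_score_entries(n):.3e} score entries per layer")

# Doubling the input quadruples the score matrices:
assert attention_score_entries(64_000) == 4 * attention_score_entries(32_000)
```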
Real-World Impact
- Attention dilution: In many cases, adding more documents degrades performance as the model's attention spreads too thin.
- Cost explosion: Token pricing turns naive "just stuff more code" strategies into untenable OpEx for organizations with large engineering teams.
- Latency degradation: In reported enterprise deployments, poorly chunked systems have shown 3-5x higher query latency during peak load.
Context Management Strategies at Scale
Effective agentic systems must treat context the way operating systems treat memory and CPU cycles: as finite resources to be budgeted, compacted, and intelligently paged. Here are the proven approaches.
1. Direct Context Injection (Baseline)
The baseline strategy is to put all required context directly in the LLM context window. Frontier LLMs are good at navigating large volumes of structured context, and this is the simplest engineering solution: use it until it stops working.
When context is clearly separated, for example with XML-style tags, the LLM can usually navigate whatever you provide. Direct injection works well for short-to-medium documents, but it eventually fills the context window and runs into the attention problems described above.
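A minimal sketch of direct injection with XML-style separators; the tag names and helper function are illustrative, not a required schema:

```python
# Minimal sketch of direct context injection with XML-style separators.
# The tag names and helper are illustrative, not a required schema.

def build_prompt(task: str, documents: dict[str, str]) -> str:
    """Wrap each document in named tags so the model can tell sources apart."""
    sections = [
        f"<document name={name!r}>\n{text}\n</document>"
        for name, text in documents.items()
    ]
    return (
        "<context>\n" + "\n\n".join(sections) + "\n</context>\n\n"
        "<task>\n" + task + "\n</task>"
    )

prompt = build_prompt(
    task="Summarize the open questions across these design docs.",
    documents={"auth-design.md": "...", "rate-limiting.md": "..."},
)
print(prompt)  # sent as-is to whichever chat/completions API you use
```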
2. Intelligent Scoping
Qodo, for example, does not try to stuff the entire repo into context for code review. It intelligently scopes context, ensuring reviews remain precise without hitting length or focus limits. This approach reduces token usage by 60-80% compared to naive full-context injection while maintaining or improving accuracy.
Key scoping techniques (a minimal sketch follows the list):
- Dependency graph analysis to identify relevant code paths
- Call hierarchy traversal to include only touched functions
- Test coverage mapping to prioritize critical code sections
- Historical change frequency analysis to focus on volatile areas
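Here is a hedged sketch of the dependency-graph technique: starting from the files touched by a change, walk the import graph to a fixed depth and stop at a token budget. The graph construction, per-file token estimate, and budget are simplifying assumptions, not any vendor's actual implementation.

```python
# Sketch of dependency-graph scoping under a token budget (simplified).

from collections import deque

def scope_files(dep_graph: dict[str, set[str]], changed: set[str],
                max_depth: int = 2, token_budget: int = 50_000,
                token_counts: dict[str, int] | None = None) -> list[str]:
    """Breadth-first walk from changed files, stopping at depth/budget limits."""
    token_counts = token_counts or {}
    selected, spent = [], 0
    queue = deque((f, 0) for f in changed)
    seen = set(changed)
    while queue:
        path, depth = queue.popleft()
        cost = token_counts.get(path, 1_000)  # rough default per-file estimate
        if spent + cost > token_budget:
            break
        selected.append(path)
        spent += cost
        if depth < max_depth:
            for dep in dep_graph.get(path, set()):
                if dep not in seen:
                    seen.add(dep)
                    queue.append((dep, depth + 1))
    return selected

graph = {"billing.py": {"invoices.py", "tax.py"}, "invoices.py": {"db.py"}}
print(scope_files(graph, changed={"billing.py"}))
```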
3. Cache Augmented Generation (CAG)
CAG pre-loads a fixed set of documents and caches them as part of the prompt, so their content is available to the LLM every time it generates a completion. CAG can improve generation latency compared to RAG because there is no extra retrieval step to find relevant documents. Instead, all the documents are available in every prompt, as long as the cached documents and the user prompt together fit within the context window.
CAG is ideal for:
- Frequently accessed documentation (API specs, style guides)
- Static reference materials that rarely change
- User-specific context that persists across sessions
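Below is a minimal sketch of the CAG pattern, assuming your provider can cache or cheaply reuse an identical prompt prefix; the document contents and tag scheme are placeholders:

```python
# Minimal CAG-style sketch: a static document prefix is assembled once and
# reused verbatim on every call, so provider-side prompt caching (where
# offered) can avoid re-processing the identical prefix.
# Document contents here are placeholders.

STATIC_DOCS = {
    "api-spec.md": "...full API reference text...",
    "style-guide.md": "...team coding standards...",
}

CACHED_PREFIX = "\n\n".join(
    f"<document name={name!r}>\n{text}\n</document>"
    for name, text in STATIC_DOCS.items()
)

def build_prompt(question: str) -> str:
    # The prefix is byte-for-byte identical on every request; only the
    # trailing question changes, which is what makes prefix caching effective.
    return f"{CACHED_PREFIX}\n\n<question>\n{question}\n</question>"

print(build_prompt("Which endpoints require pagination?")[:200])
```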
4. Infinite Retrieval and Cascading KV Cache
Infinite Retrieval and Cascading KV Cache push LLMs closer to human-like context handling. Both sidestep the memory crunch without retraining, promising models that can answer a question buried in a 1M-token document or hold a coherent conversation for hours.
These techniques work by (simplified sketch after the list):
- Storing key-value pairs in a hierarchical cache structure
- Evicting less relevant context based on attention scores
- Retrieving evicted context on-demand when needed
- Compressing older context into summary representations
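The sketch below illustrates only the eviction idea in isolation: a fixed cache budget, score-based eviction, and summaries for what falls out. It is a simplification for intuition, not the actual Infinite Retrieval or Cascading KV Cache algorithms.

```python
# Simplified, illustrative score-based eviction with a fixed cache budget.

import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class CacheEntry:
    score: float                                  # accumulated attention mass
    span_id: int = field(compare=False)
    summary: str = field(compare=False, default="")

class BudgetedKVCache:
    def __init__(self, budget: int):
        self.budget = budget
        self.entries: list[CacheEntry] = []       # min-heap ordered by score
        self.evicted: dict[int, str] = {}         # span_id -> summary fallback

    def add(self, span_id: int, score: float, summary: str) -> None:
        heapq.heappush(self.entries, CacheEntry(score, span_id, summary))
        while len(self.entries) > self.budget:
            victim = heapq.heappop(self.entries)  # lowest-score span leaves
            self.evicted[victim.span_id] = victim.summary

cache = BudgetedKVCache(budget=2)
for i, score in enumerate([0.9, 0.1, 0.6]):
    cache.add(i, score, summary=f"summary of span {i}")
print(sorted(e.span_id for e in cache.entries))   # -> [0, 2]; span 1 evicted
```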
5. Agentic Retrieval
Agentic Retrieval can handle much longer documents, as documents are only added to the context as required. The agent decides what context to retrieve based on the current task state and reasoning chain.
This approach is particularly effective for:
- Multi-step reasoning tasks requiring selective information gathering
- Code generation across large repositories (10K+ files)
- Long-running debugging sessions spanning multiple modules
- Research tasks requiring synthesis from diverse sources
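Here is a minimal agent-loop sketch of the idea; llm_decide and search_repo are placeholders standing in for a real model call with tool schemas and a real code-search index:

```python
# Hedged sketch of an agentic retrieval loop: the model decides which
# document to pull in next instead of receiving everything up front.

def llm_decide(task: str, context: list[str]) -> dict:
    """Placeholder: ask the model for its next action given current context."""
    # A real implementation would call a chat API with tool/function schemas.
    return {"action": "answer", "content": "..."}

def search_repo(query: str) -> str:
    """Placeholder for a code-search or embedding index lookup."""
    return f"(contents matching {query!r})"

def run_agent(task: str, max_steps: int = 8) -> str:
    context: list[str] = []
    for _ in range(max_steps):
        decision = llm_decide(task, context)
        if decision["action"] == "retrieve":
            # Only now does the document enter the context window.
            context.append(search_repo(decision["query"]))
        else:
            return decision["content"]
    return "step budget exhausted"

print(run_agent("Why does the rate limiter double-count retries?"))
```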
Cost Economics of Long Context
Every additional token processed by an LLM incurs a direct cost. For large repositories or complex tasks, the difference between a curated, targeted prompt and a brute-force full-context approach can mean orders of magnitude in operational expenses.
| Strategy | Avg. Tokens/Query | Monthly Cost (1K queries) | Latency |
|---|---|---|---|
| Full repo injection | 500K | $7,500 | 8-12s |
| Intelligent scoping | 50K | $750 | 2-4s |
| CAG + scoping | 30K | $450 | 1-2s |
| Agentic retrieval | 25K | $375 | 3-5s |
Note: Costs estimated using Claude Opus 4 pricing ($15/1M input tokens), counting input tokens only. Actual costs vary based on model selection, output length, and caching strategies.
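The arithmetic behind the table is simple: tokens per query, times queries per month, times the per-token rate. The sketch below uses the same illustrative $15 per 1M input tokens; substitute your provider's actual pricing.

```python
# Reproduces the cost column above: cost = tokens_per_query * queries * rate.
# The rate is the illustrative $15 per 1M input tokens; output tokens ignored.

RATE_PER_TOKEN = 15 / 1_000_000  # USD per input token
QUERIES_PER_MONTH = 1_000

strategies = {
    "Full repo injection": 500_000,
    "Intelligent scoping": 50_000,
    "CAG + scoping": 30_000,
    "Agentic retrieval": 25_000,
}

for name, tokens_per_query in strategies.items():
    monthly = tokens_per_query * QUERIES_PER_MONTH * RATE_PER_TOKEN
    print(f"{name:<22} ${monthly:,.0f}/month")
```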
Architecture Best Practices
Research has shown that the context you provide influences the performance of the language model. The difference between a well-managed and poorly-managed context can be the difference between insightful, accurate responses and confused, inconsistent ones.
Tiered Context Strategy
- Hot tier (always in context): Critical system instructions, user preferences, current task state. Target: <5K tokens.
- Warm tier (cached, frequently accessed): API documentation, code standards, common utilities. Target: 20-50K tokens.
- Cold tier (retrieved on-demand): Historical conversations, full codebase, archived documentation. Retrieved selectively (a sketch combining the three tiers follows).
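Putting the three tiers together, here is a minimal sketch of a budgeted context assembler; the token heuristic, budgets, and retrieval hook are assumptions for illustration:

```python
# Minimal tiered-context assembler: hot tier always included, warm tier
# filled until the budget runs out, cold tier fetched on demand.

from typing import Callable

def rough_tokens(text: str) -> int:
    """Crude chars-to-tokens heuristic; swap in a real tokenizer if available."""
    return max(1, len(text) // 4)

def assemble_context(hot: str, warm_cache: list[str],
                     retrieve_cold: Callable[[str, int], str],
                     query: str, budget: int = 60_000) -> str:
    parts = [hot]                                # hot tier: always included
    spent = rough_tokens(hot)
    for block in warm_cache:                     # warm tier: until budget runs out
        cost = rough_tokens(block)
        if spent + cost > budget:
            break
        parts.append(block)
        spent += cost
    if spent < budget:                           # cold tier: fetched on demand
        parts.append(retrieve_cold(query, budget - spent))
    return "\n\n".join(parts)

context = assemble_context(
    hot="<system>Be concise. Current task: fix the flaky auth test.</system>",
    warm_cache=["<doc name='auth-api.md'>...</doc>"],
    retrieve_cold=lambda q, remaining: f"<retrieved query={q!r}>...</retrieved>",
    query="auth test flakiness",
)
print(context)
```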
Context Structuring
Use explicit separators and metadata to help models navigate large contexts:
```xml
<context_bundle>
  <metadata>
    <source>github/repo-name</source>
    <timestamp>2025-07-11T10:30:00Z</timestamp>
    <relevance_score>0.92</relevance_score>
  </metadata>
  <content>
    // Actual file content
  </content>
</context_bundle>
```
Monitoring and Telemetry
Track these metrics to optimize context window usage (a computation sketch follows the list):
- Token utilization ratio: Actual tokens used / context window size
- Attention entropy: How evenly attention is distributed (lower is better)
- Context hit rate: % of context actually referenced in output
- Retrieval precision: Relevance of retrieved context to final answer
- Cost per successful query: Total token cost / successful outputs
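A hedged sketch of computing two of these metrics from per-query telemetry; the log fields are illustrative, not a fixed schema:

```python
# Computes token utilization ratio and cost per successful query from
# per-query logs. Field names are placeholders; adapt to your telemetry.

from statistics import mean

query_log = [
    {"tokens_in": 42_000, "window": 200_000, "cost_usd": 0.63, "success": True},
    {"tokens_in": 180_000, "window": 200_000, "cost_usd": 2.70, "success": False},
]

utilization = mean(q["tokens_in"] / q["window"] for q in query_log)
successes = sum(q["success"] for q in query_log)
cost_per_success = sum(q["cost_usd"] for q in query_log) / max(successes, 1)

print(f"token utilization ratio: {utilization:.1%}")
print(f"cost per successful query: ${cost_per_success:.2f}")
```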
Thread Transfer's Context Bundle Approach
Thread Transfer bundles are designed for the warm tier: conversation history and project context that's too large for hot storage but too important for cold retrieval. Bundles compress 100-message Slack threads into 2-5K token summaries while preserving decision rationale, key stakeholders, and action items.
This approach delivers:
- 40-80% token savings compared to raw thread injection
- Deterministic context: Same bundle always produces the same context
- Portable across sessions: Load bundle once, use it everywhere
- Composable with other strategies: Combine with CAG, retrieval, or direct injection
The Future: Context as Infrastructure
The expansion of context windows to millions of tokens represents a significant advancement in LLM capabilities, but it's not a silver bullet. The most successful implementations will be those that thoughtfully consider when to leverage large contexts, when to rely on retrieval, and how to balance performance, cost, and quality considerations.
As Microsoft's deputy CTO Sam Schillace noted, "To be autonomous you have to carry context through a bunch of actions, but the models are very disconnected and don't have continuity the way we do." The answer isn't just bigger windows—it's smarter context orchestration.
Organizations that master context window management at scale will:
- Reduce AI infrastructure costs by 60-90%
- Improve response latency by 3-5x
- Increase output accuracy and reduce hallucinations
- Enable truly autonomous, long-running AI agents
Getting Started
Start with direct context injection. Measure token usage, latency, and cost per query. When you hit limits, add intelligent scoping. Once scoping is optimized, layer in caching for frequently accessed materials. Finally, implement agentic retrieval for edge cases and long-tail queries.
The goal isn't to use the largest context window available. The goal is to use the right amount of context, structured the right way, delivered at the right time.
Learn more: How it works · Why bundles beat raw thread history