Thread Transfer

The context rot problem: Why more tokens isn't always better

Needle-in-haystack benchmarks reveal a painful truth: accuracy drops as context grows. Here's the science and the fix.

Jorgo Bardho

Founder, Thread Transfer

March 4, 2025 · 8 min read
context window · token limits · LLM performance
[Figure: accuracy decay as the context window increases]

More context should mean better answers, right? Not always. Research shows that as context windows grow, model accuracy can degrade. This phenomenon—called context rot or the "lost in the middle" problem—means stuffing your prompt full of information can backfire.

What is context rot?

Context rot happens when a model's performance drops as you increase the amount of context in the prompt. Even though the model technically "sees" all the information, it struggles to identify and use the relevant parts—especially if they're buried in the middle of a long passage.

Think of it like this: if you hand someone a 200-page document and ask them to find one critical fact, they might miss it even if it's right there. LLMs have the same problem at scale.

The research

The classic benchmark is the needle-in-haystack test: researchers hide a specific fact (the "needle") in a long document (the "haystack") and ask the model to retrieve it. Results consistently show:

  • Primacy and recency bias: Models perform best on information near the start or end of the context.
  • Middle blindness: Facts buried in the middle are frequently ignored or misremembered.
  • Diminishing returns: Beyond a certain threshold (often 20-40k tokens), adding more context doesn't improve accuracy and can actively hurt it.

A 2024 Stanford study found that GPT-4's retrieval accuracy dropped from 95% at 4k tokens to 70% at 64k tokens when the target information was in the middle third of the context.

Why it happens

Three factors contribute to context rot:

  1. Attention dilution. Transformer models use attention mechanisms to weigh which parts of the input matter most. When context is long, attention gets spread thin, and the model "loses focus."
  2. Noise overwhelms signal. Long contexts often include irrelevant information. The more noise, the harder it is for the model to identify what matters.
  3. Training distribution mismatch. Most LLMs are trained on shorter documents. When you push them to 100k+ tokens, you're outside the distribution they've seen, and performance degrades.
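
A toy illustration of attention dilution: with softmax attention, even a token whose raw score beats everything else receives a smaller share of the attention mass as the sequence grows. This is a single-head, made-up-numbers sketch for intuition only; real models have many heads, layers, and positional schemes that partially compensate.

```python
import numpy as np

def needle_attention_weight(context_len: int, needle_boost: float = 2.0) -> float:
    """Softmax attention over `context_len` tokens where one 'needle' token
    scores `needle_boost` higher than every other token. Returns the
    attention weight the needle actually receives."""
    scores = np.zeros(context_len)
    scores[context_len // 2] = needle_boost  # needle buried in the middle
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[context_len // 2]

for n in [1_000, 8_000, 64_000, 128_000]:
    print(f"{n:>7} tokens -> needle weight {needle_attention_weight(n):.5f}")
```

As the haystack grows from 1k to 128k tokens, the needle's share of attention shrinks by two orders of magnitude even though its score never changes. That is one intuition for why retrieval from long, noisy contexts gets harder.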

Solutions

The fix isn't to avoid long context—it's to curate and structure what goes in. Here's how:

  • Compress intelligently. Use summarization, distillation, or tools like LLMLingua to shrink context while preserving meaning (a compression sketch follows this list). Thread Transfer's bundles are built for this: they distill long threads into compact, structured blocks that keep the signal and drop the noise.
  • Chunk and retrieve. Don't dump everything at once. Use RAG to fetch only the relevant chunks for each query; semantic search, hybrid search, and query augmentation help you surface the right pieces (see the retrieval sketch below).
  • Front-load critical info. If something is essential, put it near the start or end of the prompt. Don't bury it in the middle.
  • Use structured formats. JSON, markdown tables, and bullet lists make it easier for models to parse and extract information than long paragraphs.
  • Test retrieval accuracy. Run your own needle-in-haystack tests on your context payloads. Measure whether the model can consistently find facts you've embedded.
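
For the "compress intelligently" step, here is a minimal sketch using the open-source llmlingua package. The argument names, input file, and token target are assumptions based on its published PromptCompressor API; check the current docs before relying on the exact signature.

```python
# pip install llmlingua
from llmlingua import PromptCompressor

# Assumption: PromptCompressor / compress_prompt as documented in the LLMLingua repo.
# The default model is large; the docs also list smaller LLMLingua-2 variants.
compressor = PromptCompressor()

long_context = open("thread_transcript.txt").read()  # hypothetical input file

result = compressor.compress_prompt(
    long_context.split("\n\n"),          # pass context as a list of chunks
    instruction="Answer questions about this support thread.",
    question="What did the customer decide about the refund?",
    target_token=2000,                   # shrink to roughly a 2k-token budget
)

print(result["compressed_prompt"])
print("compression ratio:", result.get("ratio", "n/a"))
```

The same pattern works with any summarizer: pick a fixed token budget and compress toward it while keeping the facts the downstream question actually needs.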
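
And for "chunk and retrieve", the core loop is small: embed the chunks once, embed the query, rank by cosine similarity, and send only the top few chunks to the model. The embed() function below is a toy placeholder for whatever embedding model you use.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Toy placeholder: hashed bag-of-words vectors. Swap in a real
    embedding model (a sentence-transformer, an API call, etc.)."""
    dim = 256
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, hash(word) % dim] += 1.0
    return vecs

def top_k_chunks(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity."""
    chunk_vecs = embed(chunks)                     # shape: (n_chunks, dim)
    query_vec = embed([query])[0]                  # shape: (dim,)
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = chunk_vecs @ query_vec
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

# Usage: build the prompt from only the retrieved chunks, not the whole corpus.
# prompt = "\n\n".join(top_k_chunks(user_question, all_chunks, k=5)) + "\n\n" + user_question
```

The point is that the model sees a few thousand high-signal tokens per query instead of the entire corpus, which sidesteps the middle-of-context problem entirely.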

Best practices

  • Set a token budget. Treat context like memory. Allocate a fixed budget (e.g., 20k tokens) and compress or filter to fit within it.
  • Track accuracy over context length. Log how model performance changes as context grows; you'll find a sweet spot where accuracy plateaus or drops (the harness sketch after this list is one way to measure it).
  • Prefer quality over quantity. One tightly scoped, high-signal bundle beats ten rambling transcripts.
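
A needle-in-haystack harness doesn't need much machinery: plant a known fact at a chosen depth in filler text of increasing length, ask the model for it, and log whether the answer contains it. The call_model() function, the needle, and the filler text below are stand-ins for your own stack and payloads.

```python
def call_model(prompt: str) -> str:
    """Stand-in for your LLM call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

NEEDLE = "The access code for the staging server is 48151623."
QUESTION = "What is the access code for the staging server?"
FILLER = "The quarterly report was reviewed and filed without further comment. "

def run_trial(context_words: int, depth: float) -> bool:
    """Build a haystack of roughly `context_words` words, insert the needle at
    `depth` (0.0 = start, 1.0 = end), and check whether the model finds it."""
    filler_words = (FILLER * (context_words // len(FILLER.split()) + 1)).split()
    haystack = filler_words[:context_words]
    haystack.insert(int(depth * len(haystack)), NEEDLE)
    prompt = " ".join(haystack) + f"\n\nQuestion: {QUESTION}\nAnswer:"
    return "48151623" in call_model(prompt)

# Sweep context length and needle depth; log accuracy for each cell.
for words in [1_000, 5_000, 20_000, 50_000]:
    for depth in [0.1, 0.5, 0.9]:
        hits = sum(run_trial(words, depth) for _ in range(10))
        print(f"{words:>6} words, depth {depth:.1f}: {hits}/10 correct")
```

Plot hits against context length and depth and you have your own "lost in the middle" curve, measured on the model and the kind of payloads you actually ship.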

Context rot is real. More tokens don't automatically mean better answers. The teams winning in 2025 are the ones who treat context as a scarce resource, compress it ruthlessly, and deliver only what the model needs.