
Thread Transfer

Chunking strategies for document RAG: What the research says

Too small and you lose context. Too big and you waste tokens. Here's what the research says about optimal chunk sizes.

Jorgo Bardho

Founder, Thread Transfer

March 21, 2025 · 10 min read

Tags: document chunking · RAG chunk size · text splitting
[Image: Document chunking comparison]

Chunking is the most underrated lever in RAG accuracy. Split docs into 50-token chunks and you lose all context. Split into 2000-token chunks and you dilute relevance and waste tokens. Research shows optimal chunk size varies by document type, query complexity, and model context window. This guide breaks down what works, what doesn't, and how to test your way to the right strategy.

Why chunking matters: The precision-context trade-off

Vector search retrieves entire chunks. If a chunk is too small, it lacks the context needed to answer the question. If it's too large, it contains irrelevant information that wastes tokens and confuses the model. The goal: maximize signal-to-noise in every retrieved chunk.

Example: A user asks "What's our refund policy for enterprise customers?"

  • 50-token chunk: "Enterprise customers are eligible for refunds." (No details on conditions or process)
  • 500-token chunk: "Enterprise customers are eligible for refunds within 30 days if purchased via the annual plan. Contact support@company.com with invoice ID..." (Perfect)
  • 2000-token chunk: Entire refund policy doc including consumer, SMB, and enterprise sections. (Too much noise)

The 500-token chunk wins because it contains the full answer without extraneous context.

Strategies compared: Fixed vs semantic vs hierarchical

Fixed-size chunking. Split every X tokens (e.g., 400 tokens per chunk) with Y% overlap (e.g., 20%). Simple to implement. Works well for homogeneous docs (wikis, FAQs). Ignores section boundaries, so you might split a bullet list or code block mid-sentence. Use this as your baseline.
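As a reference point, here is a minimal sketch of a fixed-size chunker measured in tokens, assuming the tiktoken tokenizer; the 400-token size and 20% overlap mirror the baseline above, and the function name is illustrative.

```python
# Minimal fixed-size chunker: split on token count with percentage overlap.
# Assumes the tiktoken tokenizer; defaults mirror the 400-token / 20% baseline.
import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 400, overlap_pct: float = 0.2) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = int(chunk_size * (1 - overlap_pct))  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```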

Semantic chunking. Use NLP (sentence boundaries, paragraph breaks, heading detection) to split at logical boundaries. Preserves structure. Outperforms fixed-size by 10-15% on structured docs (Confluence, Notion, Markdown). Costs more compute upfront but worth it if your docs have clear sections. Libraries: LangChain's RecursiveCharacterTextSplitter, LlamaIndex's SentenceSplitter.
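Here's a sketch of boundary-aware splitting with LangChain's RecursiveCharacterTextSplitter, sized in tokens; the import path assumes the langchain-text-splitters package (older versions expose it under langchain.text_splitter), and the separators and sizes are illustrative defaults.

```python
# Split at paragraph, then sentence boundaries, measuring size in tokens.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=400,        # target tokens per chunk
    chunk_overlap=80,      # ~20% overlap
    separators=["\n\n", "\n", ". ", " "],  # prefer paragraph, then sentence breaks
)

chunks = splitter.split_text(document_text)  # document_text: your raw doc string
```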

Hierarchical chunking. Create parent chunks (1000 tokens) and child chunks (200 tokens). Index child chunks for retrieval. When a child matches, return the parent for context. This gives you precise retrieval with broad context. Best for long-form docs (research papers, manuals, legal contracts). Adds complexity but improves accuracy 20-30% on multi-hop queries.
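A rough sketch of parent-child indexing, reusing the fixed_size_chunks helper from the earlier snippet; the id scheme, storage dicts, and vector_store call are illustrative placeholders.

```python
# Parent-child chunking: index small child chunks for precise retrieval,
# but return their larger parent chunk for context.
def build_parent_child_index(doc_id: str, text: str):
    parents, children = {}, []
    for p_idx, parent in enumerate(fixed_size_chunks(text, chunk_size=1000, overlap_pct=0.0)):
        parent_id = f"{doc_id}:parent:{p_idx}"
        parents[parent_id] = parent
        for c_idx, child in enumerate(fixed_size_chunks(parent, chunk_size=200, overlap_pct=0.1)):
            # Embed and index the child text; store the parent_id as metadata.
            children.append({"id": f"{parent_id}:child:{c_idx}",
                             "text": child,
                             "parent_id": parent_id})
    return parents, children

# At query time: vector-search over children, then swap in the parent text.
# matched_children = vector_store.search(query, k=5)                # pseudocode
# context = [parents[c["parent_id"]] for c in matched_children]
```

Both LangChain and LlamaIndex ship packaged retrievers for this parent-child pattern if you'd rather not maintain your own mapping.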

Agentic/dynamic chunking. Let an LLM decide chunk boundaries based on semantic coherence. Emerging technique, not production-ready for most teams yet. Expensive (LLM call per doc) but yields the cleanest chunks. Watch this space for 2026.
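If you want to experiment anyway, here's a minimal sketch assuming an OpenAI-style chat completions client; the model name, prompt, and marker-based parsing are purely illustrative.

```python
# Agentic chunking sketch: ask an LLM to propose split points (one call per doc).
from openai import OpenAI

client = OpenAI()

def llm_chunk_boundaries(text: str) -> list[str]:
    prompt = (
        "Split the following document into self-contained sections. "
        "Insert the marker <<<SPLIT>>> between sections and return the full text.\n\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return [c.strip() for c in resp.choices[0].message.content.split("<<<SPLIT>>>") if c.strip()]
```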

Research findings: What the data says about optimal chunk size

Greg Kamradt's 2024 study (5 Levels of Text Splitting) tested chunk sizes from 50 to 2000 tokens across multiple document types. Key findings:

  • 200-500 tokens is the sweet spot for FAQ docs and wikis (single-hop queries)
  • 400-800 tokens works best for technical docs (API references, how-tos) where examples span multiple paragraphs
  • Overlap matters: 10-20% overlap improves retrieval by 8-12% by ensuring boundary context isn't lost
  • Semantic chunking beats fixed-size by 15% on structured docs but shows no advantage on unstructured text

OpenAI's RAG cookbook (2024) recommends starting with 400 tokens + 20% overlap for general use cases, then tuning based on your test set.

Implementation guide: Test-driven chunking

Don't guess. Test your way to the right strategy:

  1. Step 1: Baseline with fixed-size. Start with 400 tokens, 20% overlap. Chunk your pilot corpus (1,000 docs). Index and retrieve on 50 test queries. Measure top-5 precision and recall.
  2. Step 2: Try semantic chunking. Use LangChain's RecursiveCharacterTextSplitter with sentence and paragraph boundaries. Re-index. Measure again. If precision improves by 10%+, adopt semantic chunking. Otherwise, stick with fixed-size.
  3. Step 3: Tune chunk size. Test 200, 400, 600, and 800 tokens. Plot precision/recall curves. Pick the size that maximizes F1 score on your test set. For most teams, this lands between 300 and 600 tokens (see the sweep sketch after this list).
  4. Step 4: Experiment with overlap. Test 0%, 10%, 20%, and 30% overlap. Measure retrieval quality. Overlap helps, but beyond 20% you get diminishing returns and increased index size.
  5. Step 5 (optional): Hierarchical chunking. If your docs are long (5k+ tokens) and queries require broad context, implement parent-child chunking. Measure if it outperforms flat chunking by 15%+. If not, skip it.
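Steps 3 and 4 boil down to a grid sweep. A minimal sketch, assuming the fixed_size_chunks helper from earlier plus hypothetical build_index and evaluate functions wired to your embedding store and test harness:

```python
# Grid sweep over chunk size and overlap (steps 3-4), scored by mean F1.
from itertools import product

CHUNK_SIZES = [200, 400, 600, 800]
OVERLAPS = [0.0, 0.1, 0.2, 0.3]

def sweep(corpus, test_queries, ground_truth):
    results = {}
    for size, overlap in product(CHUNK_SIZES, OVERLAPS):
        chunks = [c for doc in corpus for c in fixed_size_chunks(doc, size, overlap)]
        index = build_index(chunks)                       # hypothetical: embed + load vector store
        f1 = evaluate(index, test_queries, ground_truth)  # hypothetical: harness returns mean F1
        results[(size, overlap)] = f1
    best = max(results, key=results.get)
    return best, results
```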

Testing approach: Build a chunk quality harness

Your chunk strategy is only as good as your test set. Build a harness (a minimal metric sketch follows the list):

  • Collect 50-100 real user queries
  • For each query, label which doc sections should appear in top-5 results (ground truth)
  • Chunk your corpus with different strategies
  • Retrieve for each query and compute precision@5, recall@5, and F1
  • Pick the strategy that maximizes F1
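Here's a minimal metric computation for that harness; it assumes retrieved results are an ordered list of chunk or section ids and the ground truth is a set of ids, with the function name purely illustrative.

```python
# Precision@k, recall@k, and F1 for a single query against labeled ground truth.
def precision_recall_f1(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5):
    top_k = set(retrieved_ids[:k])
    hits = len(top_k & relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Average these over your 50-100 labeled queries, then compare strategies on mean F1.
```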

Re-run this harness quarterly. As your corpus grows and query patterns shift, optimal chunk size may drift.

Common mistakes and how to avoid them

Mistake 1: One size fits all. Don't use the same chunk size for Slack threads (200 tokens ideal) and API docs (600 tokens ideal). Segment your corpus by document type and chunk accordingly.
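One way to implement this is a small per-type config, again reusing the fixed_size_chunks sketch; the type names and values below are illustrative.

```python
# Per-document-type chunking: segment the corpus and chunk each type with its own settings.
CHUNK_CONFIG = {
    "slack_thread": {"chunk_size": 200, "overlap_pct": 0.1},
    "api_doc":      {"chunk_size": 600, "overlap_pct": 0.2},
    "wiki_page":    {"chunk_size": 400, "overlap_pct": 0.2},
}

def chunk_document(doc_type: str, text: str) -> list[str]:
    cfg = CHUNK_CONFIG.get(doc_type, {"chunk_size": 400, "overlap_pct": 0.2})
    return fixed_size_chunks(text, cfg["chunk_size"], cfg["overlap_pct"])
```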

Mistake 2: Ignoring metadata. Prepend chunk metadata (doc title, section heading, date) before embedding. "The policy changed in Q3" is useless without knowing which policy. Prepend: "[Refund Policy - Enterprise Section] The policy changed in Q3."
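A minimal sketch of that prepending step at indexing time; the helper name and fields are illustrative.

```python
# Prepend doc title and section heading so each chunk carries its own context.
def chunk_with_metadata(chunk: str, doc_title: str, section: str) -> str:
    return f"[{doc_title} - {section}] {chunk}"

text_to_embed = chunk_with_metadata(
    "The policy changed in Q3.", "Refund Policy", "Enterprise Section"
)
# -> "[Refund Policy - Enterprise Section] The policy changed in Q3."
```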

Mistake 3: Skipping overlap. Zero overlap means boundary sentences get split awkwardly. Always use 10-20% overlap to preserve cross-boundary context.

Mistake 4: No A/B testing. Don't assume 400 tokens is optimal because a blog post said so. Test on your data. Build the harness. Let the numbers decide.

Bottom line: Chunking is a lever you can pull to improve RAG accuracy by 20-30% without changing models or embeddings. Start with 400 tokens + 20% overlap. Measure. Tune. Repeat.