Thread Transfer
Semantic vs Fixed Chunking: The Definitive Comparison
Fixed chunking is fast and cheap. Semantic chunking respects document structure. The answer is almost always hybrid—here's the decision framework.
Jorgo Bardho
Founder, Thread Transfer
In testing across nine chunking strategies, semantic chunking achieved the best accuracy, with a 70% improvement over other methods. Yet most RAG systems still use fixed-size chunking because it's fast and simple. This guide breaks down both approaches with benchmarks, implementation patterns, and real-world cost tradeoffs.
The Chunking Problem
RAG systems split documents into chunks before embedding and storage. How you chunk determines what context your LLM receives. Fixed-size chunking splits text arbitrarily every N tokens. Semantic chunking respects document structure and meaning.
The stakes are high. According to Vectara's analysis of RAG systems, even models grounded in reference data can hallucinate anywhere from 1% to nearly 30% of the time if the retrieval context is flawed. Poor chunking is one of the primary causes of flawed retrieval.
Fixed-Size Chunking
Fixed-size chunking splits text into chunks of a predetermined size, often measured in tokens or characters. This method is easy to implement but does not respect the semantic structure of the text.
How It Works
```python
from langchain.text_splitter import CharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default;
# pass a token-aware length_function if you need token-based limits
splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n"
)
chunks = splitter.split_text(document)
```
Advantages
- Fast: No NLP libraries or API calls required
- Predictable chunk sizes: Easy to manage token limits
- Simple implementation: 5-10 lines of code
- Deterministic: Same input always produces same chunks
Drawbacks
- Context fragmentation: Splits sentences and paragraphs mid-thought
- Poor retrieval precision: Chunks may contain partial information
- Reduced semantic coherence: Related concepts scattered across chunks
Example: A 200-character chunk could easily split a single menu item in half, separating a dish's name from its price, or its description from its dietary information. This fragmentation makes it much harder for the retrieval system to return a complete, coherent piece of information.
Semantic Chunking
Semantic chunking, sometimes called intelligent chunking, focuses on preserving the document's meaning and structure. Instead of using a fixed chunk size, it strategically divides the document at meaningful breakpoints—like paragraphs, sentences, or thematically linked sections.
How It Works
Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Splits wherever the embedding distance between adjacent sentences
# exceeds the chosen percentile threshold
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = text_splitter.create_documents([document])
```
Advantages
- 70% accuracy improvement: Best performance across benchmarks
- 9% higher recall: Retrieves more relevant context
- Preserves meaning: Chunks contain complete thoughts
- Better for complex documents: Technical docs, legal contracts, research papers
Drawbacks
- Higher compute cost: Requires embedding every sentence
- Slower processing: For a 10,000-word document, you might generate 200-300 embeddings
- Variable chunk sizes: Harder to predict token usage
- API dependencies: Requires embedding model access
Benchmark Comparison
| Method | Accuracy | Recall | Processing Speed | Cost (10K docs) |
|---|---|---|---|---|
| Fixed (200 chars) | 62% | 71% | 100 docs/sec | $2 |
| Fixed (512 tokens) | 74% | 78% | 95 docs/sec | $2 |
| RecursiveCharacterTextSplitter | 85% | 82% | 90 docs/sec | $3 |
| Semantic (embeddings) | 94% | 87% | 12 docs/sec | $45 |
| Page-level | 91% | 80% | 85 docs/sec | $3 |
Source: NVIDIA RAG benchmarks 2025, Chroma testing data. Costs assume OpenAI ada-002 embeddings.
RecursiveCharacterTextSplitter: The Middle Ground
RecursiveCharacterTextSplitter with 400-512 tokens delivered 85-90% recall in Chroma's tests without the computational overhead of semantic chunking, making it a solid default for most teams.
How It Works
This splitter tries to split on natural boundaries in order of preference:
- Double newlines (paragraphs)
- Single newlines
- Spaces
- Characters (last resort)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(document)
```
Why It Works
- Respects paragraph boundaries (better than fixed-size)
- Falls back gracefully when paragraphs are too long
- No API calls or embedding costs
- Fast enough for real-time processing
Advanced Chunking Strategies
Page-Level Chunking
Page-level chunking won NVIDIA's benchmarks with 0.648 accuracy and the lowest variance across document types. This approach treats each page as a chunk, which works well for structured documents like PDFs, presentations, and reports.
Best for: Slide decks, financial reports, legal documents with page-based organization.
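A minimal sketch of page-level chunking with pypdf; the library choice, file name, and metadata layout here are illustrative, not a prescribed stack:

```python
from pypdf import PdfReader  # assumes pypdf is installed

def page_level_chunks(pdf_path):
    """Treat each PDF page as one chunk, keeping the page number as metadata."""
    reader = PdfReader(pdf_path)
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append({"text": text, "page": page_number})
    return chunks

chunks = page_level_chunks("quarterly_report.pdf")  # illustrative file name
```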
Hierarchical Chunking
Create multiple chunk sizes from the same document: small chunks for precise retrieval, large chunks for context. Retrieve small, then expand to parent chunk for LLM context.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval
small_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)
# Large chunks for LLM context
large_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

small_chunks = small_splitter.split_text(doc)
large_chunks = large_splitter.split_text(doc)

# Map each small chunk to the first large (parent) chunk that contains it,
# so retrieval can match on the small chunk and hand the parent to the LLM
def build_parent_map(small_chunks, large_chunks):
    return {
        i: next((j for j, large in enumerate(large_chunks) if small in large), None)
        for i, small in enumerate(small_chunks)
    }

chunk_map = build_parent_map(small_chunks, large_chunks)
```
Sliding Window Chunking
Create overlapping chunks to ensure context isn't lost at boundaries. The optimal configuration is typically 256-512 tokens with 10-20% overlap.
| Overlap % | Recall | Storage Cost | Retrieval Speed |
|---|---|---|---|
| 0% | 78% | 1x | 100% |
| 10% | 84% | 1.1x | 95% |
| 20% | 87% | 1.25x | 88% |
| 50% | 89% | 1.5x | 70% |
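The 10% row above can be approximated with a token-based splitter so that both the window and the overlap are measured in tokens rather than characters; a minimal sketch (the exact numbers are illustrative):

```python
from langchain.text_splitter import TokenTextSplitter

# 512-token windows with ~10% overlap (51 tokens) between consecutive chunks
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=51)
chunks = splitter.split_text(document)
```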
Cost Analysis
The cost difference between chunking strategies compounds at scale. For a system processing 1M documents per month:
| Strategy | Embedding Cost | Storage Cost | Total Monthly |
|---|---|---|---|
| Fixed (512 tokens) | $200 | $150 | $350 |
| Recursive | $300 | $180 | $480 |
| Semantic | $4,500 | $220 | $4,720 |
| Hierarchical | $400 | $300 | $700 |
Note: Assumes OpenAI ada-002 embeddings ($0.0001/1K tokens), Pinecone storage, average 2K tokens/doc.
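As a sanity check on the fixed-size row, the embedding figure falls straight out of the table's assumptions:

```python
docs_per_month = 1_000_000
tokens_per_doc = 2_000           # average, per the note above
price_per_1k_tokens = 0.0001     # ada-002 embedding pricing

embedding_cost = docs_per_month * tokens_per_doc / 1_000 * price_per_1k_tokens
print(embedding_cost)  # 200.0 -> the $200/month fixed-size embedding figure
```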
Implementation Decision Tree
Use Fixed-Size Chunking When:
- Prototyping or validating RAG feasibility
- Processing simple, uniform documents (chat logs, transcripts)
- Budget or latency constraints are critical
- Document structure is inconsistent or unreliable
Use RecursiveCharacterTextSplitter When:
- Building production RAG for general-purpose content
- Documents have clear paragraph structure
- You need good accuracy without high compute costs
- This is the recommended starting point for 80% of use cases
Use Semantic Chunking When:
- Accuracy is paramount (legal, medical, financial domains)
- Documents are complex and technical
- Budget allows for 10-20x higher embedding costs
- You have time for slower processing (batch workflows)
Use Page-Level Chunking When:
- Processing PDFs, slide decks, or paginated documents
- Document layout conveys meaning (forms, reports)
- You want maximum consistency across document types
Use Hierarchical Chunking When:
- You need both precision and context
- Working with long-form content (books, research papers)
- Budget allows for 2x storage costs
- Retrieval quality is more important than speed
Production Best Practices
1. Start Simple, Measure, Iterate
Always start with RecursiveCharacterTextSplitter. It's the versatile, reliable workhorse of chunking. Use it to get your RAG system up and running and establish a performance baseline. Then measure:
- Retrieval precision (are the right chunks returned?)
- Answer accuracy (does the LLM generate correct responses?)
- Processing latency (chunk creation + embedding + retrieval)
- Cost per query (embedding + storage + inference)
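The first two of these are easiest to track with a small offline evaluation over labeled query-chunk pairs; a sketch, where retrieve and the labeled set are placeholders for your own pipeline:

```python
def evaluate_retrieval(labeled_queries, retrieve, k=5):
    """labeled_queries: list of (query, set_of_relevant_chunk_ids) pairs.
    retrieve(query, k): your retrieval function, returning a list of chunk ids."""
    hits, precisions = 0, []
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve(query, k)
        matched = [cid for cid in retrieved if cid in relevant_ids]
        hits += 1 if matched else 0
        precisions.append(len(matched) / max(len(retrieved), 1))
    return {
        "hit_rate": hits / len(labeled_queries),
        "precision_at_k": sum(precisions) / len(precisions),
    }
```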
2. Monitor Chunk Quality
Track these metrics in production:
- Average chunk size: Should be consistent (250-750 tokens)
- Chunk boundary quality: % of chunks ending mid-sentence
- Retrieval hit rate: % of queries finding relevant chunks
- Context sufficiency: % of retrieved chunks containing complete answers
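A rough sketch of the first two metrics; the tiktoken encoding and the punctuation heuristic are simplifying assumptions:

```python
import tiktoken

def chunk_quality(chunks, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    sizes = [len(enc.encode(chunk)) for chunk in chunks]
    # Crude heuristic: a chunk that doesn't end in sentence punctuation
    # was probably cut mid-sentence
    mid_sentence = sum(
        1 for c in chunks if not c.rstrip().endswith((".", "!", "?", ":"))
    )
    return {
        "avg_chunk_tokens": sum(sizes) / len(sizes),
        "pct_mid_sentence_endings": mid_sentence / len(chunks),
    }
```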
3. Optimize for Your Document Types
Different content types need different strategies:
- Code: Chunk by function/class boundaries, not character count (see the sketch after this list)
- Structured documents: Use page-level or section-based chunking
- Conversational data: Chunk by message or turn boundaries
- Technical docs: Preserve heading hierarchy and code blocks
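For the code case, langchain ships language-aware separator sets; a sketch for Python source (the chunk size is illustrative):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Prefers splitting on class and def boundaries before falling back to
# blank lines, lines, and characters
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=0,
)
code_chunks = python_splitter.split_text(source_code)
```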
4. Test Chunk Overlap Settings
A 2025 benchmark from enterprise deployments shows that poorly chunked systems exhibit 3-5x higher query latency during peak load compared to semantically-aware chunking. Proper overlap prevents context loss at boundaries.
Thread Transfer's Chunking Approach
Thread Transfer uses semantic message-boundary chunking for conversation data. Instead of splitting mid-conversation, we:
- Identify natural conversation boundaries (topic shifts, time gaps)
- Group related messages into thematic chunks
- Preserve full message context (no mid-message splits)
- Extract and surface decision points, action items, and key stakeholders
This delivers 40-80% token savings compared to raw thread injection while maintaining higher accuracy than fixed-size chunking of conversation data.
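As a loose illustration of the time-gap half of boundary detection only (this is not Thread Transfer's actual pipeline; the message shape and the 30-minute threshold are assumptions):

```python
from datetime import timedelta

def group_by_time_gap(messages, max_gap=timedelta(minutes=30)):
    """messages: dicts with "timestamp" (datetime) and "text", sorted by time.
    Starts a new chunk whenever the gap between consecutive messages exceeds max_gap."""
    chunks, current = [], []
    for msg in messages:
        if current and msg["timestamp"] - current[-1]["timestamp"] > max_gap:
            chunks.append(current)
            current = []
        current.append(msg)
    if current:
        chunks.append(current)
    return chunks
```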
Future: Adaptive Chunking
The next frontier is adaptive chunking: using LLMs to determine optimal chunk boundaries on a per-document basis. Early research shows 15-20% accuracy gains over static semantic chunking, but at 50-100x the compute cost.
For now, RecursiveCharacterTextSplitter remains the pragmatic default for production systems. Use semantic chunking when accuracy justifies the cost. Reserve fixed-size chunking for prototypes only.
Getting Started
Start with RecursiveCharacterTextSplitter at 512 tokens with 10% overlap. Measure retrieval precision and answer accuracy. If accuracy is below target, try semantic chunking on a sample. If costs are too high, optimize chunk size and overlap before changing methods.
The goal: complete, coherent chunks that give your LLM exactly the context it needs—no more, no less.
Learn more: How it works · Why bundles beat raw thread history