Thread Transfer
Semantic vs Fixed Chunking: The Definitive Comparison
Fixed chunking is fast and cheap. Semantic chunking respects document structure. The answer is almost always hybrid—here's the decision framework.
Jorgo Bardho
Founder, Thread Transfer
In testing across nine chunking strategies, semantic chunking achieved the best accuracy, with a 70% improvement over other methods. Yet most RAG systems still use fixed-size chunking because it's fast and simple. This guide breaks down both approaches with benchmarks, implementation patterns, and real-world cost tradeoffs.
The Chunking Problem
RAG systems split documents into chunks before embedding and storage. How you chunk determines what context your LLM receives. Fixed-size chunking splits text arbitrarily every N tokens. Semantic chunking respects document structure and meaning.
The stakes are high. According to Vectara's analysis of RAG systems, even models grounded in reference data can hallucinate anywhere from 1% to nearly 30% of the time if the retrieval context is flawed. Poor chunking is one of the primary causes of flawed retrieval.
Fixed-Size Chunking
Fixed-size chunking splits text into chunks of a predetermined size, often measured in tokens or characters. This method is easy to implement but does not respect the semantic structure of the text.
How It Works
```python
from langchain.text_splitter import CharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default;
# pass a token-aware length_function if you need token-based limits
splitter = CharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n"
)
chunks = splitter.split_text(document)
```
Advantages
- Fast: No NLP libraries or API calls required
- Predictable chunk sizes: Easy to manage token limits
- Simple implementation: 5-10 lines of code
- Deterministic: Same input always produces same chunks
Drawbacks
- Context fragmentation: Splits sentences and paragraphs mid-thought
- Poor retrieval precision: Chunks may contain partial information
- Reduced semantic coherence: Related concepts scattered across chunks
Example: A 200-character chunk could easily split a single menu item in half, separating a dish's name from its price, or its description from its dietary information. This fragmentation makes it much harder for the retrieval system to return a complete, coherent piece of information.
Semantic Chunking
Semantic chunking, sometimes called intelligent chunking, focuses on preserving the document's meaning and structure. Instead of using a fixed chunk size, it strategically divides the document at meaningful breakpoints—like paragraphs, sentences, or thematically linked sections.
How It Works
Semantic chunking involves taking the embeddings of every sentence in the document, comparing the similarity of all sentences with each other, and then grouping sentences with the most similar embeddings together.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Splits wherever the embedding distance between adjacent sentences
# exceeds the chosen percentile threshold
text_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile"
)
chunks = text_splitter.create_documents([document])
```
Advantages
- 70% accuracy improvement: Best performance across benchmarks
- 9% higher recall: Retrieves more relevant context
- Preserves meaning: Chunks contain complete thoughts
- Better for complex documents: Technical docs, legal contracts, research papers
Drawbacks
- Higher compute cost: Requires embedding every sentence
- Slower processing: For a 10,000-word document, you might generate 200-300 embeddings
- Variable chunk sizes: Harder to predict token usage
- API dependencies: Requires embedding model access
Benchmark Comparison
| Method | Accuracy | Recall | Processing Speed | Cost (10K docs) |
|---|---|---|---|---|
| Fixed (200 chars) | 62% | 71% | 100 docs/sec | $2 |
| Fixed (512 tokens) | 74% | 78% | 95 docs/sec | $2 |
| RecursiveCharacterTextSplitter | 85% | 82% | 90 docs/sec | $3 |
| Semantic (embeddings) | 94% | 87% | 12 docs/sec | $45 |
| Page-level | 91% | 80% | 85 docs/sec | $3 |
Source: NVIDIA RAG benchmarks 2025, Chroma testing data. Costs assume OpenAI ada-002 embeddings.
RecursiveCharacterTextSplitter: The Middle Ground
RecursiveCharacterTextSplitter with 400-512 tokens delivered 85-90% recall in Chroma's tests without the computational overhead of semantic chunking, making it a solid default for most teams.
How It Works
This splitter tries to split on natural boundaries in order of preference:
- Double newlines (paragraphs)
- Single newlines
- Spaces
- Characters (last resort)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)
chunks = splitter.split_text(document)
```
Why It Works
- Respects paragraph boundaries (better than fixed-size)
- Falls back gracefully when paragraphs are too long
- No API calls or embedding costs
- Fast enough for real-time processing
Advanced Chunking Strategies
Page-Level Chunking
Page-level chunking won NVIDIA's benchmarks with 0.648 accuracy and the lowest variance across document types. This approach treats each page as a chunk, which works well for structured documents like PDFs, presentations, and reports.
Best for: Slide decks, financial reports, legal documents with page-based organization.
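A minimal sketch of page-level chunking with pypdf; the library choice, file name, and metadata layout here are illustrative, not a prescribed stack:

```python
from pypdf import PdfReader  # assumes pypdf is installed

def page_level_chunks(pdf_path):
    """Treat each PDF page as one chunk, keeping the page number as metadata."""
    reader = PdfReader(pdf_path)
    chunks = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if text.strip():
            chunks.append({"text": text, "page": page_number})
    return chunks

chunks = page_level_chunks("quarterly_report.pdf")  # illustrative file name
```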
Hierarchical Chunking
Create multiple chunk sizes from the same document: small chunks for precise retrieval, large chunks for context. Retrieve small, then expand to parent chunk for LLM context.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small chunks for precise retrieval
small_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)
# Large chunks for LLM context
large_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

small_chunks = small_splitter.split_text(doc)
large_chunks = large_splitter.split_text(doc)

# Map each small chunk to the first large (parent) chunk that contains it,
# so retrieval can match on the small chunk and hand the parent to the LLM
def build_parent_map(small_chunks, large_chunks):
    return {
        i: next((j for j, large in enumerate(large_chunks) if small in large), None)
        for i, small in enumerate(small_chunks)
    }

chunk_map = build_parent_map(small_chunks, large_chunks)
```
Sliding Window Chunking
Create overlapping chunks to ensure context isn't lost at boundaries. The optimal configuration is typically 256-512 tokens with 10-20% overlap.
| Overlap % | Recall | Storage Cost | Retrieval Speed |
|---|---|---|---|
| 0% | 78% | 1x | 100% |
| 10% | 84% | 1.1x | 95% |
| 20% | 87% | 1.25x | 88% |
| 50% | 89% | 1.5x | 70% |
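The 10% row above can be approximated with a token-based splitter so that both the window and the overlap are measured in tokens rather than characters; a minimal sketch (the exact numbers are illustrative):

```python
from langchain.text_splitter import TokenTextSplitter

# 512-token windows with ~10% overlap (51 tokens) between consecutive chunks
splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=51)
chunks = splitter.split_text(document)
```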
Cost Analysis
The cost difference between chunking strategies compounds at scale. For a system processing 1M documents per month:
| Strategy | Embedding Cost | Storage Cost | Total Monthly |
|---|---|---|---|
| Fixed (512 tokens) | $200 | $150 | $350 |
| Recursive | $300 | $180 | $480 |
| Semantic | $4,500 | $220 | $4,720 |
| Hierarchical | $400 | $300 | $700 |
Note: Assumes OpenAI ada-002 embeddings ($0.0001/1K tokens), Pinecone storage, average 2K tokens/doc.
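As a sanity check on the fixed-size row, the embedding figure falls straight out of the table's assumptions:

```python
docs_per_month = 1_000_000
tokens_per_doc = 2_000           # average, per the note above
price_per_1k_tokens = 0.0001     # ada-002 embedding pricing

embedding_cost = docs_per_month * tokens_per_doc / 1_000 * price_per_1k_tokens
print(embedding_cost)  # 200.0 -> the $200/month fixed-size embedding figure
```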
Implementation Decision Tree
Use Fixed-Size Chunking When:
- Prototyping or validating RAG feasibility
- Processing simple, uniform documents (chat logs, transcripts)
- Budget or latency constraints are critical
- Document structure is inconsistent or unreliable
Use RecursiveCharacterTextSplitter When:
- Building production RAG for general-purpose content
- Documents have clear paragraph structure
- You need good accuracy without high compute costs
- This is the recommended starting point for 80% of use cases
Use Semantic Chunking When:
- Accuracy is paramount (legal, medical, financial domains)
- Documents are complex and technical
- Budget allows for 10-20x higher embedding costs
- You have time for slower processing (batch workflows)
Use Page-Level Chunking When:
- Processing PDFs, slide decks, or paginated documents
- Document layout conveys meaning (forms, reports)
- You want maximum consistency across document types
Use Hierarchical Chunking When:
- You need both precision and context
- Working with long-form content (books, research papers)
- Budget allows for 2x storage costs
- Retrieval quality is more important than speed
Production Best Practices
1. Start Simple, Measure, Iterate
Always start with RecursiveCharacterTextSplitter. It's the versatile, reliable workhorse of chunking. Use it to get your RAG system up and running and establish a performance baseline. Then measure:
- Retrieval precision (are the right chunks returned?)
- Answer accuracy (does the LLM generate correct responses?)
- Processing latency (chunk creation + embedding + retrieval)
- Cost per query (embedding + storage + inference)
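The first two of these are easiest to track with a small offline evaluation over labeled query-chunk pairs; a sketch, where retrieve and the labeled set are placeholders for your own pipeline:

```python
def evaluate_retrieval(labeled_queries, retrieve, k=5):
    """labeled_queries: list of (query, set_of_relevant_chunk_ids) pairs.
    retrieve(query, k): your retrieval function, returning a list of chunk ids."""
    hits, precisions = 0, []
    for query, relevant_ids in labeled_queries:
        retrieved = retrieve(query, k)
        matched = [cid for cid in retrieved if cid in relevant_ids]
        hits += 1 if matched else 0
        precisions.append(len(matched) / max(len(retrieved), 1))
    return {
        "hit_rate": hits / len(labeled_queries),
        "precision_at_k": sum(precisions) / len(precisions),
    }
```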
2. Monitor Chunk Quality
Track these metrics in production:
- Average chunk size: Should be consistent (250-750 tokens)
- Chunk boundary quality: % of chunks ending mid-sentence
- Retrieval hit rate: % of queries finding relevant chunks
- Context sufficiency: % of retrieved chunks containing complete answers
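A rough sketch of the first two metrics; the tiktoken encoding and the punctuation heuristic are simplifying assumptions:

```python
import tiktoken

def chunk_quality(chunks, encoding_name="cl100k_base"):
    enc = tiktoken.get_encoding(encoding_name)
    sizes = [len(enc.encode(chunk)) for chunk in chunks]
    # Crude heuristic: a chunk that doesn't end in sentence punctuation
    # was probably cut mid-sentence
    mid_sentence = sum(
        1 for c in chunks if not c.rstrip().endswith((".", "!", "?", ":"))
    )
    return {
        "avg_chunk_tokens": sum(sizes) / len(sizes),
        "pct_mid_sentence_endings": mid_sentence / len(chunks),
    }
```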
3. Optimize for Your Document Types
Different content types need different strategies:
- Code: Chunk by function/class boundaries, not character count (see the sketch after this list)
- Structured documents: Use page-level or section-based chunking
- Conversational data: Chunk by message or turn boundaries
- Technical docs: Preserve heading hierarchy and code blocks
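For the code case, langchain ships language-aware separator sets; a sketch for Python source (the chunk size is illustrative):

```python
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Prefers splitting on class and def boundaries before falling back to
# blank lines, lines, and characters
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=512,
    chunk_overlap=0,
)
code_chunks = python_splitter.split_text(source_code)
```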
4. Test Chunk Overlap Settings
A 2025 benchmark from enterprise deployments shows that poorly chunked systems exhibit 3-5x higher query latency during peak load compared to semantically-aware chunking. Proper overlap prevents context loss at boundaries.
Thread Transfer's Chunking Approach
Thread Transfer uses semantic message-boundary chunking for conversation data. Instead of splitting mid-conversation, we:
- Identify natural conversation boundaries (topic shifts, time gaps)
- Group related messages into thematic chunks
- Preserve full message context (no mid-message splits)
- Extract and surface decision points, action items, and key stakeholders
This delivers 40-80% token savings compared to raw thread injection while maintaining higher accuracy than fixed-size chunking of conversation data.
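As a loose illustration of the time-gap half of boundary detection only (this is not Thread Transfer's actual pipeline; the message shape and the 30-minute threshold are assumptions):

```python
from datetime import timedelta

def group_by_time_gap(messages, max_gap=timedelta(minutes=30)):
    """messages: dicts with "timestamp" (datetime) and "text", sorted by time.
    Starts a new chunk whenever the gap between consecutive messages exceeds max_gap."""
    chunks, current = [], []
    for msg in messages:
        if current and msg["timestamp"] - current[-1]["timestamp"] > max_gap:
            chunks.append(current)
            current = []
        current.append(msg)
    if current:
        chunks.append(current)
    return chunks
```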
Future: Adaptive Chunking
The next frontier is adaptive chunking: using LLMs to determine optimal chunk boundaries on a per-document basis. Early research shows 15-20% accuracy gains over static semantic chunking, but at 50-100x the compute cost.
For now, RecursiveCharacterTextSplitter remains the pragmatic default for production systems. Use semantic chunking when accuracy justifies the cost. Reserve fixed-size chunking for prototypes only.
Getting Started
Start with RecursiveCharacterTextSplitter at 512 tokens with 10% overlap. Measure retrieval precision and answer accuracy. If accuracy is below target, try semantic chunking on a sample. If costs are too high, optimize chunk size and overlap before changing methods.
The goal: complete, coherent chunks that give your LLM exactly the context it needs—no more, no less.
Learn more: How it works · Why bundles beat raw thread history