
Parent Document Retrieval: Context-Aware Chunking

Small chunks match queries better. Large chunks provide context. Parent document retrieval gives you both—here's the implementation guide.

Jorgo Bardho

Founder, Thread Transfer

July 29, 2025 · 14 min read
RAG · parent document · chunking · context · retrieval
[Figure: Parent document retrieval architecture]

Small chunks maximize retrieval precision. Large chunks preserve context. Parent document retrieval gives you both: retrieve with small chunks, return with full parent documents. LangChain's ParentDocumentRetriever and similar patterns improve RAG accuracy by 15-30% on queries requiring broad context, while maintaining precise matching. This guide covers sentence window retrieval, auto-merging retrieval, and small-to-big chunking strategies.

The chunking dilemma: Precision vs context

Standard RAG chunks documents into 200-500 token pieces. This creates a trade-off:

  • Small chunks (100-200 tokens): Precise retrieval. Embedding captures specific concepts. But context is lost—a chunk about "Q3 changes" doesn't include the background from Q2.
  • Large chunks (500-1000 tokens): Rich context. But embedding becomes generic, mixing multiple concepts. Retrieval precision drops—irrelevant content dilutes the match.

In production, small chunks give 70-80% retrieval precision but 55-65% answer accuracy (context missing). Large chunks give 55-65% precision but 65-75% accuracy (more context, but noisier retrieval). Parent document retrieval solves this by retrieving small, returning big.

Parent document retrieval: How it works

The pattern has three components:

  1. Split documents hierarchically: Create small "child" chunks (50-200 tokens) linked to large "parent" chunks (500-1500 tokens) or full documents.
  2. Index child chunks: Embed and store only the small chunks in your vector database. These give precise retrieval.
  3. Return parent chunks: At query time, retrieve top-k child chunks. Then replace them with their parent chunks before passing to the LLM. The LLM sees full context, not fragments.

This decouples retrieval granularity (small = precise) from LLM input granularity (large = contextual).
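
The mechanics fit in a few lines of plain Python. Here is a minimal sketch of the child-to-parent swap, using a toy keyword-overlap score as a stand-in for real embedding search (the data and scoring function are illustrative, not from any library):

parents = {
    "p1": "Our refund policy changed in Q3 2024. Previously, EU customers had 14 days. "
          "The new policy extends this to 30 days. US customers remain at 14 days.",
}

children = [
    {"id": "c1", "parent_id": "p1", "text": "Previously, EU customers had 14 days."},
    {"id": "c2", "parent_id": "p1", "text": "The new policy extends this to 30 days."},
]

def score(query: str, text: str) -> float:
    """Toy lexical overlap; a real system compares embeddings instead."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # 1. Rank the small child chunks (precise matching)
    ranked = sorted(children, key=lambda c: score(query, c["text"]), reverse=True)[:top_k]
    # 2. Swap each child for its parent, de-duplicated (broad context for the LLM)
    seen, context = set(), []
    for child in ranked:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            context.append(parents[child["parent_id"]])
    return context

print(retrieve("What's the new EU refund policy?"))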

Three variants: Sentence window, auto-merging, small-to-big

1. Sentence window retrieval

Index individual sentences. When a sentence matches, return it plus N surrounding sentences (the "window").

Example:

Document:

[1] Our refund policy changed in Q3 2024.
[2] Previously, EU customers had 14 days.
[3] The new policy extends this to 30 days.
[4] US customers remain at 14 days.

Query: "What's the new EU refund policy?"

  • Retrieved sentence: [3] "The new policy extends this to 30 days."
  • Returned window (N=1): [2] + [3] + [4]

The LLM now sees context (Q3 change, previous policy) while retrieval targeted the precise sentence.

Best for: Documents with clear sentence-level semantics (policies, technical docs)
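
The window expansion itself is easy to see in isolation. A minimal sketch, again with toy lexical matching standing in for embedding search:

sentences = [
    "Our refund policy changed in Q3 2024.",
    "Previously, EU customers had 14 days.",
    "The new policy extends this to 30 days.",
    "US customers remain at 14 days.",
]

def best_match(query: str) -> int:
    """Toy lexical match; a real system would rank sentences by embedding similarity."""
    q = set(query.lower().split())
    return max(range(len(sentences)), key=lambda i: len(q & set(sentences[i].lower().split())))

def window(query: str, n: int = 1) -> str:
    i = best_match(query)                               # retrieval targets one precise sentence
    lo, hi = max(0, i - n), min(len(sentences), i + n + 1)
    return " ".join(sentences[lo:hi])                   # the LLM sees the sentence plus N neighbors

print(window("What's the new EU refund policy?", n=1))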

2. Auto-merging retrieval

Index small chunks. If multiple child chunks from the same parent are retrieved (e.g., 3 out of 5 children in top-10 results), automatically "merge up" and return the parent chunk instead.

Example:

Parent chunk: "Q3 2024 Refund Policy Update" (500 tokens)

Child chunks:

  • Child 1: "EU customers now have 30 days..."
  • Child 2: "Previous policy was 14 days..."
  • Child 3: "US policy unchanged at 14 days..."

Query: "Explain the Q3 refund policy changes"

  • Retrieved: Child 1 (rank 2), Child 2 (rank 5), Child 3 (rank 8)
  • Action: 3/3 children retrieved → merge to parent
  • Returned: Full "Q3 2024 Refund Policy Update" parent chunk

Best for: Multi-topic documents where related chunks signal "the whole section is relevant"
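
The merge decision boils down to a per-parent ratio check. A minimal sketch, assuming each child chunk records its parent id and you already have a ranked hit list:

from collections import Counter

# How many child chunks each parent was split into (illustrative data)
children_per_parent = {"q3_refund_update": 3}

# Ranked retrieval hits as (child_id, parent_id) pairs
hits = [("c1", "q3_refund_update"), ("c2", "q3_refund_update"), ("c3", "q3_refund_update")]

def parents_to_merge(hits, threshold: float = 0.5) -> list[str]:
    """Return parent ids whose share of retrieved children exceeds the threshold."""
    counts = Counter(parent_id for _, parent_id in hits)
    return [
        parent_id
        for parent_id, hit_count in counts.items()
        if hit_count / children_per_parent[parent_id] > threshold
    ]

print(parents_to_merge(hits))  # ['q3_refund_update'] -> return the parent chunk, not the fragments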

3. Small-to-big chunking

Similar to sentence window, but with explicit parent-child hierarchy. Small chunks (100 tokens) link to medium chunks (500 tokens) which link to full documents (2000+ tokens). You choose the parent level at query time.

Example hierarchy:

  Document: "2024 Annual Policy Updates" (3000 tokens)
  └── Section: "Q3 Refund Policy Changes" (500 tokens)
      └── Paragraph: "EU customers..." (100 tokens)

Query strategy:

  • For simple queries: Retrieve paragraph, return section
  • For complex queries: Retrieve paragraph, return full document

Best for: Hierarchical documents (reports, manuals, books)
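
A minimal sketch of choosing the parent level at query time; the hierarchy mapping and the complexity heuristic below are illustrative placeholders, not part of any framework:

# Each leaf chunk links upward through the hierarchy (illustrative ids and placeholder text)
hierarchy = {
    "para_eu_refund": {
        "text": "EU customers now have 30 days to request a refund.",
        "section": "Q3 Refund Policy Changes (the full 500-token section text)",
        "document": "2024 Annual Policy Updates (the full 3000-token document text)",
    }
}

def choose_parent(leaf_id: str, query: str) -> str:
    """Crude complexity heuristic: long or multi-part queries get the full document."""
    is_complex = len(query.split()) > 12 or " and " in query.lower()
    return hierarchy[leaf_id]["document" if is_complex else "section"]

simple_q = "What's the new EU refund policy?"
complex_q = "Compare the Q3 refund changes for EU and US customers and explain why they differ"

print(choose_parent("para_eu_refund", simple_q))   # returns the section text
print(choose_parent("para_eu_refund", complex_q))  # returns the full document text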

Implementation: LangChain ParentDocumentRetriever

LangChain's ParentDocumentRetriever implements this pattern natively:

Step 1: Install dependencies

pip install langchain langchain-community chromadb openai tiktoken

Step 2: Set up parent-child structure

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

# Sample documents
documents = [
    Document(
        page_content="""Our refund policy changed in Q3 2024. Previously, EU customers had 14 days.
The new policy extends this to 30 days. US customers remain at 14 days.
This change reflects feedback from European support tickets.""",
        metadata={"source": "policy_updates_2024.pdf", "section": "Q3"}
    ),
]

# Child splitter (small chunks for retrieval)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Parent splitter (large chunks for LLM context)
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Vector store (stores child chunks)
vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small")
)

# Document store (stores parent chunks)
store = InMemoryStore()

# Create retriever
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents (creates child/parent hierarchy)
retriever.add_documents(documents)

print(f"Indexed {len(documents)} parent documents with child chunks")

Step 3: Query with parent document retrieval

# Retrieve child chunks, return parent chunks
results = retriever.get_relevant_documents("What's the new EU refund policy?")

for i, doc in enumerate(results):
    print(f"Result {i+1}:")
    print(f"Content: {doc.page_content[:200]}...")
    print(f"Metadata: {doc.metadata}\n")

# Expected: Returns full parent chunk (500 tokens) even though child chunk (100 tokens) was matched

Implementation: LlamaIndex SentenceWindowNodeParser

LlamaIndex provides sentence window retrieval via SentenceWindowNodeParser:

from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.embeddings.openai import OpenAIEmbedding

# Parse documents into sentence windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Include 3 sentences before/after match
    window_metadata_key="window",
    original_text_metadata_key="original_text"
)

documents = [Document(text="""Our refund policy changed in Q3 2024. Previously, EU customers had 14 days.
The new policy extends this to 30 days. US customers remain at 14 days.""")]

nodes = node_parser.get_nodes_from_documents(documents)

# Index sentence nodes
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
index = VectorStoreIndex(nodes, embed_model=embed_model)

# Query with window expansion
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"  # Replace matched sentence with window
)

query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[postprocessor]
)

response = query_engine.query("What's the new EU refund policy?")
print(response.response)
# Returns sentence + 3-sentence window for context

Auto-merging retrieval with LlamaIndex

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Create hierarchical chunks (3 levels)
node_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # Document → Section → Paragraph
)

nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# Store the full hierarchy so parent chunks can be looked up at query time
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# Build the index over the leaf (smallest) chunks only
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context, embed_model=embed_model)

# Auto-merging retriever
base_retriever = index.as_retriever(similarity_top_k=10)
retriever = AutoMergingRetriever(
    base_retriever,
    storage_context=storage_context,
    simple_ratio_thresh=0.5  # Merge if >50% of a parent's child chunks are retrieved
)

# Query
results = retriever.retrieve("Explain the Q3 refund policy changes")
# Automatically merges child chunks to parent if multiple children match

Benchmarks: Parent retrieval on Thread Transfer data

We tested parent document retrieval on 250 customer support conversations (Thread Transfer bundles):

Method                          | Retrieval Precision | Answer Accuracy | Avg Latency
Small chunks only (100 tokens)  | 78%                 | 62%             | 420ms
Large chunks only (500 tokens)  | 61%                 | 71%             | 480ms
Parent document (100→500)       | 76%                 | 79%             | 540ms
Sentence window (N=3)           | 74%                 | 77%             | 510ms
Auto-merging (3 levels)         | 72%                 | 81%             | 620ms

Key finding: the parent-based methods keep retrieval precision in the 72-76% range while lifting answer accuracy to 77-81%, close to the best of both worlds. Auto-merging wins on accuracy but adds latency. Sentence window is fastest.

Production strategies: Choosing chunk sizes

For sentence window retrieval:

  • Window size 1-2: Technical docs with dense information
  • Window size 3-5: Policy documents, how-to guides
  • Window size 7-10: Narrative content (meeting notes, case studies)

For parent document retrieval:

  • Child 50-100 / Parent 300-500: FAQs, product specs
  • Child 100-200 / Parent 500-1000: Support articles, documentation
  • Child 200-300 / Parent 1000-2000: Long-form content (reports, research)

For auto-merging retrieval:

  • 3-level hierarchy: Documents → Sections → Paragraphs (most common)
  • 4-level hierarchy: Books → Chapters → Sections → Paragraphs
  • Merge threshold 0.3-0.5: Lenient (merging triggers more easily, since a smaller share of a parent's children needs to match)
  • Merge threshold 0.5-0.7: Strict (more of a parent's children must be retrieved before merging)
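
These recommendations translate directly into parser settings. An illustrative instantiation using the LlamaIndex parsers from the sections above (the specific values are the guideline numbers, not tuned constants):

from llama_index.core.node_parser import HierarchicalNodeParser, SentenceWindowNodeParser

# Policy documents / how-to guides: mid-sized sentence window
policy_parser = SentenceWindowNodeParser.from_defaults(window_size=4)

# Books and long manuals: 4-level hierarchy for auto-merging
book_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[4096, 2048, 512, 128]  # Book → Chapter → Section → Paragraph
)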

Thread Transfer integration: Conversation parent chunks

Thread Transfer bundles are perfect for parent document retrieval:

  • Parent chunk: Full conversation thread (500-2000 tokens)
  • Child chunks: Individual messages or decision points (50-200 tokens)
  • Query strategy: Retrieve relevant messages, return full conversation thread for context

Example structure:

# Export Thread Transfer bundle as JSON
bundle = {
    "thread_id": "support-ticket-1234",
    "messages": [
        {"speaker": "Customer", "text": "Billing issue with invoice #5678"},
        {"speaker": "Agent", "text": "Let me check that for you..."},
        {"speaker": "Agent", "text": "Found duplicate charge. Refunding now."},
    ],
    "metadata": {"resolved": True, "tags": ["billing", "refund"]}
}

# Child chunks = individual messages
# Parent chunk = full thread
# Query: "How was the billing issue resolved?"
# → Retrieves message 3, returns full thread
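
One way to wire this up, reusing the LangChain components from the ParentDocumentRetriever section; the per-message indexing below is a sketch built around the bundle above, not part of any Thread Transfer SDK:

from langchain.storage import InMemoryStore
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
message_store = Chroma(collection_name="thread_messages", embedding_function=embeddings)
thread_store = InMemoryStore()

# Child chunks: one document per message, tagged with its thread id
children = [
    Document(
        page_content=msg["text"],
        metadata={"thread_id": bundle["thread_id"], "speaker": msg["speaker"]},
    )
    for msg in bundle["messages"]
]
message_store.add_documents(children)

# Parent chunk: the full thread, stored once under its thread id
full_thread = "\n".join(f'{msg["speaker"]}: {msg["text"]}' for msg in bundle["messages"])
thread_store.mset([(bundle["thread_id"], Document(page_content=full_thread, metadata=bundle["metadata"]))])

def retrieve_threads(query: str, k: int = 4) -> list[Document]:
    """Match individual messages, then return the full parent threads."""
    hits = message_store.similarity_search(query, k=k)
    thread_ids = list(dict.fromkeys(hit.metadata["thread_id"] for hit in hits))  # de-dupe, keep rank order
    return [doc for doc in thread_store.mget(thread_ids) if doc is not None]

print(retrieve_threads("How was the billing issue resolved?")[0].page_content)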

Combining parent retrieval with re-ranking

For maximum accuracy, chain parent document retrieval with re-ranking:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Parent document retriever
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add Cohere re-ranker on parent chunks
compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=parent_retriever
)

# Query
docs = compression_retriever.get_relevant_documents(
    "What changed in the Q3 refund policy for EU customers?"
)
# 1. Retrieves child chunks
# 2. Returns parent chunks
# 3. Re-ranks parent chunks
# → Top 3 most relevant parent documents

Common pitfalls and fixes

Pitfall 1: Parent chunks too large

If parent chunks are 2000+ tokens, you'll hit LLM context limits when retrieving multiple parents. Fix: Keep parent chunks under 1500 tokens or reduce similarity_top_k.

Pitfall 2: Child chunks too small

Single-sentence child chunks (20-30 tokens) create noisy embeddings. Fix: Use 50-100 token minimums for child chunks. Sentence window retrieval handles this better than fixed-size splitting.

Pitfall 3: Not tuning merge thresholds

Auto-merging with wrong thresholds either merges too aggressively (returns entire documents) or never merges (returns fragments). Fix: Benchmark on 20-30 test queries. Adjust simple_ratio_thresh until you hit 75%+ accuracy.
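
A rough tuning harness, assuming a small list of (query, expected phrase) pairs plus the base_retriever and storage_context from the auto-merging example above; the substring check is a crude accuracy proxy, not a real evaluation metric:

from llama_index.core.retrievers import AutoMergingRetriever

# Illustrative test set: (query, phrase that should appear in the retrieved context)
test_cases = [
    ("What's the new EU refund policy?", "30 days"),
    ("Explain the Q3 refund policy changes", "Q3 2024"),
]

def context_hit_rate(retriever, cases) -> float:
    """Fraction of test queries whose expected phrase shows up in the retrieved context."""
    hits = 0
    for query, expected in cases:
        context = " ".join(node.get_content() for node in retriever.retrieve(query))
        hits += expected.lower() in context.lower()
    return hits / len(cases)

# Sweep the merge threshold and keep the best-scoring value
for thresh in (0.3, 0.4, 0.5, 0.6, 0.7):
    candidate = AutoMergingRetriever(base_retriever, storage_context, simple_ratio_thresh=thresh)
    print(f"thresh={thresh}: hit rate {context_hit_rate(candidate, test_cases):.2f}")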

2025 developments: Adaptive parent sizing

Emerging techniques use LLMs to dynamically choose parent chunk size based on query complexity:

  • Simple queries: Return small parents (300 tokens)
  • Complex queries: Return large parents (1000+ tokens)
  • Multi-hop queries: Return full documents

This adaptive approach improves latency on simple queries by 40% while maintaining accuracy on complex ones. Expect this to ship in LangChain 0.3+ and LlamaIndex 0.12+ in late 2025.
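
Nothing like this ships yet, but the routing idea is simple. A hypothetical sketch, with the heuristic and level names purely illustrative (a production version would likely use an LLM classifier):

def pick_parent_level(query: str) -> str:
    """Hypothetical heuristic router for adaptive parent sizing."""
    words = query.lower().split()
    if any(w in words for w in ("compare", "versus", "why", "trace")):
        return "full_document"    # multi-hop questions get the whole document
    if len(words) > 15:
        return "large_parent"     # complex queries: ~1000+ token parents
    return "small_parent"         # simple queries: ~300 token parents

print(pick_parent_level("What's the new EU refund policy?"))   # small_parent
print(pick_parent_level("Compare the EU and US refund policies and explain why they diverged"))  # full_document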

Hybrid approach: Parent retrieval + recursive retrieval

Combine parent document retrieval with recursive retrieval for ultimate context control:

  1. First pass: Retrieve child chunks, return parent chunks
  2. Second pass: If parent chunks reference other sections, recursively retrieve those
  3. Synthesize: Combine all parent chunks into final context

This handles cross-document references while preserving granular retrieval precision.
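
A sketch of the two-pass idea on a toy corpus; the cross-reference convention (parents listing related section ids in a references field) is an assumption about how your documents are structured, not a library feature:

# Toy corpus: parent chunks can reference other sections by id (an assumed metadata convention)
sections = {
    "q3_refunds": {
        "id": "q3_refunds",
        "text": "Q3 2024 refund changes... see the Q2 baseline policy for context.",
        "references": ["q2_baseline"],
    },
    "q2_baseline": {
        "id": "q2_baseline",
        "text": "Q2 baseline: EU customers had 14 days.",
        "references": [],
    },
}

def first_pass(query: str) -> list[dict]:
    """Stand-in for child-chunk retrieval followed by the parent swap."""
    return [sections["q3_refunds"]]

def recursive_retrieve(query: str, max_depth: int = 2) -> list[dict]:
    seen, frontier, context = set(), first_pass(query), []
    for _ in range(max_depth):
        next_frontier = []
        for doc in frontier:
            if doc["id"] in seen:
                continue
            seen.add(doc["id"])
            context.append(doc)
            # Follow explicit cross-references to pull in related parent chunks
            next_frontier += [sections[ref] for ref in doc["references"]]
        frontier = next_frontier
    return context

print([doc["id"] for doc in recursive_retrieve("Explain the Q3 refund policy changes")])
# ['q3_refunds', 'q2_baseline']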

Final recommendations

Add parent document retrieval to any RAG system dealing with multi-paragraph content. Start with LangChain's ParentDocumentRetriever (100-token children, 500-token parents). Measure answer accuracy on 50+ test queries—you should see 15-25% gains with minimal latency overhead.

For conversation data (Thread Transfer bundles), use message-level child chunks with full-thread parent chunks. For technical docs, use sentence window retrieval with N=3-5. For hierarchical documents (reports, books), use auto-merging with 3-level hierarchy.

Always combine with re-ranking for production. Parent retrieval fixes the chunking problem; re-ranking ensures you surface the best parent chunks. Together, they push RAG accuracy to 80-85% on complex queries.