Thread Transfer
Context Injection Patterns for LLM Applications
Context ordering matters. System prompts first, retrieved context middle, user query last. But it's more nuanced than that—here's the production playbook.
Jorgo Bardho
Founder, Thread Transfer
Traditional RAG systems decided up front which chunks to put into context, injected them, and asked the model to reason over those chunks. Search was a one-shot operation. Modern context injection has evolved: dynamic retrieval, parametric knowledge injection, and faceted search are reshaping how we architect RAG systems in 2025.
The Evolution of Context Injection
RAG combines information retrieval with generation: a query is embedded, relevant chunks are fetched from a knowledge store (often a vector database), and those snippets are injected into the model prompt before generation. This lets models reflect authoritative, up-to-date sources without expensive retraining and reduces hallucinations by grounding outputs in verifiable context.
But how you inject context matters as much as what you inject. The shift from thinking about retrieval to thinking about context orchestration represents a fundamental change in how we architect AI systems.
Pattern 1: Direct Context Injection
The baseline strategy for context management should be to put all required context directly in the LLM context window. Frontier LLMs are excellent at managing and navigating large volumes of structured context.
When to Use
- Short to medium-sized documents (under 50K tokens)
- Stable, frequently-accessed context (documentation, schemas)
- When retrieval latency would hurt UX
- Context that needs to be available for every request
Implementation
<system_context>
  <user_preferences>
    <timezone>America/Los_Angeles</timezone>
    <notification_settings>email_only</notification_settings>
  </user_preferences>
  <api_schema>
    <!-- Full API specification -->
  </api_schema>
  <company_guidelines>
    <!-- Style guide, policies -->
  </company_guidelines>
</system_context>
<user_query>
{query}
</user_query>
Best Practices
When context sections are cleanly separated with XML-style tags, frontier LLMs can make good use of as much context as you can provide. Use clear, hierarchical structure to help the model navigate efficiently; a sketch follows the list below.
- Namespace context sections: Use XML tags or JSON keys
- Order by relevance: Most critical context first
- Include metadata: Timestamps, sources, confidence scores
- Monitor token usage: Track what actually gets referenced
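To make the namespacing and metadata guidelines concrete, here is a minimal sketch of a context builder. The `ContextSection` shape and `build_context_block` helper are illustrative, not part of any SDK.

```typescript
// Illustrative shape for a namespaced, metadata-tagged context section
interface ContextSection {
  tag: string          // XML namespace, e.g. "api_schema"
  content: string
  source: string       // provenance for the model and for audits
  updated_at: string   // ISO timestamp, useful for staleness checks
  relevance: number    // higher = injected earlier
}

// Assemble sections in relevance order, each wrapped in its own tag
function build_context_block(sections: ContextSection[]): string {
  return [...sections]
    .sort((a, b) => b.relevance - a.relevance)   // most critical context first
    .map(s =>
      `<${s.tag} source="${s.source}" updated_at="${s.updated_at}">\n${s.content}\n</${s.tag}>`
    )
    .join("\n")
}
```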
Pattern 2: Retrieval-Augmented Generation (RAG)
The standard RAG pattern: embed the query, retrieve top-k similar chunks from a vector database, inject them into the prompt.
When to Use
- Large knowledge bases (millions of documents)
- Frequently updated content
- When only a small subset of context is relevant per query
- Cost-sensitive applications (avoid loading full corpus)
Implementation
async function rag_query(query: string) {
  // 1. Embed the query
  const query_embedding = await embed(query)

  // 2. Retrieve top-k similar chunks
  const results = await vector_db.query({
    vector: query_embedding,
    top_k: 5,
    filter: { source: "docs", timestamp: { $gte: "2025-01-01" } }
  })

  // 3. Inject into prompt
  const context = results.map(r => r.text).join("\n\n---\n\n")
  const prompt = `Context:\n${context}\n\nQuery: ${query}\n\nAnswer:`

  // 4. Generate
  return await llm.generate(prompt)
}
Key Metrics
- Retrieval precision: % of retrieved chunks actually relevant
- Retrieval recall: % of relevant chunks successfully retrieved
- Answer faithfulness: Does the answer stay grounded in context?
- Hallucination rate: Vectara found 1-30% hallucination rates in RAG systems
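A minimal sketch of how the precision and recall metrics above can be computed per query, assuming you have a labeled evaluation set with ground-truth relevant chunk IDs (names here are illustrative):

```typescript
// Per-query retrieval metrics against a labeled evaluation set
function retrieval_metrics(retrieved_ids: string[], relevant_ids: Set<string>) {
  const hits = retrieved_ids.filter(id => relevant_ids.has(id)).length
  return {
    precision: retrieved_ids.length > 0 ? hits / retrieved_ids.length : 0,
    recall: relevant_ids.size > 0 ? hits / relevant_ids.size : 0
  }
}
```

Averaging these over an evaluation set gives the precision and recall targets used in the metrics table later in this post.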
Pattern 3: Dynamic RAG (Adaptive Retrieval)
Dynamic RAG adaptively determines when and what to retrieve during the LLM's generation process, enabling real-time adaptation to the LLM's evolving information needs.
How It Works
Instead of one-shot retrieval, the model decides during generation whether it needs more information. This is implemented through:
- Function calling to trigger retrieval mid-generation
- Agentic loops that iterate between reasoning and retrieval
- Chain-of-thought prompting to surface information gaps
Implementation
const tools = [
  {
    name: "search_documentation",
    description: "Search technical docs for specific information",
    parameters: {
      query: "string",
      doc_type: "api | guide | reference"
    }
  }
]

async function dynamic_rag(user_query: string) {
  let messages = [{ role: "user", content: user_query }]

  while (true) {
    const response = await llm.generate({
      messages,
      tools,
      tool_choice: "auto"
    })

    if (response.tool_calls) {
      // Model requested more context: keep the assistant turn (with its
      // tool calls) in the transcript so the tool results attach to it
      messages.push({ role: "assistant", content: response.content, tool_calls: response.tool_calls })
      for (const call of response.tool_calls) {
        const result = await execute_tool(call)
        messages.push({
          role: "tool",
          tool_call_id: call.id,
          content: result
        })
      }
    } else {
      // Model has enough context to answer
      return response.content
    }
  }
}
Benefits
- Retrieves only what's needed (lower token costs)
- Adapts to complex, multi-step queries
- Reduces attention dilution from irrelevant context
- Improves answer quality for exploratory questions
Pattern 4: Parametric RAG (DyPRAG)
Parametric RAG rethinks how retrieved knowledge is injected into LLMs, moving injection from the input level (prompt tokens) to the parameter level (adapter weights) to improve efficiency and knowledge integration.
How It Works
DyPRAG (Tan et al., 2025) introduces a dynamic parameter translator that generates document-specific parametric representations on-the-fly, conditioned on the retrieved document's semantic embedding.
Instead of injecting text into the prompt, the system injects knowledge directly into model parameters via adapter layers. This approach:
- Reduces context window usage (knowledge stored in parameters, not tokens)
- Improves generation speed (no long prompts to process)
- Enables better knowledge integration (parameters vs concatenation)
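The core mechanism is a learned "translator" that maps a document embedding to adapter parameters. The toy sketch below shows only the shape of that mapping; it is not the DyPRAG implementation, and `projection` stands in for the trained translator network.

```typescript
// Toy illustration of the parameter-translator idea behind parametric RAG:
// a learned projection maps a retrieved document's embedding to a flat
// vector of adapter parameters that are merged into the model for this
// request only, instead of concatenating the document text into the prompt.
function translate_to_adapter_params(
  doc_embedding: number[],   // semantic embedding of the retrieved doc
  projection: number[][]     // learned matrix: adapter_dim x embed_dim
): number[] {
  return projection.map(row =>
    row.reduce((sum, weight, i) => sum + weight * doc_embedding[i], 0)
  )
}
```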
Status
Parametric RAG is emerging research (2025) and not yet production-ready for most teams. Early benchmarks show 10-15% accuracy gains with 40% lower latency, but implementation complexity is high.
Pattern 5: Faceted Search + Context Engineering
Modern context engineering focuses on designing tool responses and interaction patterns that teach agents how to navigate data landscapes, not just consume chunks.
Key Concept: Faceted Search
Return metadata aggregations (counts, categories) alongside results so agents can refine queries strategically.
{
  "results": [
    { "id": 1, "title": "API Authentication Guide", "category": "security" },
    { "id": 2, "title": "OAuth Setup", "category": "security" }
  ],
  "facets": {
    "category": { "security": 45, "getting-started": 12, "advanced": 8 },
    "last_updated": { "2025": 30, "2024": 25, "2023": 10 },
    "doc_type": { "guide": 40, "reference": 15, "tutorial": 10 }
  },
  "total": 65,
  "query_guidance": "Try filtering by category:'security' AND last_updated:'2025' to narrow results"
}
Benefits
Agents can reason about the data landscape before committing to retrieval. This reduces failed searches and improves context quality.
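A minimal sketch of an agent-side refinement step, assuming a hypothetical `search` function whose response matches the JSON shape above:

```typescript
// Assumed search API matching the faceted JSON response shown above
declare function search(params: {
  query: string
  filter?: Record<string, string>
  include_facets?: boolean
  limit: number
}): Promise<{ total: number; facets: { category: Record<string, number> }; results: unknown[] }>

// Sketch: inspect facet counts before committing to retrieval, and refine
// the query when the first pass is too broad
async function faceted_retrieve(query: string) {
  const first_pass = await search({ query, include_facets: true, limit: 0 })
  if (first_pass.total > 20) {
    // Too broad: filter on the dominant category facet
    const [top_category] = Object.entries(first_pass.facets.category)
      .sort((a, b) => b[1] - a[1])[0]
    return search({ query, filter: { category: top_category }, limit: 5 })
  }
  return search({ query, limit: 5 })
}
```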
Pattern 6: Hierarchical Context Injection
Inject context at multiple levels of granularity: summaries first, then details on-demand.
Implementation
<context_hierarchy>
  <level_1_summary>
    <!-- 500-token executive summary of all relevant docs -->
  </level_1_summary>
  <level_2_sections>
    <section id="auth" summary="Authentication flows">
      <!-- Section-level details, 2K tokens -->
    </section>
    <section id="errors" summary="Error handling patterns">
      <!-- Section-level details, 2K tokens -->
    </section>
  </level_2_sections>
  <level_3_full_docs available="true">
    <!-- Full docs available via function call if needed -->
  </level_3_full_docs>
</context_hierarchy>
When to Use
- Large, structured knowledge bases
- When most queries need only high-level context
- To reduce token usage while maintaining access to detail
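A minimal sketch of the drill-down path hinted at by `level_3_full_docs`: a tool the model can call to expand a section on demand. The `doc_store` lookup and tool wiring are assumptions.

```typescript
// Assumed document store keyed by the section ids used in <level_2_sections>
declare const doc_store: { get(id: string): Promise<{ full_text: string } | null> }

// Tool definition the model can call when level-1/level-2 context is not enough
const expand_section_tool = {
  name: "expand_section",
  description: "Fetch full documentation for a section id listed in <level_2_sections>",
  parameters: { section_id: "string" }
}

async function expand_section(section_id: string): Promise<string> {
  const doc = await doc_store.get(section_id)   // e.g. "auth", "errors"
  return doc
    ? `<section id="${section_id}">\n${doc.full_text}\n</section>`
    : `<error>Unknown section id: ${section_id}</error>`
}
```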
Pattern 7: Cache-Augmented Generation (CAG)
CAG pre-processes documents and caches them as part of the prompt, so that those documents' content is already available to the LLM whenever it generates completions.
How It Works
// With Anthropic prompt caching
import Anthropic from "@anthropic-ai/sdk"
const anthropic = new Anthropic()

const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: "You are a helpful assistant...",
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: large_documentation_string, // Cached for 5 min
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: user_query }]
})
Benefits
- No retrieval latency (context pre-loaded)
- 90% cost reduction on cached tokens (Anthropic: $0.30/MTok vs $3/MTok)
- Perfect for frequently-accessed, stable context
- All documents available in all prompts (no retrieval precision issues)
Tradeoffs
- Limited by context window size
- Cache TTL means frequent refreshes for updated content
- Works best with static reference materials
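Given the TTL tradeoff, it is worth confirming the cache is actually being read rather than rewritten on every call. The sketch below inspects the cache-related usage fields returned alongside each response; treat the exact field names as something to verify against the current API docs.

```typescript
// Sketch: confirm the prompt cache is being hit by inspecting usage stats
type UsageStats = {
  input_tokens: number
  cache_creation_input_tokens?: number
  cache_read_input_tokens?: number
}

function log_cache_stats(usage: UsageStats) {
  console.log(
    `cache writes: ${usage.cache_creation_input_tokens ?? 0}, ` +
    `cache reads: ${usage.cache_read_input_tokens ?? 0}, ` +
    `uncached input tokens: ${usage.input_tokens}`
  )
}

// e.g. log_cache_stats(response.usage) after the call above
```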
Security: Context Injection Attacks
Most vector DBs don't enforce fine-grained access control by default. If your application pulls from multiple namespaces, a simple prompt injection can trigger cross-namespace retrieval unless isolation is enforced explicitly.
Mitigation Strategies
- Namespace isolation: Separate vector collections per tenant/user
- Metadata filtering: Enforce user_id filters on all queries
- Input sanitization: Strip retrieval directives from user queries
- Output validation: Check generated responses for leaked context
// Enforce user-scoped retrieval
const results = await vector_db.query({
  vector: query_embedding,
  filter: {
    user_id: authenticated_user.id, // ALWAYS filter by user
    status: "published"
  },
  namespace: `user_${authenticated_user.id}` // Namespace isolation
})
Production Architecture
Production RAG systems typically combine multiple patterns:
- Direct injection for system instructions and user preferences
- CAG for frequently-accessed static docs (API schemas, style guides)
- Standard RAG for large, searchable knowledge bases
- Dynamic RAG for complex queries requiring multi-step reasoning
Example Architecture
async function production_query(query: string, user: User) {
  // Layer 1: Direct injection (always present)
  const system_context = build_system_context(user)

  // Layer 2: Cached static context (CAG)
  const cached_docs = load_cached_documentation()

  // Layer 3: Retrieved context (RAG)
  const dynamic_context = await retrieve_relevant_chunks(query, user.id)

  // Layer 4: Compose prompt
  const prompt = `
<system>${system_context}</system>
<cached_docs>${cached_docs}</cached_docs>
<retrieved>${dynamic_context}</retrieved>
<query>${query}</query>
`

  // Layer 5: Dynamic retrieval if needed
  return await dynamic_rag(prompt, user, {
    allow_additional_retrieval: true,
    max_iterations: 3
  })
}
Measuring Context Injection Quality
Track these metrics to optimize your injection patterns:
| Metric | Target | Measurement |
|---|---|---|
| Context utilization | >60% | % of injected context referenced in output |
| Retrieval precision | >80% | % of retrieved chunks actually relevant |
| Retrieval recall | >90% | % of relevant chunks successfully retrieved |
| Answer faithfulness | >95% | % of answers grounded in provided context |
| Latency (P95) | <2s | Time from query to response |
| Cost per query | <$0.05 | Embedding + retrieval + generation costs |
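Context utilization is the hardest row above to automate. A crude sketch using token overlap is shown below; many teams replace this with citation tracking or an LLM-as-judge, and the 0.15 threshold is purely illustrative.

```typescript
// Rough sketch of "context utilization": what fraction of injected chunks
// appear to be referenced by the answer, via a crude token-overlap heuristic
function context_utilization(chunks: string[], answer: string): number {
  if (chunks.length === 0) return 0
  const answer_tokens = new Set(answer.toLowerCase().split(/\W+/))
  const referenced = chunks.filter(chunk => {
    const tokens = chunk.toLowerCase().split(/\W+/).filter(t => t.length > 4)
    if (tokens.length === 0) return false
    const hits = tokens.filter(t => answer_tokens.has(t)).length
    return hits / tokens.length > 0.15   // arbitrary threshold for "referenced"
  })
  return referenced.length / chunks.length
}
```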
Thread Transfer's Context Injection Pattern
Thread Transfer uses hybrid context bundling: we pre-process conversation threads into structured, semantic bundles that combine:
- Direct injection: Key decisions, action items, stakeholders
- Hierarchical context: Summary + details on-demand
- Metadata tagging: Timestamps, participants, topics for retrieval
This delivers 40-80% token savings compared to raw thread injection while maintaining higher accuracy than pure RAG retrieval of message history.
Best Practices Summary
- Start with direct injection for critical, stable context
- Add caching for frequently-accessed static materials
- Use standard RAG for large, searchable knowledge bases
- Layer in dynamic retrieval for complex queries
- Structure context with XML tags or JSON for model navigation
- Monitor utilization to identify unused context
- Enforce security with namespace isolation and metadata filtering
- Measure quality with faithfulness, precision, and recall metrics
The Future: Context Orchestration
Throwing more documents into a context window doesn't improve performance linearly. In many cases, it degrades performance due to "attention dilution" where the model's attention focus spreads too thin.
The future belongs to systems that orchestrate context: dynamically deciding what to inject, when to inject it, and how to structure it for maximum model performance. Context engineering is now a first-class architectural concern alongside storage and compute.
Learn more: How it works · Why bundles beat raw thread history