
Thread Transfer

Context Injection Patterns for LLM Applications

Context ordering matters. System prompts first, retrieved context middle, user query last. But it's more nuanced than that—here's the production playbook.

Jorgo Bardho

Founder, Thread Transfer

July 13, 2025 · 15 min read

context injection · prompt engineering · RAG · system prompts
[Figure: Context injection pattern diagram]

Traditional RAG systems precomputed which chunks belonged in context, injected them, and asked the model to reason over those chunks. Search was a one-shot operation. Modern context injection has evolved: dynamic retrieval, parametric knowledge injection, and faceted search are reshaping how we architect RAG systems in 2025.

The Evolution of Context Injection

RAG combines information retrieval with generation: a query is embedded, relevant chunks are fetched from a knowledge store (often a vector database), and those snippets are injected into the model prompt before generation. This lets models reflect authoritative, up-to-date sources without expensive retraining and reduces hallucinations by grounding outputs in verifiable context.

But how you inject context matters as much as what you inject. The shift from thinking about retrieval to thinking about context orchestration represents a fundamental change in how we architect AI systems.

Pattern 1: Direct Context Injection

The baseline strategy for context management should be to put all required context directly in the LLM context window. Frontier LLMs are excellent at managing and navigating large volumes of structured context.

When to Use

  • Short to medium-sized documents (under 50K tokens)
  • Stable, frequently-accessed context (documentation, schemas)
  • When retrieval latency would hurt UX
  • Context that needs to be available for every request

Implementation

<system_context>
  <user_preferences>
    <timezone>America/Los_Angeles</timezone>
    <notification_settings>email_only</notification_settings>
  </user_preferences>

  <api_schema>
    <!-- Full API specification -->
  </api_schema>

  <company_guidelines>
    <!-- Style guide, policies -->
  </company_guidelines>
</system_context>

<user_query>
  {query}
</user_query>

Best Practices

When context is cleanly separated with XML-style tags, frontier LLMs handle large volumes of it well, up to the limits of the context window. Use clear, hierarchical structure to help the model navigate efficiently.

  • Namespace context sections: Use XML tags or JSON keys
  • Order by relevance: Most critical context first
  • Include metadata: Timestamps, sources, confidence scores
  • Monitor token usage: Track what actually gets referenced
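
Here is a minimal sketch of what that structure can look like in code. The ContextSection shape, the section names, and the rough 4-characters-per-token estimate are illustrative assumptions rather than any particular library's API:

interface ContextSection {
  name: string          // becomes the XML tag, e.g. "user_preferences"
  content: string
  source?: string       // metadata: where this context came from
  updated_at?: string   // metadata: freshness timestamp
}

// Build a namespaced <system_context> block; pass sections ordered by relevance
function build_context_block(sections: ContextSection[]): string {
  const blocks = sections.map(s => {
    const meta =
      (s.source ? ` source="${s.source}"` : "") +
      (s.updated_at ? ` updated_at="${s.updated_at}"` : "")
    return `<${s.name}${meta}>\n${s.content}\n</${s.name}>`
  })
  return `<system_context>\n${blocks.join("\n\n")}\n</system_context>`
}

// Rough token estimate (~4 characters per token) for monitoring usage
function estimate_tokens(text: string): number {
  return Math.ceil(text.length / 4)
}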

Pattern 2: Retrieval-Augmented Generation (RAG)

The standard RAG pattern: embed the query, retrieve top-k similar chunks from a vector database, inject them into the prompt.

When to Use

  • Large knowledge bases (millions of documents)
  • Frequently updated content
  • When only a small subset of context is relevant per query
  • Cost-sensitive applications (avoid loading full corpus)

Implementation

async function rag_query(query: string) {
  // 1. Embed the query
  const query_embedding = await embed(query)

  // 2. Retrieve top-k similar chunks
  const results = await vector_db.query({
    vector: query_embedding,
    top_k: 5,
    filter: { source: "docs", timestamp: { $gte: "2025-01-01" } }
  })

  // 3. Inject into prompt
  const context = results.map(r => r.text).join("\n\n---\n\n")
  const prompt = `Context:\n${context}\n\nQuery: ${query}\n\nAnswer:`

  // 4. Generate
  return await llm.generate(prompt)
}

Key Metrics

  • Retrieval precision: % of retrieved chunks actually relevant
  • Retrieval recall: % of relevant chunks successfully retrieved
  • Answer faithfulness: Does the answer stay grounded in context?
  • Hallucination rate: Vectara found 1-30% hallucination rates in RAG systems
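
The first two metrics are straightforward to measure offline against a labeled evaluation set. A sketch, assuming you have ground-truth relevant chunk ids per query (the EvalExample shape is illustrative):

interface EvalExample {
  retrieved_ids: string[]   // chunk ids returned by the retriever
  relevant_ids: string[]    // ground-truth relevant chunk ids for the query
}

function retrieval_metrics(examples: EvalExample[]) {
  let precision_sum = 0
  let recall_sum = 0
  for (const ex of examples) {
    const relevant = new Set(ex.relevant_ids)
    const hits = ex.retrieved_ids.filter(id => relevant.has(id)).length
    precision_sum += ex.retrieved_ids.length ? hits / ex.retrieved_ids.length : 0
    recall_sum += relevant.size ? hits / relevant.size : 0
  }
  return {
    precision: precision_sum / examples.length,  // avg % of retrieved chunks that are relevant
    recall: recall_sum / examples.length         // avg % of relevant chunks that were retrieved
  }
}

Answer faithfulness and hallucination rate usually require an LLM-as-judge or human review rather than a simple set comparison.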

Pattern 3: Dynamic RAG (Adaptive Retrieval)

Dynamic RAG decides when and what to retrieve during the LLM's generation process, adapting in real time to the model's evolving information needs.

How It Works

Instead of one-shot retrieval, the model decides during generation whether it needs more information. This is implemented through:

  • Function calling to trigger retrieval mid-generation
  • Agentic loops that iterate between reasoning and retrieval
  • Chain-of-thought prompting to surface information gaps

Implementation

const tools = [
  {
    name: "search_documentation",
    description: "Search technical docs for specific information",
    parameters: {
      query: "string",
      doc_type: "api | guide | reference"
    }
  }
]

async function dynamic_rag(user_query: string) {
  let messages = [{ role: "user", content: user_query }]

  while (true) {
    const response = await llm.generate({
      messages,
      tools,
      tool_choice: "auto"
    })

    if (response.tool_calls) {
      // Model requested more context: keep its tool-calling turn in the
      // message history (exact shape depends on your SDK), then append results
      messages.push(response)
      for (const call of response.tool_calls) {
        const result = await execute_tool(call)
        messages.push({
          role: "tool",
          tool_call_id: call.id,
          content: result
        })
      }
    } else {
      // Model has enough context to answer
      return response.content
    }
  }
}

Benefits

  • Retrieves only what's needed (lower token costs)
  • Adapts to complex, multi-step queries
  • Reduces attention dilution from irrelevant context
  • Improves answer quality for exploratory questions

Pattern 4: Parametric RAG (DyPRAG)

Parametric RAG rethinks how retrieved knowledge should be injected into LLMs, transitioning from input-level to parameter-level knowledge injection for enhanced efficiency and effectiveness.

How It Works

DyPRAG (Tan et al., 2025) introduces a dynamic parameter translator that generates document-specific parametric representations on-the-fly, conditioned on the retrieved document's semantic embedding.

Instead of injecting text into the prompt, the system injects knowledge directly into model parameters via adapter layers. This approach:

  • Reduces context window usage (knowledge stored in parameters, not tokens)
  • Improves generation speed (no long prompts to process)
  • Enables better knowledge integration (parameters vs concatenation)

Status

Parametric RAG is emerging research (2025) and not yet production-ready for most teams. Early benchmarks show 10-15% accuracy gains with 40% lower latency, but implementation complexity is high.

Pattern 5: Faceted Search + Context Engineering

Modern context engineering focuses on designing tool responses and interaction patterns that teach agents how to navigate data landscapes, not just consume chunks.

Key Concept: Faceted Search

Return metadata aggregations (counts, categories) alongside results so agents can refine queries strategically.

{
  "results": [
    { "id": 1, "title": "API Authentication Guide", "category": "security" },
    { "id": 2, "title": "OAuth Setup", "category": "security" }
  ],
  "facets": {
    "category": { "security": 45, "getting-started": 12, "advanced": 8 },
    "last_updated": { "2025": 30, "2024": 25, "2023": 10 },
    "doc_type": { "guide": 40, "reference": 15, "tutorial": 10 }
  },
  "total": 65,
  "query_guidance": "Try filtering by category:'security' AND last_updated:'2025' to narrow results"
}

Benefits

Agents can reason about the data landscape before committing to retrieval. This reduces failed searches and improves context quality.
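
On the serving side, facet counts can be computed with a simple aggregation over the matching documents before the response is returned. A sketch, where the Doc fields are illustrative assumptions:

interface Doc {
  id: number
  title: string
  category: string
  doc_type: string
  year: string
}

// Count occurrences of each value for a given field across the matches
function facet_counts(docs: Doc[], field: keyof Doc): Record<string, number> {
  const counts: Record<string, number> = {}
  for (const doc of docs) {
    const value = String(doc[field])
    counts[value] = (counts[value] ?? 0) + 1
  }
  return counts
}

function search_response(matches: Doc[], top_k = 10) {
  return {
    results: matches.slice(0, top_k).map(d => ({ id: d.id, title: d.title, category: d.category })),
    facets: {
      category: facet_counts(matches, "category"),
      doc_type: facet_counts(matches, "doc_type"),
      last_updated: facet_counts(matches, "year")   // maps the year field to the last_updated facet
    },
    total: matches.length
  }
}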

Pattern 6: Hierarchical Context Injection

Inject context at multiple levels of granularity: summaries first, then details on-demand.

Implementation

<context_hierarchy>
  <level_1_summary>
    <!-- 500-token executive summary of all relevant docs -->
  </level_1_summary>

  <level_2_sections>
    <section id="auth" summary="Authentication flows">
      <!-- Section-level details, 2K tokens -->
    </section>
    <section id="errors" summary="Error handling patterns">
      <!-- Section-level details, 2K tokens -->
    </section>
  </level_2_sections>

  <level_3_full_docs available="true">
    <!-- Full docs available via function call if needed -->
  </level_3_full_docs>
</context_hierarchy>

When to Use

  • Large, structured knowledge bases
  • When most queries need only high-level context
  • To reduce token usage while maintaining access to detail
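
Level-3 detail is usually exposed behind a tool, so the model starts from the summaries and only pulls full documents when it needs them. A sketch, where get_full_document and doc_store are assumed placeholders rather than a specific SDK:

// Assumed placeholder for wherever full documents live
declare const doc_store: { get(id: string): Promise<{ full_text: string } | null> }

const hierarchy_tools = [
  {
    name: "get_full_document",
    description: "Fetch the full text of a documentation section by id",
    parameters: { section_id: "string" }
  }
]

async function get_full_document(section_id: string): Promise<string> {
  const doc = await doc_store.get(section_id)
  return doc ? doc.full_text : `No document found for section '${section_id}'`
}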

Pattern 7: Cache-Augmented Generation (CAG)

CAG pre-loads documents into the prompt and caches the resulting prefix, so that their content is available to the LLM every time it generates completions.

How It Works

// With Anthropic prompt caching
const response = await anthropic.messages.create({
  model: "claude-opus-4-5",
  system: [
    {
      type: "text",
      text: "You are a helpful assistant...",
      cache_control: { type: "ephemeral" }
    },
    {
      type: "text",
      text: large_documentation_string,  // Cached for 5 min
      cache_control: { type: "ephemeral" }
    }
  ],
  messages: [{ role: "user", content: user_query }]
})

Benefits

  • No retrieval latency (context pre-loaded)
  • 90% cost reduction on cached tokens (Anthropic: $0.30/MTok vs $3/MTok)
  • Perfect for frequently-accessed, stable context
  • All documents available in all prompts (no retrieval precision issues)
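
To make the savings concrete, assuming Sonnet-class pricing ($3/MTok input, $0.30/MTok cache reads, and Anthropic's 1.25× cache-write premium): a 100K-token documentation bundle sent with 100 requests costs about 100 × $0.30 = $30 uncached, versus roughly $0.38 for one cache write plus $2.97 for 99 cache reads, about $3.35 in total, which is close to the quoted 90% reduction.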

Tradeoffs

  • Limited by context window size
  • Cache TTL means frequent refreshes for updated content
  • Works best with static reference materials

Security: Context Injection Attacks

Most vector DBs don't enforce fine-grained access control by default. If your application pulls from multiple namespaces, a simple prompt injection can trigger cross-namespace retrieval unless isolation is enforced explicitly.

Mitigation Strategies

  • Namespace isolation: Separate vector collections per tenant/user
  • Metadata filtering: Enforce user_id filters on all queries
  • Input sanitization: Strip retrieval directives from user queries
  • Output validation: Check generated responses for leaked context

// Enforce user-scoped retrieval
const results = await vector_db.query({
  vector: query_embedding,
  filter: {
    user_id: authenticated_user.id,  // ALWAYS filter by user
    status: "published"
  },
  namespace: `user_${authenticated_user.id}`  // Namespace isolation
})
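
One way to make the user filter impossible to forget is to wrap the raw client so application code can only issue scoped queries. A sketch; SecureVectorDB and the query parameter shape are assumptions, not a specific vendor's SDK:

interface VectorQuery {
  vector: number[]
  top_k?: number
  filter?: Record<string, unknown>
}

class SecureVectorDB {
  constructor(
    private db: { query(params: VectorQuery & { namespace: string }): Promise<unknown> },
    private user_id: string
  ) {}

  async query(params: VectorQuery) {
    return this.db.query({
      ...params,
      // User scope is applied last so caller-supplied filters cannot override it
      filter: { ...params.filter, user_id: this.user_id },
      namespace: `user_${this.user_id}`
    })
  }
}

Request handlers construct SecureVectorDB with the authenticated user's id and never touch the raw client directly.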

Production Architecture

Production RAG systems typically combine multiple patterns:

  1. Direct injection for system instructions and user preferences
  2. CAG for frequently-accessed static docs (API schemas, style guides)
  3. Standard RAG for large, searchable knowledge bases
  4. Dynamic RAG for complex queries requiring multi-step reasoning

Example Architecture

async function production_query(query: string, user: User) {
  // Layer 1: Direct injection (always present)
  const system_context = build_system_context(user)

  // Layer 2: Cached static context (CAG)
  const cached_docs = load_cached_documentation()

  // Layer 3: Retrieved context (RAG)
  const dynamic_context = await retrieve_relevant_chunks(query, user.id)

  // Layer 4: Compose prompt
  const prompt = `
    <system>${system_context}</system>
    <cached_docs>${cached_docs}</cached_docs>
    <retrieved>${dynamic_context}</retrieved>
    <query>${query}</query>
  `

  // Layer 5: Dynamic retrieval if needed
  return await dynamic_rag(prompt, user, {
    allow_additional_retrieval: true,
    max_iterations: 3
  })
}

Measuring Context Injection Quality

Track these metrics to optimize your injection patterns:

Metric                | Target  | Measurement
Context utilization   | >60%    | % of injected context referenced in output
Retrieval precision   | >80%    | % of retrieved chunks actually relevant
Retrieval recall      | >90%    | % of relevant chunks successfully retrieved
Answer faithfulness   | >95%    | % of answers grounded in provided context
Latency (P95)         | <2s     | Time from query to response
Cost per query        | <$0.05  | Embedding + retrieval + generation costs
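
Context utilization is the least standardized of these; a rough lexical-overlap check is often enough to spot context that never gets used. A sketch, where the sentence splitter and the 0.5 overlap threshold are arbitrary assumptions:

// Rough heuristic: counts injected context sentences whose words
// substantially overlap with the generated answer
function context_utilization(context: string, answer: string): number {
  const answer_words = new Set(answer.toLowerCase().split(/\W+/).filter(w => w.length > 3))
  const sentences = context.split(/(?<=[.!?])\s+/).filter(s => s.trim().length > 0)
  if (sentences.length === 0) return 0

  const used = sentences.filter(sentence => {
    const words = sentence.toLowerCase().split(/\W+/).filter(w => w.length > 3)
    if (words.length === 0) return false
    const overlap = words.filter(w => answer_words.has(w)).length
    return overlap / words.length >= 0.5
  })
  return used.length / sentences.length   // fraction of context "referenced" in the output
}

For higher-fidelity measurement, a judge model that scores which passages the answer actually relied on gives a cleaner signal than lexical overlap.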

Thread Transfer's Context Injection Pattern

Thread Transfer uses hybrid context bundling: we pre-process conversation threads into structured, semantic bundles that combine:

  • Direct injection: Key decisions, action items, stakeholders
  • Hierarchical context: Summary + details on-demand
  • Metadata tagging: Timestamps, participants, topics for retrieval

This delivers 40-80% token savings compared to raw thread injection while maintaining higher accuracy than pure RAG retrieval of message history.
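
As an illustration only (not Thread Transfer's actual schema), a bundle along these lines might look like:

interface ThreadBundle {
  // Direct injection: always included in the prompt
  decisions: string[]
  action_items: { owner: string; task: string; due?: string }[]
  stakeholders: string[]

  // Hierarchical context: summary up front, details fetched on demand
  summary: string
  sections: { id: string; summary: string; detail_available: boolean }[]

  // Metadata tagging for retrieval
  metadata: {
    timestamps: { start: string; end: string }
    participants: string[]
    topics: string[]
  }
}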

Best Practices Summary

  1. Start with direct injection for critical, stable context
  2. Add caching for frequently-accessed static materials
  3. Use standard RAG for large, searchable knowledge bases
  4. Layer in dynamic retrieval for complex queries
  5. Structure context with XML tags or JSON for model navigation
  6. Monitor utilization to identify unused context
  7. Enforce security with namespace isolation and metadata filtering
  8. Measure quality with faithfulness, precision, and recall metrics

The Future: Context Orchestration

Throwing more documents into a context window doesn't improve performance linearly. In many cases it degrades performance because of "attention dilution," where the model's attention spreads too thin across loosely relevant material.

The future belongs to systems that orchestrate context: dynamically deciding what to inject, when to inject it, and how to structure it for maximum model performance. Context engineering is now a first-class architectural concern alongside storage and compute.