

Agent Memory Architectures: From Demo to Production

Mem0 achieves 26% higher accuracy than OpenAI's memory with 91% latency reduction. Token consumption falls 90%. Memory is the gap between demo agents and production agents.

Jorgo Bardho

Founder, Thread Transfer

July 7, 2025 · 20 min read
AI agents · memory · context management · Mem0 · production
[Figure: Agent memory architecture diagram]

Mem0 achieves 26% higher accuracy than OpenAI's memory on LOCOMO benchmarks. Latency drops 91% (1.44s vs 17.12s). Token consumption falls 90% (1.8K vs 26K per conversation). Yet Letta disputes the methodology and claims 74% accuracy with simpler architecture. Memory is the gap between demo agents and production agents—but which memory architecture ships vs which stays in research? This is the breakdown of short-term vs long-term memory, semantic vs episodic vs associative storage, benchmark reality, and context compression strategies that reduce token burn.

The memory hierarchy: why agents need more than context windows

LLMs have context windows. Agents need memory. The difference: context is ephemeral, memory persists. A GPT-4 session forgets everything when it closes. An agent with memory learns from every interaction and compounds knowledge over time.

The three memory types

Production agent systems implement three distinct memory layers, each solving different persistence needs:

1. Working memory (short-term)

  • Active conversation context, maintained during single session
  • Implemented via LLM context window or session state management
  • Discarded when session ends
  • Fast access (no external lookups), limited capacity (context window size)

2. Semantic memory (long-term)

  • General knowledge, facts, concepts extracted from interactions
  • Implemented via vector databases (Pinecone, Weaviate, Qdrant) + RAG retrieval
  • Persists across sessions, grows continuously
  • Retrieval based on semantic similarity, not exact match

3. Episodic/Associative memory (long-term)

  • Specific events, interactions, user preferences, entity relationships
  • Implemented via graph databases (Neo4j, Neptune) or specialized memory layers (Mem0, Letta)
  • Enables temporal reasoning ("What did user prefer last month vs this month?")
  • Supports multi-hop queries ("Who knows about this topic, who worked with them on related project?")
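
The split becomes concrete in code. Below is a minimal sketch of the three layers as plain Python structures; the class names and fields are illustrative assumptions, not any specific framework's API.

```python
# Illustrative sketch of the three memory layers; names and fields are
# assumptions for this example, not a specific framework's API.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class WorkingMemory:
    """Short-term: lives only for the session, bounded by a turn budget."""
    max_turns: int = 10
    turns: list[str] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.turns = self.turns[-self.max_turns:]  # discard the oldest beyond the budget

@dataclass
class SemanticFact:
    """Long-term general knowledge; retrieved by embedding similarity in a real system."""
    text: str
    source: str

@dataclass
class EpisodicEvent:
    """Long-term record of a specific interaction, with temporal metadata."""
    subject: str
    relation: str
    object: str
    timestamp: datetime = field(default_factory=datetime.now)

# One user turn feeds all three layers differently:
working = WorkingMemory()
working.add("User: I prefer async communication for status updates.")
semantic = [SemanticFact("Status updates can be delivered asynchronously", source="conversation")]
episodic = [EpisodicEvent("user", "prefers", "async communication")]
```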

Why context windows aren't sufficient

Claude 3.7 has a 200K token context window. GPT-4 Turbo: 128K tokens. Gemini 1.5 Pro: 1M tokens. Sounds like enough, right? Wrong. Production context consumption reveals why:

Scenario | Naive context approach | Token consumption | Problem
10-turn conversation | Pass full history every turn | 26K tokens | Linear growth per turn, 90% redundant
Multi-day customer support | Load all previous tickets in context | 80-200K tokens | Hits context limits, slow inference
Multi-agent collaboration | Share full conversation between agents | 150K+ tokens | Exponential explosion as agents grow
Continuous learning agent | Concatenate all past learnings | Context limit reached in days | Cannot scale beyond initial sessions

Memory architectures solve this by extracting and indexing only salient information. Mem0 reduces 26K tokens to 1.8K (90% reduction) by storing compressed facts instead of raw conversation history.

Semantic memory: RAG and vector databases

Retrieval-Augmented Generation (RAG) is the foundational pattern for semantic memory. Instead of stuffing all knowledge into the prompt, store it in a vector database and retrieve relevant chunks on-demand.

The RAG workflow

  1. Ingestion: Convert documents/knowledge into vector embeddings, store in vector DB
  2. Retrieval: When agent needs information, embed the query, find semantically similar chunks
  3. Augmentation: Inject retrieved chunks into LLM context
  4. Generation: LLM generates response using retrieved knowledge
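
A minimal, self-contained sketch of those four steps. The toy embed() function here just hashes character trigrams so the example runs end to end; a real system would call an embedding model instead.

```python
# Toy end-to-end RAG loop. embed() is a stand-in for a real embedding model.
import math

def embed(text: str) -> list[float]:
    # Placeholder: hash character trigrams into a fixed-size vector.
    vec = [0.0] * 64
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingestion: embed knowledge chunks and index them
chunks = ["Refunds are processed within 5 business days.",
          "Premium plans include priority support."]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the query, rank chunks by similarity
query = "How long do refunds take?"
query_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)

# 3. Augmentation: inject the top chunk into the prompt
prompt = f"Context: {ranked[0][0]}\n\nQuestion: {query}"

# 4. Generation: hand the augmented prompt to the LLM (call omitted)
print(prompt)
```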

Production advantages:

  • Knowledge base scales independently of context window size
  • Update knowledge without retraining model
  • Reduces hallucination (LLM grounds response in retrieved facts)
  • Transparent: can log what was retrieved and why

Vector database selection: what matters in production

Pinecone, Weaviate, Qdrant, FAISS, Milvus—there is no universal winner. Selection depends on architecture, scale, and constraints:

Database | Best for | Key strength | Tradeoff
Pinecone | Fast prototyping, managed service | Zero ops, automatic scaling | Vendor lock-in, cost at scale
Weaviate | Production systems needing flexibility | Hybrid search (vector + keyword), GraphQL API | Steeper learning curve
Qdrant | Performance-critical applications | Written in Rust, extremely fast | Smaller ecosystem vs Pinecone
FAISS | On-premise deployments, research | Facebook-proven, self-hosted | No built-in persistence layer
Milvus | Large-scale enterprise deployments | Handles billions of vectors | Complex setup, overkill for small scale

Agentic RAG: when agents plan retrieval

Traditional RAG: user query triggers single retrieval, LLM responds. Agentic RAG: agent plans multi-step retrieval, reasons about what information is needed, iteratively refines search.

Example workflow:

  1. User asks complex question requiring synthesis from multiple sources
  2. Agent breaks question into sub-questions
  3. Agent retrieves relevant chunks for each sub-question
  4. Agent synthesizes retrieved information
  5. If gaps remain, agent formulates follow-up retrievals
  6. Agent generates final response from aggregated context
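
As a sketch, the loop above can be written as control flow parameterized over your own decomposer, retriever, and synthesizer; the three callables and their signatures are assumptions, not a specific library's API.

```python
# Control-flow sketch of agentic RAG; decompose/retrieve/synthesize are
# hypothetical callables supplied by your stack.
from typing import Callable

def agentic_rag(
    question: str,
    decompose: Callable[[str], list[str]],                          # question -> sub-questions
    retrieve: Callable[[str], list[str]],                           # sub-question -> chunks
    synthesize: Callable[[str, list[str]], tuple[str, list[str]]],  # -> (draft, open gaps)
    max_rounds: int = 3,
) -> str:
    context: list[str] = []
    pending = decompose(question)                    # step 2: break into sub-questions
    draft = ""
    for _ in range(max_rounds):
        for sub in pending:
            context.extend(retrieve(sub))            # step 3: retrieve per sub-question
        draft, gaps = synthesize(question, context)  # step 4: synthesize, report gaps
        if not gaps:
            return draft                             # step 6: final answer
        pending = gaps                               # step 5: follow-up retrievals next round
    return draft                                     # retrieval budget exhausted: best effort
```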

Agentic RAG requires heavier infrastructure than basic RAG: low-latency vector DB (multiple sequential retrievals amplify latency), support for complex filters (agent needs to narrow search dynamically), and high-quality embeddings (poor embeddings derail multi-step reasoning).

Episodic and associative memory: graphs and structured storage

Semantic memory answers "What do I know about X?" Episodic memory answers "What happened when?" Associative memory answers "How are X and Y related?" Different storage patterns required.

Graph-based memory: Mem0g and GraphRAG

Graph databases model entities and relationships explicitly. Instead of embedding full documents, extract entities (people, projects, dates, decisions) and relationships (worked with, depends on, prefers).

Mem0g architecture:

  1. Extraction phase: LLM converts conversation into entity-relation triplets (User, prefers, async communication)
  2. Update phase: New triplets merged into existing graph with conflict detection
  3. Retrieval phase: Agent queries graph for relevant entities and relationships
  4. Reasoning phase: LLM synthesizes answer from graph context
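
An illustrative sketch of the extract-and-update pattern (not Mem0g's actual implementation): triplets keyed by (subject, relation), with the simplest possible conflict check on insert.

```python
# Toy triplet store with conflict detection on (subject, relation) keys.
from datetime import datetime

Triplet = tuple[str, str, str]

graph: dict[tuple[str, str], dict] = {}        # (subject, relation) -> {"object", "updated_at"}
conflicts: list[tuple[Triplet, str]] = []      # (new triplet, previous object)

def upsert(triplet: Triplet) -> None:
    subject, relation, obj = triplet
    key = (subject, relation)
    existing = graph.get(key)
    if existing and existing["object"] != obj:
        # Conflict: record it, then last-write-wins. A production system might
        # version the old value or let an LLM resolve contextually instead.
        conflicts.append((triplet, existing["object"]))
    graph[key] = {"object": obj, "updated_at": datetime.now()}

# Extraction phase output (normally produced by an LLM over the conversation):
upsert(("user", "prefers", "async communication"))
upsert(("user", "works_on", "billing migration"))
# A later conversation contradicts the first fact:
upsert(("user", "prefers", "real-time updates"))

print(graph[("user", "prefers")]["object"])    # real-time updates
print(conflicts)                               # one recorded conflict
```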

When graph memory outperforms vector memory:

  • Temporal reasoning (preferences changing over time)
  • Multi-hop queries (who knows someone who worked on related topic)
  • Relationship-heavy domains (org charts, project dependencies, social networks)

LOCOMO benchmark: Mem0g achieves 2% higher score than base Mem0 on multi-hop and temporal questions. But graph extraction adds latency and LLM cost (entity extraction requires additional LLM calls).

Key-value memory: fast fact retrieval

Not all memory needs semantic search or graph traversal. Simple facts ("user timezone: PST", "API key: xyz123") are better stored in key-value stores (Redis, DynamoDB) for sub-millisecond retrieval.

Hybrid memory architectures combine all three:

  • Key-value for user preferences, API credentials, session state
  • Vector DB for semantic knowledge retrieval
  • Graph DB for entity relationships and temporal reasoning

Mem0 implements this hybrid approach: key-value for fast facts, vectors for semantic search, graph for complex relationships. Result: 91% latency reduction vs full-context approach.
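
A minimal façade over the three stores shows the shape of a hybrid recall; the kv, vector_db, and graph_db clients and their method names are placeholders for whatever clients you use, not any product's API.

```python
# Interface sketch: fan one recall out across key-value, vector, and graph stores.
class HybridMemory:
    def __init__(self, kv, vector_db, graph_db):
        self.kv = kv                # e.g. a Redis/DynamoDB client (hypothetical interface)
        self.vector_db = vector_db  # e.g. a Qdrant/Weaviate client (hypothetical interface)
        self.graph_db = graph_db    # e.g. a Neo4j client (hypothetical interface)

    def recall(self, user_id: str, query: str) -> dict:
        return {
            # sub-millisecond facts: timezone, plan tier, session state
            "profile": self.kv.get(user_id),
            # semantic knowledge: top-k chunks similar to the query
            "knowledge": self.vector_db.search(query, top_k=5),
            # relationships: entities within a bounded number of hops
            "relations": self.graph_db.neighbors(user_id, max_hops=2),
        }
```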

Memory system benchmarks: the LOCOMO reality

Long-term Conversation Memory (LOCOMO) is the standard benchmark for evaluating agent memory across single-hop, multi-hop, temporal, and open-domain questions. But the published results are disputed.

Mem0's claims

System | LOCOMO score (LLM-as-Judge) | Improvement
OpenAI memory | 52.9% | Baseline
Mem0 (base) | 66.9% | +26% relative
Mem0g (graph) | 68.5% | +29% relative

Efficiency metrics:

  • Latency (p95): 1.44s (Mem0) vs 17.12s (full context) = 91% reduction
  • Token consumption: 1.8K (Mem0) vs 26K (full context) = 90% reduction

Letta's counterclaim

Letta (formerly MemGPT) published that their simple agent achieves 74.0% on LOCOMO with GPT-4o mini and minimal prompt tuning—significantly above Mem0g's reported 68.5%. Letta also noted they could not reproduce Mem0's benchmarking methodology and received no response to clarification requests.

What this reveals:

  • Memory system benchmarking is immature (no standardized evaluation protocol)
  • LLM-as-Judge metrics are sensitive to prompt engineering and model choice
  • Production viability depends on more than accuracy (latency, cost, reliability matter)

Independent validation needed. Until then: test on your own data, measure latency and cost in your environment, don't trust published benchmarks alone.

Production memory architectures: case studies

Netflix: personalized content recommendations with episodic memory

Problem: recommend content based on viewing history, but context window can't hold months of watch history per user. Naive approach: embed recent watches. Result: misses long-term preferences.

Memory architecture:

  • Episodic memory: store every watch event (title, date, completion rate, genre)
  • Vector memory: embed user preference profiles extracted from watch history
  • Key-value: store user settings (preferred language, maturity filters)

Retrieval strategy:

  1. Load user preferences from key-value (instant)
  2. Retrieve semantically similar content from vector DB based on preference embeddings
  3. Filter by recent episodic memory (exclude recently watched, prioritize incomplete series)
  4. LLM generates personalized recommendation with reasoning

Result: recommendation quality improved without increasing context size. Agent maintains long-term preference understanding while adapting to recent behavior.

Lemonade: customer support with multi-session memory

Problem: customer support conversations span multiple sessions over weeks. Agent needs context from previous interactions without re-reading entire ticket history.

Memory architecture:

  • Semantic memory: knowledge base of policy details, FAQs, resolution procedures
  • Episodic memory: customer interaction history (issues reported, resolutions, sentiment)
  • Graph memory: entity relationships (customer, policy, claims, agents who helped)

Workflow:

  1. Customer starts new conversation
  2. Agent retrieves customer graph (previous issues, policy details, interaction patterns)
  3. Agent retrieves semantically similar past resolutions from vector DB
  4. Agent loads compressed summary of most recent interactions (not full transcripts)
  5. Agent responds with full context, updates episodic memory with new interaction

Result: agents provide continuity across sessions without context window explosion. Memory update after each interaction enables continuous learning.

Rocket Money: financial advice with temporal reasoning

Problem: provide spending advice based on changing financial patterns over time. Cannot fit 12 months of transaction data in context.

Memory architecture:

  • Key-value: current account balances, user goals, budget limits
  • Episodic memory: transaction history with temporal metadata
  • Graph memory: spending categories, merchant relationships, recurring payments

Temporal reasoning example:

User asks: "Am I spending more on dining than usual?"

  1. Agent queries episodic memory for dining transactions in last 30 days
  2. Agent queries episodic memory for dining transactions 30-60 days ago
  3. Agent compares totals, identifies trend
  4. Agent retrieves specific high-cost dining events from episodic memory
  5. Agent responds: "Yes, 22% increase. Driven by 3 dinners at [merchants] totaling $X"

Result: temporal reasoning without storing full transaction log in context. Graph memory enables "which merchants are new?" queries.
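
A runnable version of that comparison over an in-memory episodic store; the transactions, field names, and 30-day windows are illustrative.

```python
# Temporal comparison over episodic memory: last 30 days vs the 30 days before.
from datetime import datetime, timedelta

transactions = [
    {"category": "dining", "amount": 42.0, "merchant": "Luigi's", "ts": datetime(2025, 7, 1)},
    {"category": "dining", "amount": 31.0, "merchant": "Omakase", "ts": datetime(2025, 6, 20)},
    {"category": "dining", "amount": 60.0, "merchant": "Cafe Rio", "ts": datetime(2025, 5, 20)},
]

def dining_total(start: datetime, end: datetime) -> float:
    return sum(t["amount"] for t in transactions
               if t["category"] == "dining" and start <= t["ts"] < end)

now = datetime(2025, 7, 7)
recent = dining_total(now - timedelta(days=30), now)                          # last 30 days
previous = dining_total(now - timedelta(days=60), now - timedelta(days=30))   # 30-60 days ago
if previous:
    change = (recent - previous) / previous * 100
    print(f"Dining spend changed {change:+.0f}% vs the prior 30 days")        # +22%
```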

Memory management: what to remember, what to forget

Infinite memory creates infinite problems. Memory corruption, staleness, noise accumulation, retrieval latency, storage cost. Production memory systems need garbage collection.

Memory pruning strategies

1. Time-based decay

  • Age out old memories based on last access time
  • Works well for episodic memory (old interactions become less relevant)
  • Risk: loses important long-term context

2. Importance scoring

  • LLM assigns importance score to each memory on creation
  • Prune low-importance memories when storage limit reached
  • Risk: LLM might misjudge importance, delete critical context

3. Access-based retention

  • Keep memories that get retrieved frequently, prune rarely-accessed ones
  • Adaptive: naturally retains what's useful
  • Risk: cold-start problem (new memories compete with established ones)

4. Hierarchical summarization

  • Don't delete old memories, compress them into summaries
  • Store detailed memories for recent interactions, summaries for old ones
  • Best of both worlds: preserve long-term context, control storage growth
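
Strategies 1-3 can be combined into a single retention score, as in the sketch below; the half-life and the log boost for access counts are tuning assumptions, not recommendations.

```python
# Retention score = importance x time decay x access boost; prune the lowest scores
# (or, better, compress them into summaries per strategy 4).
import math
import time

def retention_score(memory: dict, now: float, half_life_days: float = 30.0) -> float:
    age_days = (now - memory["last_access"]) / 86400
    decay = 0.5 ** (age_days / half_life_days)            # time-based decay
    boost = 1.0 + math.log1p(memory["access_count"])      # +1 so new memories aren't zeroed out
    return memory["importance"] * decay * boost

def prune(memories: list[dict], keep: int) -> list[dict]:
    now = time.time()
    ranked = sorted(memories, key=lambda m: retention_score(m, now), reverse=True)
    return ranked[:keep]    # the tail is a candidate for summarization, not silent deletion
```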

Conflict resolution in memory updates

New information contradicts existing memory. Example: user says "I prefer async communication" in January, then "I need real-time updates" in March. How does the agent resolve the conflict?

Strategies:

Strategy | Approach | Use case
Last-write-wins | Replace old memory with new | Simple preferences that change over time
Versioning | Keep both, timestamp each | Track preference evolution, temporal reasoning
Contextual resolution | LLM decides which memory is relevant based on current context | Preferences that depend on situation (work vs personal)
Confidence weighting | Weight memories by confidence, blend on conflict | Uncertain or probabilistic information

Mem0g uses contextual resolution: when inserting conflicting triplet, LLM determines whether to replace, version, or merge based on semantic analysis.
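
The versioning strategy from the table is simple to sketch: keep every value with a timestamp, return the latest by default, and answer "as of" queries for temporal reasoning.

```python
# Versioned key-value memory: latest value by default, point-in-time on request.
from bisect import bisect_right
from datetime import datetime
from typing import Optional

history: dict[str, list[tuple[datetime, str]]] = {}

def remember(key: str, value: str, at: datetime) -> None:
    history.setdefault(key, []).append((at, value))
    history[key].sort()                                   # keep versions in time order

def recall(key: str, as_of: Optional[datetime] = None) -> Optional[str]:
    versions = history.get(key, [])
    if not versions:
        return None
    if as_of is None:
        return versions[-1][1]                            # most recent version
    idx = bisect_right([ts for ts, _ in versions], as_of)
    return versions[idx - 1][1] if idx else None          # version in effect at `as_of`

remember("communication_preference", "async", datetime(2025, 1, 10))
remember("communication_preference", "real-time", datetime(2025, 3, 2))
print(recall("communication_preference"))                          # real-time
print(recall("communication_preference", datetime(2025, 2, 1)))    # async
```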

Context compression: the Thread Transfer approach

Memory systems store historical context. But agents still need to work with active conversation context. Naive approach: concatenate everything. Result: token explosion. Compression required.

Context bundle compression strategy

Thread Transfer bundles implement progressive compression:

  1. Initial turns (1-3): Preserve verbatim (agent needs full detail for immediate context)
  2. Recent turns (4-10): Extract decisions, action items, key findings. Omit conversational filler
  3. Historical turns (11+): Aggressive compression, only critical facts and outcomes
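
A sketch of those tiers as a single function, assuming tiers are counted back from the most recent turn and that summarize() is whatever summarizer you already use (an LLM call in practice).

```python
# Progressive compression: verbatim tail, summarized middle, heavily compressed head.
from typing import Callable

def compress_history(turns: list[str], summarize: Callable[[list[str]], str]) -> list[str]:
    n = len(turns)
    recent_cutoff = max(0, n - 3)       # last 3 turns: keep verbatim
    mid_cutoff = max(0, n - 10)         # turns 4-10 back: decisions and key findings only
    compressed: list[str] = []
    if turns[:mid_cutoff]:
        compressed.append("HISTORY: " + summarize(turns[:mid_cutoff]))              # aggressive
    if turns[mid_cutoff:recent_cutoff]:
        compressed.append("RECENT: " + summarize(turns[mid_cutoff:recent_cutoff]))  # key points
    compressed.extend(turns[recent_cutoff:])                                        # verbatim
    return compressed
```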

Compression rules by content type:

Content type | Compression approach
Decisions made | Preserve in full with attribution and timestamp
Action items | Preserve in full until completed, then summarize
Exploratory reasoning | Compress to final conclusion only
Rejected options | Omit entirely unless reasoning for rejection is important
Background context | Extract to separate knowledge base, reference by ID
User preferences | Extract to key-value memory, remove from conversation history

Measured compression efficiency

Real-world conversation, 15 turns, multi-agent collaboration:

  • Full history approach: 32,400 tokens
  • Naive summarization (LLM summarizes every 5 turns): 18,200 tokens
  • Thread Transfer progressive compression: 12,800 tokens (-60% vs full history)
  • Thread Transfer + memory extraction (preferences/facts to KV store): 8,400 tokens (-74% vs full history)

Compression preserves fidelity. Test: human reviewers evaluated whether compressed bundles contained sufficient context for next agent to continue work. Result: 94% fidelity score for Thread Transfer compression vs 97% for full history (3 percentage point drop for 74% token savings).

Latency optimization in memory retrieval

Memory retrieval adds latency to every agent action. Production systems need sub-second retrieval even as memory grows.

Latency breakdown in memory-augmented agents

Operation | Typical latency | Optimization approach
Query embedding generation | 50-200ms | Cache embeddings for common queries
Vector similarity search | 10-100ms | Use approximate nearest neighbor (ANN) algorithms
Graph traversal | 20-500ms | Index hot paths, limit traversal depth
Key-value lookup | 1-10ms | Keep in-memory cache for frequently accessed keys
Memory reranking | 100-300ms | Rerank only top-K candidates, not all results

Parallel retrieval strategy:

Instead of sequential lookups (key-value, then vector, then graph), issue all retrievals in parallel. Aggregate results when all complete. Reduces total latency from sum to max of individual retrievals.

Example:

  • Sequential: KV (5ms) + vector (80ms) + graph (120ms) = 205ms
  • Parallel: max(5ms, 80ms, 120ms) = 120ms (-41%)
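
A sketch of the fan-out with asyncio; the three coroutines simulate store latencies with sleeps and stand in for real KV, vector, and graph clients.

```python
# Parallel fan-out: total latency ~ max(5, 80, 120) ms instead of the 205 ms sum.
import asyncio

async def fetch_kv(user_id: str) -> dict:
    await asyncio.sleep(0.005)     # ~5 ms key-value lookup
    return {"timezone": "PST"}

async def fetch_vector(query: str) -> list[str]:
    await asyncio.sleep(0.080)     # ~80 ms similarity search
    return ["relevant knowledge chunk"]

async def fetch_graph(user_id: str) -> list[tuple]:
    await asyncio.sleep(0.120)     # ~120 ms bounded graph traversal
    return [("user", "prefers", "async communication")]

async def recall(user_id: str, query: str):
    return await asyncio.gather(fetch_kv(user_id), fetch_vector(query), fetch_graph(user_id))

profile, chunks, relations = asyncio.run(recall("u123", "billing question"))
print(profile, chunks, relations)
```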

Retrieval budget management

Agentic RAG can trigger dozens of retrievals per task. Each retrieval adds latency and cost. Production systems need retrieval budgets.

Budget enforcement:

  • Max retrievals per agent action: 5-10 typical
  • Max tokens retrieved per action: 2K-5K typical
  • Timeout per retrieval: 500ms-1s typical

When the budget is exhausted, the agent works with what it has or fails gracefully. Better than infinite retrieval loops that time out after 30 seconds.
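
A sketch of budget enforcement wrapped around a retriever; the limits mirror the typical values above, the token estimate is a rough characters-per-token heuristic, and the retrieve callable is whatever your stack provides.

```python
# Budget wrapper: stop retrieving once call, token, or latency limits are hit.
import time
from typing import Callable

class RetrievalBudget:
    def __init__(self, max_calls: int = 8, max_tokens: int = 4000, timeout_s: float = 1.0):
        self.max_calls, self.max_tokens, self.timeout_s = max_calls, max_tokens, timeout_s
        self.calls = 0
        self.tokens = 0

    def fetch(self, retrieve: Callable[[str], list[str]], query: str) -> list[str]:
        if self.calls >= self.max_calls or self.tokens >= self.max_tokens:
            return []                          # budget exhausted: work with what you have
        self.calls += 1
        start = time.monotonic()
        chunks = retrieve(query)               # a real client would also take a timeout
        if time.monotonic() - start > self.timeout_s:
            return []                          # too slow: fail fast instead of stalling
        self.tokens += sum(len(c) // 4 for c in chunks)   # ~4 chars per token heuristic
        return chunks
```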

Implementation checklist: shipping memory-augmented agents

Architecture decisions

  • Identify which memory types are needed (semantic, episodic, associative, or hybrid)
  • Select storage backends (vector DB, graph DB, key-value) based on access patterns
  • Define memory extraction rules (what gets stored from each interaction)
  • Establish memory pruning policy (time-based, importance-based, or hybrid)

Performance requirements

  • Set latency budget for memory retrieval (p95 target)
  • Define retrieval limits (max retrievals per action, max tokens retrieved)
  • Implement parallel retrieval to minimize total latency
  • Cache frequent queries (user preferences, common facts)

Quality assurance

  • Measure retrieval precision (% of retrieved memories that are relevant)
  • Measure retrieval recall (% of relevant memories that are retrieved); see the sketch after this list
  • Test conflict resolution (how does system handle contradictory information)
  • Validate compression fidelity (does compressed context preserve critical information)
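
For the first two items, a sketch of the computation over a labeled evaluation set, where each case pairs what was retrieved with what a human marked as relevant.

```python
# Retrieval precision and recall over sets of memory IDs.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0   # share of retrieved that matter
    recall = hits / len(relevant) if relevant else 0.0        # share of what matters we found
    return precision, recall

p, r = precision_recall({"m1", "m2", "m3"}, {"m2", "m3", "m7"})
print(f"precision={p:.2f} recall={r:.2f}")    # precision=0.67 recall=0.67
```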

Operational requirements

  • Log all memory operations (insertions, updates, retrievals) for debugging
  • Monitor memory growth rate (are you growing unbounded?)
  • Implement memory export/import for backup and portability
  • Add manual memory editing interface for correcting errors

Common failure modes and mitigations

Memory corruption

Symptom: Agent remembers incorrect information, hallucinates based on corrupted memory.

Causes:

  • LLM extracts wrong entities or relationships during memory creation
  • Conflict resolution merges incompatible information
  • User provides false information that gets stored as fact

Mitigations:

  • Validate extracted memories before storage (confidence thresholds)
  • Version all memories with timestamps and sources
  • Implement manual correction interface for humans to fix errors
  • Add "memory confidence" metadata, weight low-confidence memories less during retrieval

Retrieval irrelevance

Symptom: Retrieved memories are not relevant to current task, agent gets distracted.

Causes:

  • Poor embedding quality (semantically unrelated content scores high)
  • Query formulation problem (agent asks wrong question)
  • Memory index pollution (too much noise in storage)

Mitigations:

  • Use reranking after initial retrieval (cross-encoder models improve relevance)
  • Implement retrieval feedback loop (agent evaluates relevance, refines query)
  • Prune low-quality memories that are never accessed
  • Add metadata filters to narrow retrieval (date ranges, entity types, source)

Latency explosion

Symptom: Agent actions take 5-10+ seconds, primarily spent on memory operations.

Causes:

  • Too many sequential retrievals in critical path
  • Graph traversals with unbounded depth
  • No caching of frequently accessed memories

Mitigations:

  • Parallelize independent retrievals (vector + graph + KV in parallel)
  • Limit graph traversal depth (2-3 hops max)
  • Cache hot memories in-memory (user preferences, frequently accessed facts)
  • Set hard timeouts on all memory operations (fail fast if retrieval is slow)

Future directions: what's next for agent memory

Multi-agent shared memory

Current systems: each agent has isolated memory. Future: agents share memory pool, contribute to collective knowledge. Challenges: conflict resolution, attribution, access control.

Continual learning without retraining

Current systems: memory provides context, but LLM weights are frozen. Future: memory updates influence model behavior without full retraining. Requires techniques like parameter-efficient fine-tuning (PEFT) integrated with memory layer.

Privacy-preserving memory

Current systems: store user data in centralized memory. Future: federated memory (user data stays local, encrypted retrieval), differential privacy in memory storage. Critical for enterprise and regulated industries.

Memory portability and standards

Current systems: vendor-specific memory formats. Future: standardized memory interchange format, enabling agent memory to transfer between systems. Thread Transfer bundles are a step in this direction—portable, deterministic context packages.

Key takeaways

  • Memory is the gap between demo agents and production agents. Context windows are insufficient for multi-session, long-term agent interactions.
  • Three memory types: working (short-term context), semantic (knowledge retrieval via RAG), episodic/associative (events and relationships via graphs).
  • Mem0 achieves 26% higher accuracy than OpenAI memory on LOCOMO, with 91% latency reduction and 90% token savings. But methodology is disputed—test on your own data.
  • Memory management is critical: garbage collection, conflict resolution, compression. Infinite memory creates infinite problems.
  • Production memory systems need latency optimization: parallel retrieval, caching, retrieval budgets. Sub-second retrieval is non-negotiable.
  • Thread Transfer bundles compress multi-turn conversations by 60-74% while preserving 94% fidelity. Progressive compression + memory extraction enables scalable context management.