

Agent Memory Architectures: From Demo to Production

Mem0 achieves 26% higher accuracy than OpenAI's memory with 91% latency reduction. Token consumption falls 90%. Memory is the gap between demo agents and production agents.

Jorgo Bardho

Founder, Thread Transfer

July 7, 2025 · 20 min read
AI agents · memory · context management · Mem0 · production
[Figure: Agent memory architecture diagram]

Mem0 achieves 26% higher accuracy than OpenAI's memory on LOCOMO benchmarks. Latency drops 91% (1.44s vs 17.12s). Token consumption falls 90% (1.8K vs 26K per conversation). Yet Letta disputes the methodology and claims 74% accuracy with simpler architecture. Memory is the gap between demo agents and production agents—but which memory architecture ships vs which stays in research? This is the breakdown of short-term vs long-term memory, semantic vs episodic vs associative storage, benchmark reality, and context compression strategies that reduce token burn.

The memory hierarchy: why agents need more than context windows

LLMs have context windows. Agents need memory. The difference: context is ephemeral, memory persists. A GPT-4 session forgets everything when it closes. An agent with memory learns from every interaction and compounds knowledge over time.

The three memory types

Production agent systems implement three distinct memory layers, each solving different persistence needs:

1. Working memory (short-term)

  • Active conversation context, maintained during single session
  • Implemented via LLM context window or session state management
  • Discarded when session ends
  • Fast access (no external lookups), limited capacity (context window size)

2. Semantic memory (long-term)

  • General knowledge, facts, concepts extracted from interactions
  • Implemented via vector databases (Pinecone, Weaviate, Qdrant) + RAG retrieval
  • Persists across sessions, grows continuously
  • Retrieval based on semantic similarity, not exact match

3. Episodic/Associative memory (long-term)

  • Specific events, interactions, user preferences, entity relationships
  • Implemented via graph databases (Neo4j, Neptune) or specialized memory layers (Mem0, Letta)
  • Enables temporal reasoning ("What did user prefer last month vs this month?")
  • Supports multi-hop queries ("Who knows about this topic, who worked with them on related project?")
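
The split becomes concrete in code. Below is a minimal sketch of the three layers as plain Python structures; the class names and fields are illustrative assumptions, not any specific framework's API.

```python
# Illustrative sketch of the three memory layers; names and fields are
# assumptions for this example, not a specific framework's API.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class WorkingMemory:
    """Short-term: lives only for the session, bounded by a turn budget."""
    max_turns: int = 10
    turns: list[str] = field(default_factory=list)

    def add(self, turn: str) -> None:
        self.turns.append(turn)
        self.turns = self.turns[-self.max_turns:]  # discard the oldest beyond the budget

@dataclass
class SemanticFact:
    """Long-term general knowledge; retrieved by embedding similarity in a real system."""
    text: str
    source: str

@dataclass
class EpisodicEvent:
    """Long-term record of a specific interaction, with temporal metadata."""
    subject: str
    relation: str
    object: str
    timestamp: datetime = field(default_factory=datetime.now)

# One user turn feeds all three layers differently:
working = WorkingMemory()
working.add("User: I prefer async communication for status updates.")
semantic = [SemanticFact("Status updates can be delivered asynchronously", source="conversation")]
episodic = [EpisodicEvent("user", "prefers", "async communication")]
```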

Why context windows aren't sufficient

Claude 3.7 has a 200K token context window. GPT-4 Turbo: 128K tokens. Gemini 1.5 Pro: 1M tokens. Sounds like enough, right? Wrong. Production context consumption reveals why:

Scenario | Naive context approach | Token consumption | Problem
10-turn conversation | Pass full history every turn | 26K tokens | Linear growth per turn, 90% redundant
Multi-day customer support | Load all previous tickets in context | 80-200K tokens | Hits context limits, slow inference
Multi-agent collaboration | Share full conversation between agents | 150K+ tokens | Exponential explosion as agents grow
Continuous learning agent | Concatenate all past learnings | Context limit reached in days | Cannot scale beyond initial sessions

Memory architectures solve this by extracting and indexing only salient information. Mem0 reduces 26K tokens to 1.8K (90% reduction) by storing compressed facts instead of raw conversation history.

Semantic memory: RAG and vector databases

Retrieval-Augmented Generation (RAG) is the foundational pattern for semantic memory. Instead of stuffing all knowledge into the prompt, store it in a vector database and retrieve relevant chunks on-demand.

The RAG workflow

  1. Ingestion: Convert documents/knowledge into vector embeddings, store in vector DB
  2. Retrieval: When agent needs information, embed the query, find semantically similar chunks
  3. Augmentation: Inject retrieved chunks into LLM context
  4. Generation: LLM generates response using retrieved knowledge
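
A minimal, self-contained sketch of those four steps. The toy embed() function here just hashes character trigrams so the example runs end to end; a real system would call an embedding model instead.

```python
# Toy end-to-end RAG loop. embed() is a stand-in for a real embedding model.
import math

def embed(text: str) -> list[float]:
    # Placeholder: hash character trigrams into a fixed-size vector.
    vec = [0.0] * 64
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingestion: embed knowledge chunks and index them
chunks = ["Refunds are processed within 5 business days.",
          "Premium plans include priority support."]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: embed the query, rank chunks by similarity
query = "How long do refunds take?"
query_vec = embed(query)
ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)

# 3. Augmentation: inject the top chunk into the prompt
prompt = f"Context: {ranked[0][0]}\n\nQuestion: {query}"

# 4. Generation: hand the augmented prompt to the LLM (call omitted)
print(prompt)
```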

Production advantages:

  • Knowledge base scales independently of context window size
  • Update knowledge without retraining model
  • Reduces hallucination (LLM grounds response in retrieved facts)
  • Transparent: can log what was retrieved and why

Vector database selection: what matters in production

Pinecone, Weaviate, Qdrant, FAISS, Milvus—there is no universal winner. Selection depends on architecture, scale, and constraints:

Database | Best for | Key strength | Tradeoff
Pinecone | Fast prototyping, managed service | Zero ops, automatic scaling | Vendor lock-in, cost at scale
Weaviate | Production systems needing flexibility | Hybrid search (vector + keyword), GraphQL API | Steeper learning curve
Qdrant | Performance-critical applications | Written in Rust, extremely fast | Smaller ecosystem vs Pinecone
FAISS | On-premise deployments, research | Facebook-proven, self-hosted | No built-in persistence layer
Milvus | Large-scale enterprise deployments | Handles billions of vectors | Complex setup, overkill for small scale

Agentic RAG: when agents plan retrieval

Traditional RAG: user query triggers single retrieval, LLM responds. Agentic RAG: agent plans multi-step retrieval, reasons about what information is needed, iteratively refines search.

Example workflow:

  1. User asks complex question requiring synthesis from multiple sources
  2. Agent breaks question into sub-questions
  3. Agent retrieves relevant chunks for each sub-question
  4. Agent synthesizes retrieved information
  5. If gaps remain, agent formulates follow-up retrievals
  6. Agent generates final response from aggregated context
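
As a sketch, the loop above can be written as control flow parameterized over your own decomposer, retriever, and synthesizer; the three callables and their signatures are assumptions, not a specific library's API.

```python
# Control-flow sketch of agentic RAG; decompose/retrieve/synthesize are
# hypothetical callables supplied by your stack.
from typing import Callable

def agentic_rag(
    question: str,
    decompose: Callable[[str], list[str]],                          # question -> sub-questions
    retrieve: Callable[[str], list[str]],                           # sub-question -> chunks
    synthesize: Callable[[str, list[str]], tuple[str, list[str]]],  # -> (draft, open gaps)
    max_rounds: int = 3,
) -> str:
    context: list[str] = []
    pending = decompose(question)                    # step 2: break into sub-questions
    draft = ""
    for _ in range(max_rounds):
        for sub in pending:
            context.extend(retrieve(sub))            # step 3: retrieve per sub-question
        draft, gaps = synthesize(question, context)  # step 4: synthesize, report gaps
        if not gaps:
            return draft                             # step 6: final answer
        pending = gaps                               # step 5: follow-up retrievals next round
    return draft                                     # retrieval budget exhausted: best effort
```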

Agentic RAG requires heavier infrastructure than basic RAG: low-latency vector DB (multiple sequential retrievals amplify latency), support for complex filters (agent needs to narrow search dynamically), and high-quality embeddings (poor embeddings derail multi-step reasoning).

Episodic and associative memory: graphs and structured storage

Semantic memory answers "What do I know about X?" Episodic memory answers "What happened when?" Associative memory answers "How are X and Y related?" Different storage patterns required.

Graph-based memory: Mem0g and GraphRAG

Graph databases model entities and relationships explicitly. Instead of embedding full documents, extract entities (people, projects, dates, decisions) and relationships (worked with, depends on, prefers).

Mem0g architecture:

  1. Extraction phase: LLM converts conversation into entity-relation triplets (User, prefers, async communication)
  2. Update phase: New triplets merged into existing graph with conflict detection
  3. Retrieval phase: Agent queries graph for relevant entities and relationships
  4. Reasoning phase: LLM synthesizes answer from graph context
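
An illustrative sketch of the extract-and-update pattern (not Mem0g's actual implementation): triplets keyed by (subject, relation), with the simplest possible conflict check on insert.

```python
# Toy triplet store with conflict detection on (subject, relation) keys.
from datetime import datetime

Triplet = tuple[str, str, str]

graph: dict[tuple[str, str], dict] = {}        # (subject, relation) -> {"object", "updated_at"}
conflicts: list[tuple[Triplet, str]] = []      # (new triplet, previous object)

def upsert(triplet: Triplet) -> None:
    subject, relation, obj = triplet
    key = (subject, relation)
    existing = graph.get(key)
    if existing and existing["object"] != obj:
        # Conflict: record it, then last-write-wins. A production system might
        # version the old value or let an LLM resolve contextually instead.
        conflicts.append((triplet, existing["object"]))
    graph[key] = {"object": obj, "updated_at": datetime.now()}

# Extraction phase output (normally produced by an LLM over the conversation):
upsert(("user", "prefers", "async communication"))
upsert(("user", "works_on", "billing migration"))
# A later conversation contradicts the first fact:
upsert(("user", "prefers", "real-time updates"))

print(graph[("user", "prefers")]["object"])    # real-time updates
print(conflicts)                               # one recorded conflict
```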

When graph memory outperforms vector memory:

  • Temporal reasoning (preferences changing over time)
  • Multi-hop queries (who knows someone who worked on related topic)
  • Relationship-heavy domains (org charts, project dependencies, social networks)

LOCOMO benchmark: Mem0g achieves 2% higher score than base Mem0 on multi-hop and temporal questions. But graph extraction adds latency and LLM cost (entity extraction requires additional LLM calls).

Key-value memory: fast fact retrieval

Not all memory needs semantic search or graph traversal. Simple facts ("user timezone: PST", "API key: xyz123") are better stored in key-value stores (Redis, DynamoDB) for sub-millisecond retrieval.

Hybrid memory architectures combine all three:

  • Key-value for user preferences, API credentials, session state
  • Vector DB for semantic knowledge retrieval
  • Graph DB for entity relationships and temporal reasoning

Mem0 implements this hybrid approach: key-value for fast facts, vectors for semantic search, graph for complex relationships. Result: 91% latency reduction vs full-context approach.
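
A minimal façade over the three stores shows the shape of a hybrid recall; the kv, vector_db, and graph_db clients and their method names are placeholders for whatever clients you use, not any product's API.

```python
# Interface sketch: fan one recall out across key-value, vector, and graph stores.
class HybridMemory:
    def __init__(self, kv, vector_db, graph_db):
        self.kv = kv                # e.g. a Redis/DynamoDB client (hypothetical interface)
        self.vector_db = vector_db  # e.g. a Qdrant/Weaviate client (hypothetical interface)
        self.graph_db = graph_db    # e.g. a Neo4j client (hypothetical interface)

    def recall(self, user_id: str, query: str) -> dict:
        return {
            # sub-millisecond facts: timezone, plan tier, session state
            "profile": self.kv.get(user_id),
            # semantic knowledge: top-k chunks similar to the query
            "knowledge": self.vector_db.search(query, top_k=5),
            # relationships: entities within a bounded number of hops
            "relations": self.graph_db.neighbors(user_id, max_hops=2),
        }
```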

Memory system benchmarks: the LOCOMO reality

Long-term Conversation Memory (LOCOMO) is the standard benchmark for evaluating agent memory across single-hop, multi-hop, temporal, and open-domain questions. But the published results are disputed.

Mem0's claims

System | LOCOMO score (LLM-as-Judge) | Improvement
OpenAI memory | 52.9% | Baseline
Mem0 (base) | 66.9% | +26% relative
Mem0g (graph) | 68.5% | +29% relative

Efficiency metrics:

  • Latency (p95): 1.44s (Mem0) vs 17.12s (full context) = 91% reduction
  • Token consumption: 1.8K (Mem0) vs 26K (full context) = 90% reduction

Letta's counterclaim

Letta (formerly MemGPT) published that their simple agent achieves 74.0% on LOCOMO with GPT-4o mini and minimal prompt tuning—significantly above Mem0g's reported 68.5%. Letta also noted they could not reproduce Mem0's benchmarking methodology and received no response to clarification requests.

What this reveals:

  • Memory system benchmarking is immature (no standardized evaluation protocol)
  • LLM-as-Judge metrics are sensitive to prompt engineering and model choice
  • Production viability depends on more than accuracy (latency, cost, reliability matter)

Independent validation needed. Until then: test on your own data, measure latency and cost in your environment, don't trust published benchmarks alone.

Production memory architectures: case studies

Netflix: personalized content recommendations with episodic memory

Problem: recommend content based on viewing history, but context window can't hold months of watch history per user. Naive approach: embed recent watches. Result: misses long-term preferences.

Memory architecture:

  • Episodic memory: store every watch event (title, date, completion rate, genre)
  • Vector memory: embed user preference profiles extracted from watch history
  • Key-value: store user settings (preferred language, maturity filters)

Retrieval strategy:

  1. Load user preferences from key-value (instant)
  2. Retrieve semantically similar content from vector DB based on preference embeddings
  3. Filter by recent episodic memory (exclude recently watched, prioritize incomplete series)
  4. LLM generates personalized recommendation with reasoning

Result: recommendation quality improved without increasing context size. Agent maintains long-term preference understanding while adapting to recent behavior.

Lemonade: customer support with multi-session memory

Problem: customer support conversations span multiple sessions over weeks. Agent needs context from previous interactions without re-reading entire ticket history.

Memory architecture:

  • Semantic memory: knowledge base of policy details, FAQs, resolution procedures
  • Episodic memory: customer interaction history (issues reported, resolutions, sentiment)
  • Graph memory: entity relationships (customer, policy, claims, agents who helped)

Workflow:

  1. Customer starts new conversation
  2. Agent retrieves customer graph (previous issues, policy details, interaction patterns)
  3. Agent retrieves semantically similar past resolutions from vector DB
  4. Agent loads compressed summary of most recent interactions (not full transcripts)
  5. Agent responds with full context, updates episodic memory with new interaction

Result: agents provide continuity across sessions without context window explosion. Memory update after each interaction enables continuous learning.

Rocket Money: financial advice with temporal reasoning

Problem: provide spending advice based on changing financial patterns over time. Cannot fit 12 months of transaction data in context.

Memory architecture:

  • Key-value: current account balances, user goals, budget limits
  • Episodic memory: transaction history with temporal metadata
  • Graph memory: spending categories, merchant relationships, recurring payments

Temporal reasoning example:

User asks: "Am I spending more on dining than usual?"

  1. Agent queries episodic memory for dining transactions in last 30 days
  2. Agent queries episodic memory for dining transactions 30-60 days ago
  3. Agent compares totals, identifies trend
  4. Agent retrieves specific high-cost dining events from episodic memory
  5. Agent responds: "Yes, 22% increase. Driven by 3 dinners at [merchants] totaling $X"

Result: temporal reasoning without storing full transaction log in context. Graph memory enables "which merchants are new?" queries.
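
A runnable version of that comparison over an in-memory episodic store; the transactions, field names, and 30-day windows are illustrative.

```python
# Temporal comparison over episodic memory: last 30 days vs the 30 days before.
from datetime import datetime, timedelta

transactions = [
    {"category": "dining", "amount": 42.0, "merchant": "Luigi's", "ts": datetime(2025, 7, 1)},
    {"category": "dining", "amount": 31.0, "merchant": "Omakase", "ts": datetime(2025, 6, 20)},
    {"category": "dining", "amount": 60.0, "merchant": "Cafe Rio", "ts": datetime(2025, 5, 20)},
]

def dining_total(start: datetime, end: datetime) -> float:
    return sum(t["amount"] for t in transactions
               if t["category"] == "dining" and start <= t["ts"] < end)

now = datetime(2025, 7, 7)
recent = dining_total(now - timedelta(days=30), now)                          # last 30 days
previous = dining_total(now - timedelta(days=60), now - timedelta(days=30))   # 30-60 days ago
if previous:
    change = (recent - previous) / previous * 100
    print(f"Dining spend changed {change:+.0f}% vs the prior 30 days")        # +22%
```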

Memory management: what to remember, what to forget

Infinite memory creates infinite problems. Memory corruption, staleness, noise accumulation, retrieval latency, storage cost. Production memory systems need garbage collection.

Memory pruning strategies

1. Time-based decay

  • Age out old memories based on last access time
  • Works well for episodic memory (old interactions become less relevant)
  • Risk: loses important long-term context

2. Importance scoring

  • LLM assigns importance score to each memory on creation
  • Prune low-importance memories when storage limit reached
  • Risk: LLM might misjudge importance, delete critical context

3. Access-based retention

  • Keep memories that get retrieved frequently, prune rarely-accessed ones
  • Adaptive: naturally retains what's useful
  • Risk: cold-start problem (new memories compete with established ones)

4. Hierarchical summarization

  • Don't delete old memories, compress them into summaries
  • Store detailed memories for recent interactions, summaries for old ones
  • Best of both worlds: preserve long-term context, control storage growth
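
Strategies 1-3 can be combined into a single retention score, as in the sketch below; the half-life and the log boost for access counts are tuning assumptions, not recommendations.

```python
# Retention score = importance x time decay x access boost; prune the lowest scores
# (or, better, compress them into summaries per strategy 4).
import math
import time

def retention_score(memory: dict, now: float, half_life_days: float = 30.0) -> float:
    age_days = (now - memory["last_access"]) / 86400
    decay = 0.5 ** (age_days / half_life_days)            # time-based decay
    boost = 1.0 + math.log1p(memory["access_count"])      # +1 so new memories aren't zeroed out
    return memory["importance"] * decay * boost

def prune(memories: list[dict], keep: int) -> list[dict]:
    now = time.time()
    ranked = sorted(memories, key=lambda m: retention_score(m, now), reverse=True)
    return ranked[:keep]    # the tail is a candidate for summarization, not silent deletion
```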

Conflict resolution in memory updates

New information contradicts existing memory. Example: user says "I prefer async communication" in January, then "I need real-time updates" in March. How does the agent resolve the conflict?

Strategies:

Strategy | Approach | Use case
Last-write-wins | Replace old memory with new | Simple preferences that change over time
Versioning | Keep both, timestamp each | Track preference evolution, temporal reasoning
Contextual resolution | LLM decides which memory is relevant based on current context | Preferences that depend on situation (work vs personal)
Confidence weighting | Weight memories by confidence, blend on conflict | Uncertain or probabilistic information

Mem0g uses contextual resolution: when inserting conflicting triplet, LLM determines whether to replace, version, or merge based on semantic analysis.
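
The versioning strategy from the table is simple to sketch: keep every value with a timestamp, return the latest by default, and answer "as of" queries for temporal reasoning.

```python
# Versioned key-value memory: latest value by default, point-in-time on request.
from bisect import bisect_right
from datetime import datetime
from typing import Optional

history: dict[str, list[tuple[datetime, str]]] = {}

def remember(key: str, value: str, at: datetime) -> None:
    history.setdefault(key, []).append((at, value))
    history[key].sort()                                   # keep versions in time order

def recall(key: str, as_of: Optional[datetime] = None) -> Optional[str]:
    versions = history.get(key, [])
    if not versions:
        return None
    if as_of is None:
        return versions[-1][1]                            # most recent version
    idx = bisect_right([ts for ts, _ in versions], as_of)
    return versions[idx - 1][1] if idx else None          # version in effect at `as_of`

remember("communication_preference", "async", datetime(2025, 1, 10))
remember("communication_preference", "real-time", datetime(2025, 3, 2))
print(recall("communication_preference"))                          # real-time
print(recall("communication_preference", datetime(2025, 2, 1)))    # async
```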

Context compression: the Thread Transfer approach

Memory systems store historical context. But agents still need to work with active conversation context. Naive approach: concatenate everything. Result: token explosion. Compression required.

Context bundle compression strategy

Thread Transfer bundles implement progressive compression:

  1. Initial turns (1-3): Preserve verbatim (agent needs full detail for immediate context)
  2. Recent turns (4-10): Extract decisions, action items, key findings. Omit conversational filler
  3. Historical turns (11+): Aggressive compression, only critical facts and outcomes
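
A sketch of those tiers as a single function, assuming tiers are counted back from the most recent turn and that summarize() is whatever summarizer you already use (an LLM call in practice).

```python
# Progressive compression: verbatim tail, summarized middle, heavily compressed head.
from typing import Callable

def compress_history(turns: list[str], summarize: Callable[[list[str]], str]) -> list[str]:
    n = len(turns)
    recent_cutoff = max(0, n - 3)       # last 3 turns: keep verbatim
    mid_cutoff = max(0, n - 10)         # turns 4-10 back: decisions and key findings only
    compressed: list[str] = []
    if turns[:mid_cutoff]:
        compressed.append("HISTORY: " + summarize(turns[:mid_cutoff]))              # aggressive
    if turns[mid_cutoff:recent_cutoff]:
        compressed.append("RECENT: " + summarize(turns[mid_cutoff:recent_cutoff]))  # key points
    compressed.extend(turns[recent_cutoff:])                                        # verbatim
    return compressed
```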

Compression rules by content type:

Content type | Compression approach
Decisions made | Preserve in full with attribution and timestamp
Action items | Preserve in full until completed, then summarize
Exploratory reasoning | Compress to final conclusion only
Rejected options | Omit entirely unless reasoning for rejection is important
Background context | Extract to separate knowledge base, reference by ID
User preferences | Extract to key-value memory, remove from conversation history

Measured compression efficiency

Real-world conversation, 15 turns, multi-agent collaboration:

  • Full history approach: 32,400 tokens
  • Naive summarization (LLM summarizes every 5 turns): 18,200 tokens
  • Thread Transfer progressive compression: 12,800 tokens (-60% vs full history)
  • Thread Transfer + memory extraction (preferences/facts to KV store): 8,400 tokens (-74% vs full history)

Compression preserves fidelity. Test: human reviewers evaluated whether compressed bundles contained sufficient context for next agent to continue work. Result: 94% fidelity score for Thread Transfer compression vs 97% for full history (3 percentage point drop for 74% token savings).

Latency optimization in memory retrieval

Memory retrieval adds latency to every agent action. Production systems need sub-second retrieval even as memory grows.

Latency breakdown in memory-augmented agents

Operation | Typical latency | Optimization approach
Query embedding generation | 50-200ms | Cache embeddings for common queries
Vector similarity search | 10-100ms | Use approximate nearest neighbor (ANN) algorithms
Graph traversal | 20-500ms | Index hot paths, limit traversal depth
Key-value lookup | 1-10ms | Keep in-memory cache for frequently accessed keys
Memory reranking | 100-300ms | Rerank only top-K candidates, not all results

Parallel retrieval strategy:

Instead of sequential lookups (key-value, then vector, then graph), issue all retrievals in parallel. Aggregate results when all complete. Reduces total latency from sum to max of individual retrievals.

Example:

  • Sequential: KV (5ms) + vector (80ms) + graph (120ms) = 205ms
  • Parallel: max(5ms, 80ms, 120ms) = 120ms (-41%)
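
A sketch of the fan-out with asyncio; the three coroutines simulate store latencies with sleeps and stand in for real KV, vector, and graph clients.

```python
# Parallel fan-out: total latency ~ max(5, 80, 120) ms instead of the 205 ms sum.
import asyncio

async def fetch_kv(user_id: str) -> dict:
    await asyncio.sleep(0.005)     # ~5 ms key-value lookup
    return {"timezone": "PST"}

async def fetch_vector(query: str) -> list[str]:
    await asyncio.sleep(0.080)     # ~80 ms similarity search
    return ["relevant knowledge chunk"]

async def fetch_graph(user_id: str) -> list[tuple]:
    await asyncio.sleep(0.120)     # ~120 ms bounded graph traversal
    return [("user", "prefers", "async communication")]

async def recall(user_id: str, query: str):
    return await asyncio.gather(fetch_kv(user_id), fetch_vector(query), fetch_graph(user_id))

profile, chunks, relations = asyncio.run(recall("u123", "billing question"))
print(profile, chunks, relations)
```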

Retrieval budget management

Agentic RAG can trigger dozens of retrievals per task. Each retrieval adds latency and cost. Production systems need retrieval budgets.

Budget enforcement:

  • Max retrievals per agent action: 5-10 typical
  • Max tokens retrieved per action: 2K-5K typical
  • Timeout per retrieval: 500ms-1s typical

When the budget is exhausted, the agent works with what it has or fails gracefully. Better than infinite retrieval loops that time out after 30 seconds.
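
A sketch of budget enforcement wrapped around a retriever; the limits mirror the typical values above, the token estimate is a rough characters-per-token heuristic, and the retrieve callable is whatever your stack provides.

```python
# Budget wrapper: stop retrieving once call, token, or latency limits are hit.
import time
from typing import Callable

class RetrievalBudget:
    def __init__(self, max_calls: int = 8, max_tokens: int = 4000, timeout_s: float = 1.0):
        self.max_calls, self.max_tokens, self.timeout_s = max_calls, max_tokens, timeout_s
        self.calls = 0
        self.tokens = 0

    def fetch(self, retrieve: Callable[[str], list[str]], query: str) -> list[str]:
        if self.calls >= self.max_calls or self.tokens >= self.max_tokens:
            return []                          # budget exhausted: work with what you have
        self.calls += 1
        start = time.monotonic()
        chunks = retrieve(query)               # a real client would also take a timeout
        if time.monotonic() - start > self.timeout_s:
            return []                          # too slow: fail fast instead of stalling
        self.tokens += sum(len(c) // 4 for c in chunks)   # ~4 chars per token heuristic
        return chunks
```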

Implementation checklist: shipping memory-augmented agents

Architecture decisions

  • Identify which memory types are needed (semantic, episodic, associative, or hybrid)
  • Select storage backends (vector DB, graph DB, key-value) based on access patterns
  • Define memory extraction rules (what gets stored from each interaction)
  • Establish memory pruning policy (time-based, importance-based, or hybrid)

Performance requirements

  • Set latency budget for memory retrieval (p95 target)
  • Define retrieval limits (max retrievals per action, max tokens retrieved)
  • Implement parallel retrieval to minimize total latency
  • Cache frequent queries (user preferences, common facts)

Quality assurance

  • Measure retrieval precision (% of retrieved memories that are relevant)
  • Measure retrieval recall (% of relevant memories that are retrieved); see the sketch after this list
  • Test conflict resolution (how does system handle contradictory information)
  • Validate compression fidelity (does compressed context preserve critical information)
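
For the first two items, a sketch of the computation over a labeled evaluation set, where each case pairs what was retrieved with what a human marked as relevant.

```python
# Retrieval precision and recall over sets of memory IDs.
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0   # share of retrieved that matter
    recall = hits / len(relevant) if relevant else 0.0        # share of what matters we found
    return precision, recall

p, r = precision_recall({"m1", "m2", "m3"}, {"m2", "m3", "m7"})
print(f"precision={p:.2f} recall={r:.2f}")    # precision=0.67 recall=0.67
```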

Operational requirements

  • Log all memory operations (insertions, updates, retrievals) for debugging
  • Monitor memory growth rate (are you growing unbounded?)
  • Implement memory export/import for backup and portability
  • Add manual memory editing interface for correcting errors

Common failure modes and mitigations

Memory corruption

Symptom: Agent remembers incorrect information, hallucinates based on corrupted memory.

Causes:

  • LLM extracts wrong entities or relationships during memory creation
  • Conflict resolution merges incompatible information
  • User provides false information that gets stored as fact

Mitigations:

  • Validate extracted memories before storage (confidence thresholds)
  • Version all memories with timestamps and sources
  • Implement manual correction interface for humans to fix errors
  • Add "memory confidence" metadata, weight low-confidence memories less during retrieval

Retrieval irrelevance

Symptom: Retrieved memories are not relevant to current task, agent gets distracted.

Causes:

  • Poor embedding quality (semantically unrelated content scores high)
  • Query formulation problem (agent asks wrong question)
  • Memory index pollution (too much noise in storage)

Mitigations:

  • Use reranking after initial retrieval (cross-encoder models improve relevance)
  • Implement retrieval feedback loop (agent evaluates relevance, refines query)
  • Prune low-quality memories that are never accessed
  • Add metadata filters to narrow retrieval (date ranges, entity types, source)

Latency explosion

Symptom: Agent actions take 5-10+ seconds, primarily spent on memory operations.

Causes:

  • Too many sequential retrievals in critical path
  • Graph traversals with unbounded depth
  • No caching of frequently accessed memories

Mitigations:

  • Parallelize independent retrievals (vector + graph + KV in parallel)
  • Limit graph traversal depth (2-3 hops max)
  • Cache hot memories in-memory (user preferences, frequently accessed facts)
  • Set hard timeouts on all memory operations (fail fast if retrieval is slow)

Future directions: what's next for agent memory

Multi-agent shared memory

Current systems: each agent has isolated memory. Future: agents share memory pool, contribute to collective knowledge. Challenges: conflict resolution, attribution, access control.

Continual learning without retraining

Current systems: memory provides context, but LLM weights are frozen. Future: memory updates influence model behavior without full retraining. Requires techniques like parameter-efficient fine-tuning (PEFT) integrated with memory layer.

Privacy-preserving memory

Current systems: store user data in centralized memory. Future: federated memory (user data stays local, encrypted retrieval), differential privacy in memory storage. Critical for enterprise and regulated industries.

Memory portability and standards

Current systems: vendor-specific memory formats. Future: standardized memory interchange format, enabling agent memory to transfer between systems. Thread Transfer bundles are a step in this direction—portable, deterministic context packages.

Key takeaways

  • Memory is the gap between demo agents and production agents. Context windows are insufficient for multi-session, long-term agent interactions.
  • Three memory types: working (short-term context), semantic (knowledge retrieval via RAG), episodic/associative (events and relationships via graphs).
  • Mem0 achieves 26% higher accuracy than OpenAI memory on LOCOMO, with 91% latency reduction and 90% token savings. But methodology is disputed—test on your own data.
  • Memory management is critical: garbage collection, conflict resolution, compression. Infinite memory creates infinite problems.
  • Production memory systems need latency optimization: parallel retrieval, caching, retrieval budgets. Sub-second retrieval is non-negotiable.
  • Thread Transfer bundles compress multi-turn conversations by 60-74% while preserving 94% fidelity. Progressive compression + memory extraction enables scalable context management.