Multi-Agent System Design Patterns for Production
ChatDev achieves 33.3% correctness. Logistics systems show 27% throughput gains. Pattern choice matters more than model capability. Here's what actually works in production multi-agent systems.
Jorgo Bardho
Founder, Thread Transfer
Multi-agent systems have evolved from academic curiosities to production infrastructure. ChatDev achieves 33.3% correctness on real programming tasks. AppWorld shows an 86.7% failure rate on cross-app workflows. Many multi-agent frameworks show minimal gains over single agents. Yet logistics systems demonstrate 27% throughput gains and 22% cost reduction. The pattern matters more than the promise. This is a breakdown of proven multi-agent architectures, the coordination protocols that actually work in production, and the benchmarks that separate hype from reality.
The five core multi-agent patterns
Most production multi-agent systems map to one of five coordination patterns. Each pattern solves a different class of problem. Picking the wrong pattern is the fastest way to ship a system that looks impressive in demos and falls apart under load.
1. Sequential orchestration (pipeline pattern)
Agents execute in a fixed, linear order. Agent A processes input, passes output to Agent B, which passes to Agent C. This is the multi-agent equivalent of Unix pipes.
When to use:
- Clear dependencies between steps
- Each agent performs a distinct transformation
- Output quality improves through progressive refinement
- Deterministic workflows where order matters
Production example:
Document processing pipeline: Agent 1 extracts text from PDFs, Agent 2 classifies document type, Agent 3 extracts structured data based on classification, Agent 4 validates against business rules. Each step depends on the previous step's output. No parallelization needed. Failure at any stage halts the pipeline with clear error context.
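A minimal sketch of the pattern in Python. The stage functions are toy stand-ins for the LLM-backed agents above (names and logic are illustrative); the fixed order and the halt-with-context behavior are the point.

```python
from typing import Any, Callable

class StageError(Exception):
    """Carries the failing stage's name so a halt has clear error context."""

def run_pipeline(stages: list[tuple[str, Callable[[Any], Any]]], payload: Any) -> Any:
    """Execute agents in a fixed, linear order; each consumes the previous output."""
    for name, agent in stages:
        try:
            payload = agent(payload)
        except Exception as exc:
            raise StageError(f"stage '{name}' failed: {exc}") from exc
    return payload

# Toy stand-ins for extract -> classify -> validate (illustrative only).
def extract(doc: str) -> dict:
    return {"text": doc.strip()}

def classify(d: dict) -> dict:
    d["type"] = "invoice" if "invoice" in d["text"].lower() else "other"
    return d

def validate(d: dict) -> dict:
    if d["type"] == "other":
        raise ValueError("no business rules for this document type")
    return d

print(run_pipeline([("extract", extract), ("classify", classify),
                    ("validate", validate)], "  Invoice #42  "))
```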
Failure modes:
- Bottlenecks when one agent is significantly slower than others
- No error recovery if mid-pipeline agent fails
- Cannot handle dynamic branching or conditional logic
2. Supervisor/coordinator pattern (hierarchical orchestration)
A central coordinator agent breaks down goals into subtasks and delegates them to worker agents. The supervisor manages task allocation, monitors progress, handles failures, and aggregates results. Microsoft's TaskWeaver and most LangGraph implementations follow this pattern.
When to use:
- Complex goals that decompose into independent subtasks
- Need dynamic task allocation based on agent availability or capability
- Require centralized monitoring and error handling
- Single source of truth for task state is critical
Production example:
Customer support ticket routing: Supervisor agent receives ticket, classifies urgency and category, assigns to specialist agents (billing, technical, account), monitors resolution time, escalates if SLA breach is imminent, aggregates response for customer. Worker agents focus on domain expertise. Supervisor handles coordination overhead.
Key implementation details:
- Supervisor must be significantly more reliable than workers (use simpler model or rule-based logic)
- Include timeout and retry logic in delegation layer
- Log all task assignments for debugging coordination issues
- Monitor supervisor as single point of failure
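A sketch of the delegation layer under those constraints. `billing_agent` is a hypothetical specialist; the timeout, retry, and assignment logging are the parts that matter.

```python
import concurrent.futures as cf
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")

def delegate(worker, task, *, timeout_s: float = 10.0, retries: int = 2):
    """Assign a subtask to a worker with timeout + retry; log every assignment."""
    for attempt in range(1, retries + 2):
        log.info("assign %r -> %s (attempt %d)", task, worker.__name__, attempt)
        pool = cf.ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(worker, task).result(timeout=timeout_s)
        except Exception as exc:  # worker error or timeout
            log.warning("%s failed on %r: %s", worker.__name__, task, exc)
        finally:
            # Don't block on a hung worker; let it finish in the background.
            pool.shutdown(wait=False, cancel_futures=True)
    raise RuntimeError(f"{task!r} exhausted retries; supervisor escalates")

def billing_agent(ticket: str) -> str:  # hypothetical specialist
    return f"billing resolved: {ticket}"

print(delegate(billing_agent, "refund order #1234", timeout_s=5.0))
```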
3. Competitive pattern (evaluation-based selection)
Multiple agents independently solve the same problem. An evaluator agent (or human) selects the best solution based on criteria like accuracy, speed, cost, or creativity. This pattern trades compute cost for solution quality.
When to use:
- Problem has multiple valid solutions with varying quality
- Cost of wrong answer exceeds cost of redundant computation
- Evaluation criteria can be formalized or automated
- Latency budget allows parallel execution
Production example:
Code generation for critical infrastructure: Three agents generate implementations using different approaches (direct GPT-4, fine-tuned Codex, rule-based template + LLM refinement). Evaluator agent runs test suite on each, checks for security vulnerabilities, measures performance. Selects implementation with best test coverage and performance. Human reviews before deployment.
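In code, the pattern is parallel generation plus a single selection step. A sketch with toy generators and a toy scorer standing in for the test suite and security checks above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical generators standing in for the three approaches above.
def direct_llm(spec: str) -> str:
    return f"def solve():\n    return '{spec}'"

def finetuned_model(spec: str) -> str:
    return f"def solve():\n    return '{spec}'.strip()"

def template_refined(spec: str) -> str:
    return "def solve():\n    raise NotImplementedError"

def score(impl: str) -> float:
    """Toy proxy for running tests and security scans on a candidate."""
    checks = ["def solve", "return '"]
    return sum(c in impl for c in checks)

def best_of_n(spec: str, generators: list[Callable[[str], str]]) -> str:
    with ThreadPoolExecutor() as pool:            # candidates run in parallel
        candidates = list(pool.map(lambda g: g(spec), generators))
    return max(candidates, key=score)             # evaluator selects the winner

print(best_of_n("parse config", [direct_llm, finetuned_model, template_refined]))
```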
Cost analysis:
- 3x inference cost minimum (N agents + evaluator)
- Justified when error cost is high (medical, financial, security domains)
- Not viable for high-volume, low-margin applications
4. Network pattern (peer-to-peer coordination)
No central coordinator. Agents communicate directly with each other to coordinate work. Each agent has autonomy, specialized tools, and direct communication channels to peers. OpenAI Swarm and CrewAI implement this pattern.
When to use:
- No natural hierarchy or single point of control
- Agents need to dynamically form coalitions based on task requirements
- Resilience to single-agent failure is critical
- Communication overhead is manageable (small number of agents)
Production example:
Distributed sensor network monitoring: Each sensor agent monitors a zone, communicates with neighboring agents, detects anomalies through consensus, triggers alerts when multiple agents agree on threat. No central coordinator means no single point of failure. Agents self-organize around detected events.
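A toy version of that consensus logic, assuming simple threshold votes and a bounded neighbor list (a sketch, not the deployed system):

```python
from dataclasses import dataclass, field

@dataclass
class SensorAgent:
    name: str
    neighbors: list["SensorAgent"] = field(default_factory=list)  # bounded radius
    votes: set[str] = field(default_factory=set)

    def observe(self, reading: float, threshold: float = 0.8) -> None:
        if reading > threshold:          # local anomaly detection
            self.votes.add(self.name)
            for peer in self.neighbors:  # talk to neighbors only, never broadcast
                peer.votes.add(self.name)

    def alert(self, quorum: int = 2) -> bool:
        """Trigger only when enough peers independently agree."""
        return len(self.votes) >= quorum

a, b, c = SensorAgent("a"), SensorAgent("b"), SensorAgent("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
a.observe(0.9)
b.observe(0.95)
print(b.alert())  # True: two agents agree, no central coordinator involved
```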
Challenges:
- Consensus protocols add latency (O(N²) messages for N agents in worst case)
- Debugging is hard (no centralized log of decisions)
- Emergent behavior can be unpredictable
- Requires sophisticated agent reasoning to avoid deadlocks
5. Blackboard pattern (shared knowledge base)
Agents communicate indirectly through a shared data structure (the "blackboard"). Agents read from and write to the blackboard. A control mechanism decides which agent acts next based on blackboard state. Originally from expert systems research, now returning via vector databases and shared context stores.
When to use:
- Problem requires iterative refinement with uncertain solution path
- Multiple agents contribute partial solutions that must be synthesized
- Agent execution order depends on emerging data, not fixed workflow
- Need audit trail of all agent contributions
Production example:
Financial fraud detection: Transaction data written to blackboard. Pattern-matching agent writes anomaly scores. Historical comparison agent adds context. ML model agent generates risk prediction. Rule-based agent checks compliance thresholds. Final decision synthesizes all contributions. Each agent can act when sufficient data is available, no fixed order required.
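A compact sketch of the pattern: a shared board, knowledge sources that declare what they need and what they produce, and a control loop that fires whichever agent can act on the current state. Agent names and scores are illustrative.

```python
class Blackboard:
    def __init__(self, data: dict):
        self.data = dict(data)
        self.log = []                          # audit trail of contributions

    def write(self, agent: str, key: str, value) -> None:
        self.data[key] = value
        self.log.append((agent, key))

def anomaly_scorer(bb):  bb.write("anomaly", "score", 0.7)
def history_agent(bb):   bb.write("history", "context", "3 similar txns this week")
def risk_model(bb):      bb.write("risk", "risk", 0.6 * bb.data["score"])

# (inputs required, outputs produced, agent) -- no fixed execution order.
SOURCES = [
    ({"txn"}, {"score"}, anomaly_scorer),
    ({"txn"}, {"context"}, history_agent),
    ({"score", "context"}, {"risk"}, risk_model),
]

def control_loop(bb: Blackboard) -> None:
    """Fire any agent whose inputs are on the board and outputs are not."""
    progressed = True
    while progressed:
        progressed = False
        for needs, adds, agent in SOURCES:
            if needs <= bb.data.keys() and not adds <= bb.data.keys():
                agent(bb)
                progressed = True

bb = Blackboard({"txn": {"amount": 980}})
control_loop(bb)
print(bb.data["risk"], bb.log)  # 0.42, plus who wrote what, in order
```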
Implementation with Thread Transfer:
Context bundles act as portable blackboards. Agent A writes analysis to bundle. Agent B reads bundle, adds findings. Agent C consumes full bundle for final decision. Bundles provide deterministic snapshots of multi-agent collaboration state, enabling debugging and replay. 40-80% token savings vs. passing full conversation history between agents.
Coordination protocols: what actually works
Patterns define structure. Protocols define how agents coordinate within that structure. Production data from industrial deployments shows which protocols are battle-tested.
Contract net protocol (47% of production systems)
Agent needing work broadcasts task announcement. Other agents bid based on capability and availability. Announcing agent selects best bid, awards contract. This is the most widely deployed coordination mechanism.
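The protocol in miniature (a sketch; real deployments add the timeout and re-auction logic noted in the gotchas below). Bid costs here are random stand-ins for real capability and load estimates.

```python
import random

class Contractor:
    """A worker that bids only on announced tasks it can actually do."""
    def __init__(self, name: str, skill: str):
        self.name, self.skill = name, skill

    def bid(self, task: dict):
        if task["kind"] != self.skill:
            return None                       # out of scope: no bid
        cost = random.uniform(1, 10)          # stand-in for load/capability estimate
        return (cost, self)

    def execute(self, task: dict) -> str:
        return f"{self.name} completed {task['kind']}"

def contract_net(task: dict, contractors: list[Contractor]) -> str:
    bids = [b for c in contractors if (b := c.bid(task))]  # announcement round
    if not bids:
        raise RuntimeError("no capable agent bid; re-announce or escalate")
    _, winner = min(bids, key=lambda b: b[0])              # award the best bid
    return winner.execute(task)                            # contract = commitment

workers = [Contractor("a", "ocr"), Contractor("b", "ocr"), Contractor("c", "nlp")]
print(contract_net({"kind": "ocr"}, workers))
```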
Why it dominates:
- Simple to implement and reason about
- Naturally handles agent heterogeneity and dynamic availability
- Scales to dozens of agents without modification
- Clear accountability (contract = explicit commitment)
Gotchas:
- Bidding adds latency (2-3 message round-trips before work starts)
- No built-in mechanism for complex multi-agent tasks requiring collaboration
- Winning agent might still fail to deliver (need timeout + re-auction logic)
Market-based mechanisms (29% of production systems)
Agents buy and sell resources or task allocations using virtual currency. Prices adjust based on supply and demand. This creates emergent task prioritization and load balancing without central control.
Use cases:
- Resource allocation in shared infrastructure (compute, API quota)
- Dynamic task prioritization when agent capacity fluctuates
- Systems where different agents have different operating costs
Implementation complexity:
- Requires virtual economy design (initial currency distribution, inflation control)
- Agents need bidding strategies (can they game the system?)
- Debugging is hard (price signals are indirect)
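The core mechanism fits in a few lines: prices rise while demand exceeds capacity, and agents drop out as the price passes their task's value. A toy simulation with assumed parameters:

```python
price, capacity = 1.0, 10                # virtual currency per call; calls per tick

def demand_at(price: float, task_values: list[float]) -> int:
    """Each agent requests a resource only while its task value exceeds the price."""
    return sum(value > price for value in task_values)

task_values = [0.5, 1.2, 1.8, 2.5, 3.0] * 4   # 20 agents with varied task values

for _ in range(30):
    excess = demand_at(price, task_values) - capacity
    price *= 1 + 0.1 * excess / capacity       # excess demand raises the price
print(round(price, 2), demand_at(price, task_values))  # oscillates near clearing price
```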
Distributed constraint optimization (18% of production systems)
Agents collaboratively find solutions that satisfy shared constraints while optimizing local and global objectives. Common in logistics, scheduling, and resource allocation problems.
When to use:
- Hard constraints that all agents must respect (regulatory, physical, budget)
- Local optimization by individual agents creates global suboptimality
- Centralized optimization is infeasible due to data privacy or scale
Reality check:
Requires specialized DCOP algorithms (DPOP, Max-Sum). Implementation complexity is high. Most teams should start with simpler protocols and move to DCOP only when constraint satisfaction is the core problem.
Benchmarking multi-agent systems: the 2025 landscape
Production viability requires measurement. Three benchmarks have emerged as standards for evaluating multi-agent coordination.
MultiAgentBench: coordination quality
Evaluates collaboration and competition across diverse scenarios. Measures milestone achievement, not just final task completion. Supports multiple coordination topologies (star, chain, tree, graph).
Key findings:
| Metric | Top performer | Result |
|---|---|---|
| Highest task score | GPT-4o-mini | Beats larger models on coordination |
| Best coordination protocol | Graph structure | +8% vs star topology |
| Cognitive planning impact | All models | +3% milestone achievement |
Surprising result: smaller models with better coordination outperform larger models with worse coordination. Pattern choice matters more than model capability for many multi-agent tasks.
REALM-Bench: real-world planning
14 progressively complex planning and scheduling problems. Tests multi-agent coordination, inter-agent dependencies, dynamic environment disruptions. Compares GPT-4o, Claude-3.7, DeepSeek-R1 across LangGraph, AutoGen, CrewAI, Swarm.
What it reveals:
- Framework choice significantly impacts success rate (20-30% variance)
- Dynamic disruptions (environment changes mid-execution) cause 40-60% failure rates
- Inter-agent dependencies are the hardest coordination challenge
AgentVerse: interaction paradigm testing
Broadest environment coverage: collaborative problem-solving, competitive games, creative tasks, realistic simulations. Tests different agent architectures and communication protocols.
Use for:
- Comparing fundamentally different multi-agent approaches
- Validating that coordination works across diverse task types
- Stress-testing communication protocols under varying agent counts
Production metrics that matter
Research benchmarks are necessary but insufficient. Production systems need operational metrics:
| Metric | What it reveals | Target |
|---|---|---|
| Coordination overhead | % of time spent on agent-to-agent communication vs actual work | <20% |
| Single-agent fallback rate | % of tasks where multi-agent system reverts to single agent | <10% |
| Coordination failure rate | % of tasks failing due to coordination issues, not task difficulty | <5% |
| Token efficiency | Tokens consumed per task vs single-agent baseline | <2x single agent |
| Latency multiplier | P95 latency vs single-agent baseline | <3x single agent |
If your multi-agent system uses 5x more tokens and takes 10x longer than a single agent for 20% better accuracy, the pattern is wrong for the problem.
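These metrics are cheap to compute from logs you should already have. A sketch of the two easiest checks, with made-up numbers for the targets above:

```python
def coordination_overhead(spans: list[tuple[str, float]]) -> float:
    """spans: (kind, seconds) with kind in {'work', 'coord'}."""
    total = sum(s for _, s in spans)
    return sum(s for kind, s in spans if kind == "coord") / total

def token_efficiency(multi_agent_tokens: int, single_agent_tokens: int) -> float:
    return multi_agent_tokens / single_agent_tokens

spans = [("work", 8.0), ("coord", 1.5), ("work", 4.0), ("coord", 0.5)]
assert coordination_overhead(spans) < 0.20       # within the <20% target
assert token_efficiency(14_000, 9_000) < 2.0     # within the <2x target
```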
Production case studies: what ships vs what demos
Warehouse automation: network pattern wins
Problem: coordinate picking, packing, transportation agents in real-time warehouse operations. 200+ robots, dynamic order flow, equipment failures.
Pattern chosen: Network (peer-to-peer)
Results after 6 months:
- +27% order fulfillment rate
- -22% operational costs
- No single point of failure (supervisor pattern would have failed during equipment outages)
- Agents self-organize around bottlenecks
Critical implementation detail:
Each agent has strictly limited communication radius (only talks to 5 nearest neighbors). This prevents O(N²) message explosion. Graph-structured communication proved essential for scaling beyond 50 agents.
Predictive maintenance: supervisor pattern delivers
Problem: sensor agents monitor equipment, maintenance scheduling agents plan interventions, historical analysis agents identify failure patterns.
Pattern chosen: Supervisor (hierarchical orchestration)
Results across industrial deployments:
- 30-40% reduction in unplanned downtime
- 20-25% reduction in maintenance costs
- Centralized supervisor enables compliance auditing (critical for regulated industries)
Why supervisor worked:
Maintenance scheduling requires global optimization (can't have 5 agents all scheduling downtime simultaneously). Supervisor maintains global equipment state and resource allocation. Worker agents provide domain expertise without coordination overhead.
Document processing: sequential pattern is sufficient
Problem: legal document intake pipeline processing 10,000+ documents/day. Extract text, classify, extract entities, validate, route to human reviewers.
Pattern chosen: Sequential (pipeline)
Why not more complex:
- Each step has clear input/output contract
- No dynamic branching needed (classification determines validation rules, not routing)
- Monitoring and debugging are trivial (know exactly which stage failed)
- Horizontal scaling is simple (run more pipeline instances in parallel)
Key lesson:
Many teams over-engineer multi-agent coordination. If your workflow is deterministic and linear, sequential orchestration is sufficient. Complex patterns add latency, token overhead, and debugging complexity without benefit.
Context management in multi-agent systems
Multi-agent coordination creates context explosion. Agent A generates analysis. Agent B needs that analysis plus the original input. Agent C needs A's analysis, B's output, and the original input. Naive approach: concatenate everything. Result: token costs that compound with every hop.
The context bundle approach
Thread Transfer bundles compress multi-agent collaboration context into deterministic snapshots. Instead of passing full conversation histories between agents, a bundle distills decisions, key findings, and the context the next agent actually needs.
Example: three-agent analysis workflow
Naive approach (full history):
- Agent A receives input: 2,000 tokens
- Agent B receives input + Agent A output: 5,500 tokens
- Agent C receives input + Agent A output + Agent B output: 11,200 tokens
- Total: 18,700 tokens
Bundle approach:
- Agent A receives input: 2,000 tokens
- Agent A writes findings to bundle: 800 tokens (compressed)
- Agent B receives bundle: 2,800 tokens
- Agent B writes findings to bundle: 1,200 tokens cumulative (A's 800 plus 400 of new findings)
- Agent C receives bundle: 3,200 tokens
- Total: 8,000 tokens (-57% vs naive)
Bundles also enable audit trails. Each agent's contribution is timestamped and attributed. When multi-agent workflow produces wrong output, you can trace which agent introduced the error.
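A generic sketch of the idea (this is not Thread Transfer's actual API): agents append compressed, attributed findings to a shared bundle, and each downstream agent receives the task spec plus distilled notes instead of the full transcript.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Bundle:
    task_spec: str                      # immutable original task, always preserved
    entries: list[dict] = field(default_factory=list)

    def write(self, agent: str, findings: str) -> None:
        self.entries.append({
            "agent": agent,             # attribution for the audit trail
            "at": datetime.now(timezone.utc).isoformat(),
            "findings": findings,       # compressed summary, not the raw transcript
        })

    def render(self) -> str:
        """What the next agent receives: task spec + distilled findings only."""
        notes = "\n".join(f"[{e['agent']}] {e['findings']}" for e in self.entries)
        return f"TASK: {self.task_spec}\n{notes}"

bundle = Bundle("Assess vendor risk for ACME")
bundle.write("analyst", "3 late shipments in Q2; credit score stable")
bundle.write("compliance", "no sanctions hits; one open audit finding")
print(bundle.render())
```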
When to compress, when to preserve
Not all context should be compressed. Compression rules for multi-agent systems:
| Preserve in full | Compress aggressively | Omit entirely |
|---|---|---|
| Final decisions | Intermediate reasoning | Exploratory dead ends |
| Key findings | Supporting evidence | Redundant confirmations |
| Action items | Background context | Conversational filler |
| Constraint violations | Constraint checks (passed) | Agent-to-agent coordination messages |
Failure modes and how to avoid them
Coordination thrashing
Symptom: Agents spend more time coordinating than working. Token consumption dominated by agent-to-agent messages, not actual task processing.
Causes:
- Too many agents for the problem complexity
- Peer-to-peer communication without locality constraints (O(N²) message explosion)
- Overly chatty coordination protocol (agents re-negotiate repeatedly)
Fix:
- Start with fewer agents, add only when bottlenecks prove it necessary
- Implement communication locality (agents only talk to neighbors)
- Use publish-subscribe instead of point-to-point for broadcasts
- Add coordination overhead budget (if >30% of time is coordination, simplify)
Emergent deadlocks
Symptom: Multi-agent workflow stalls indefinitely. No agent makes progress. No error thrown.
Causes:
- Circular dependencies (Agent A waits for B, B waits for C, C waits for A)
- Resource contention with no timeout or priority mechanism
- Consensus protocols that require unanimity with a failed agent
Fix:
- All inter-agent waits must have timeouts (never infinite waits)
- Implement deadlock detection (if no agent makes progress for N seconds, raise alert)
- Use resource allocation with priority or preemption
- Avoid consensus protocols requiring 100% agreement (use quorum instead)
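The first two fixes are a few lines each, sketched below: a bounded wait that surfaces the stall instead of hanging, and a watchdog that alerts when nothing has progressed for N seconds.

```python
import threading
import time

def bounded_wait(event: threading.Event, timeout_s: float, waiter: str, holder: str):
    """Rule 1 above: no inter-agent wait is ever infinite."""
    if not event.wait(timeout=timeout_s):
        raise TimeoutError(f"{waiter} timed out waiting on {holder}; breaking the cycle")

class ProgressWatchdog:
    """Rule 2: alert when no agent has reported progress for stall_s seconds."""
    def __init__(self, stall_s: float):
        self.stall_s = stall_s
        self._last = time.monotonic()

    def record_progress(self) -> None:
        self._last = time.monotonic()   # any agent calls this after each step

    def check(self) -> None:
        if time.monotonic() - self._last > self.stall_s:
            raise RuntimeError("possible deadlock: no agent progress")

# Usage: a worker waiting on a peer's result never blocks forever.
peer_done = threading.Event()
try:
    bounded_wait(peer_done, timeout_s=0.1, waiter="agent_a", holder="agent_b")
except TimeoutError as exc:
    print(exc)                          # surface the stall instead of hanging
```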
Context drift
Symptom: Agent outputs progressively diverge from original task. Final result doesn't answer the initial question.
Causes:
- Each agent in chain slightly misinterprets previous agent's output
- No agent has access to original task specification
- Compression loses critical details
Fix:
- All agents in workflow must receive original task specification (immutable context)
- Periodic "drift checks" where supervisor validates alignment with original goal
- Structured outputs between agents (JSON schemas, not natural language)
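The third fix in practice: a sketch of a structured handoff where downstream agents reject free-form prose. Field names here are illustrative.

```python
import json

# Hypothetical handoff contract between two agents.
HANDOFF_FIELDS = {"task_id": str, "summary": str, "confidence": float}

def parse_handoff(payload: str) -> dict:
    """Reject free-form prose between agents; require the agreed structure."""
    data = json.loads(payload)
    for field, typ in HANDOFF_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"handoff rejected: bad or missing '{field}'")
    return data

upstream = json.dumps({"task_id": "T-17",
                       "summary": "two anomalies in zone 4",
                       "confidence": 0.85})
handoff = parse_handoff(upstream)   # downstream agent consumes fields, not prose
```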
When to use multi-agent vs single-agent
Multi-agent systems are not universally superior. Decision framework:
Use single agent when:
- Problem is well-defined with clear steps
- All required context fits in single agent's context window
- Latency budget is tight (multi-agent coordination adds 2-5x latency)
- Token budget is constrained (multi-agent uses 1.5-3x tokens typically)
- Debugging complexity must be minimized
Use multi-agent when:
- Task naturally decomposes into specialized subtasks
- Different subtasks require different tools, models, or expertise
- Parallelization can reduce wall-clock time (despite higher token cost)
- Fault isolation is critical (agent failures should be contained)
- Solution quality is worth 2-3x cost increase
The incremental migration path
Don't rebuild your entire system as multi-agent overnight. Migration strategy:
1. Start with a single agent doing everything: Establish baseline performance (latency, tokens, accuracy)
2. Identify the clear bottleneck: Which subtask is slowest? Most error-prone? Most token-intensive?
3. Extract that subtask to a specialist agent: Implement the handoff, measure improvement
4. Add a second specialist only if the first proved value: Did latency/accuracy/cost improve?
5. Iterate until marginal benefit stops justifying marginal cost: Most systems stabilize at 3-5 agents
Implementation checklist
Before shipping a multi-agent system to production:
Architecture
- Pattern choice maps to problem structure (not chosen for novelty)
- Coordination protocol is battle-tested (contract net, market-based, or DCOP)
- Single points of failure are eliminated or have fallback
- Communication topology prevents O(N²) message explosion
Observability
- Every agent action is logged with timestamps and attribution
- Coordination overhead is measured as % of total execution time
- Token consumption is tracked per agent and per task
- Deadlock detection alerts if workflow stalls for >N seconds
- Context drift is measured (periodic comparison to original task spec)
Reliability
- All inter-agent waits have timeouts (never infinite waits)
- Failed agents trigger retries or fallback to simpler approach
- Circular dependencies are impossible by construction (or detected and broken)
- Resource contention has priority/preemption mechanism
Cost management
- Token budget per task is enforced (kill runaway workflows)
- Multi-agent overhead is justified by quality improvement (measured, not assumed)
- Single-agent fallback exists for cost-sensitive tasks
- Context compression reduces redundant token consumption
What's next: 2025 and beyond
Multi-agent systems are moving from research demos to production infrastructure. The gap between promise and reality is closing, but slowly. Key trends to watch:
- Smaller models with better coordination outperforming larger models with worse coordination: GPT-4o-mini beats GPT-4 on MultiAgentBench coordination tasks
- Graph topologies replacing star topologies: +8% performance in research scenarios, likely to become standard
- Cognitive planning adding small but consistent gains: +3% milestone achievement across all models
- Context bundles replacing raw history passing: 40-80% token savings with deterministic snapshots
Start simple. Measure relentlessly. Add complexity only when single-agent baselines prove insufficient. Multi-agent systems are powerful tools, not magic bullets.