Multi-Agent System Design Patterns for Production
ChatDev achieves 33.3% correctness. Logistics systems show 27% throughput gains. Pattern choice matters more than model capability. Here's what actually works in production multi-agent systems.
Jorgo Bardho
Founder, Thread Transfer
Multi-agent systems have evolved from academic curiosities to production infrastructure. ChatDev achieves 33.3% correctness on real programming tasks. AppWorld shows an 86.7% failure rate on cross-app workflows. Many multi-agent frameworks show minimal gains over single agents. Yet logistics systems demonstrate 27% throughput gains and 22% cost reduction. The pattern matters more than the promise. This is a breakdown of proven multi-agent architectures, the coordination protocols that actually work in production, and the benchmarks that separate hype from reality.
The five core multi-agent patterns
Most production multi-agent systems map to one of five coordination patterns. Each pattern solves a different class of problem. Picking the wrong pattern is the fastest way to ship a system that looks impressive in demos and falls apart under load.
1. Sequential orchestration (pipeline pattern)
Agents execute in a fixed, linear order. Agent A processes input, passes output to Agent B, which passes to Agent C. This is the multi-agent equivalent of Unix pipes.
When to use:
- Clear dependencies between steps
- Each agent performs a distinct transformation
- Output quality improves through progressive refinement
- Deterministic workflows where order matters
Production example:
Document processing pipeline: Agent 1 extracts text from PDFs, Agent 2 classifies document type, Agent 3 extracts structured data based on classification, Agent 4 validates against business rules. Each step depends on the previous step's output. No parallelization needed. Failure at any stage halts the pipeline with clear error context.
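A minimal sketch of the pattern in Python. The stage functions are toy stand-ins for the LLM-backed agents above (names and logic are illustrative); the fixed order and the halt-with-context behavior are the point.

```python
from typing import Any, Callable

class StageError(Exception):
    """Carries the failing stage's name so a halt has clear error context."""

def run_pipeline(stages: list[tuple[str, Callable[[Any], Any]]], payload: Any) -> Any:
    """Execute agents in a fixed, linear order; each consumes the previous output."""
    for name, agent in stages:
        try:
            payload = agent(payload)
        except Exception as exc:
            raise StageError(f"stage '{name}' failed: {exc}") from exc
    return payload

# Toy stand-ins for extract -> classify -> validate (illustrative only).
def extract(doc: str) -> dict:
    return {"text": doc.strip()}

def classify(d: dict) -> dict:
    d["type"] = "invoice" if "invoice" in d["text"].lower() else "other"
    return d

def validate(d: dict) -> dict:
    if d["type"] == "other":
        raise ValueError("no business rules for this document type")
    return d

print(run_pipeline([("extract", extract), ("classify", classify),
                    ("validate", validate)], "  Invoice #42  "))
```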
Failure modes:
- Bottlenecks when one agent is significantly slower than others
- No error recovery if mid-pipeline agent fails
- Cannot handle dynamic branching or conditional logic
2. Supervisor/coordinator pattern (hierarchical orchestration)
A central coordinator agent breaks down goals into subtasks and delegates them to worker agents. The supervisor manages task allocation, monitors progress, handles failures, and aggregates results. Microsoft's TaskWeaver and most LangGraph implementations follow this pattern.
When to use:
- Complex goals that decompose into independent subtasks
- Need dynamic task allocation based on agent availability or capability
- Require centralized monitoring and error handling
- Single source of truth for task state is critical
Production example:
Customer support ticket routing: Supervisor agent receives ticket, classifies urgency and category, assigns to specialist agents (billing, technical, account), monitors resolution time, escalates if SLA breach is imminent, aggregates response for customer. Worker agents focus on domain expertise. Supervisor handles coordination overhead.
Key implementation details:
- Supervisor must be significantly more reliable than workers (use simpler model or rule-based logic)
- Include timeout and retry logic in delegation layer
- Log all task assignments for debugging coordination issues
- Monitor supervisor as single point of failure
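A sketch of the delegation layer under those constraints. `billing_agent` is a hypothetical specialist; the timeout, retry, and assignment logging are the parts that matter.

```python
import concurrent.futures as cf
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("supervisor")

def delegate(worker, task, *, timeout_s: float = 10.0, retries: int = 2):
    """Assign a subtask to a worker with timeout + retry; log every assignment."""
    for attempt in range(1, retries + 2):
        log.info("assign %r -> %s (attempt %d)", task, worker.__name__, attempt)
        pool = cf.ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(worker, task).result(timeout=timeout_s)
        except Exception as exc:  # worker error or timeout
            log.warning("%s failed on %r: %s", worker.__name__, task, exc)
        finally:
            # Don't block on a hung worker; let it finish in the background.
            pool.shutdown(wait=False, cancel_futures=True)
    raise RuntimeError(f"{task!r} exhausted retries; supervisor escalates")

def billing_agent(ticket: str) -> str:  # hypothetical specialist
    return f"billing resolved: {ticket}"

print(delegate(billing_agent, "refund order #1234", timeout_s=5.0))
```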
3. Competitive pattern (evaluation-based selection)
Multiple agents independently solve the same problem. An evaluator agent (or human) selects the best solution based on criteria like accuracy, speed, cost, or creativity. This pattern trades compute cost for solution quality.
When to use:
- Problem has multiple valid solutions with varying quality
- Cost of wrong answer exceeds cost of redundant computation
- Evaluation criteria can be formalized or automated
- Latency budget allows parallel execution
Production example:
Code generation for critical infrastructure: Three agents generate implementations using different approaches (direct GPT-4, fine-tuned Codex, rule-based template + LLM refinement). Evaluator agent runs test suite on each, checks for security vulnerabilities, measures performance. Selects implementation with best test coverage and performance. Human reviews before deployment.
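In code, the pattern is parallel generation plus a single selection step. A sketch with toy generators and a toy scorer standing in for the test suite and security checks above:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical generators standing in for the three approaches above.
def direct_llm(spec: str) -> str:
    return f"def solve():\n    return '{spec}'"

def finetuned_model(spec: str) -> str:
    return f"def solve():\n    return '{spec}'.strip()"

def template_refined(spec: str) -> str:
    return "def solve():\n    raise NotImplementedError"

def score(impl: str) -> float:
    """Toy proxy for running tests and security scans on a candidate."""
    checks = ["def solve", "return '"]
    return sum(c in impl for c in checks)

def best_of_n(spec: str, generators: list[Callable[[str], str]]) -> str:
    with ThreadPoolExecutor() as pool:            # candidates run in parallel
        candidates = list(pool.map(lambda g: g(spec), generators))
    return max(candidates, key=score)             # evaluator selects the winner

print(best_of_n("parse config", [direct_llm, finetuned_model, template_refined]))
```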
Cost analysis:
- 3x inference cost minimum (N agents + evaluator)
- Justified when error cost is high (medical, financial, security domains)
- Not viable for high-volume, low-margin applications
4. Network pattern (peer-to-peer coordination)
No central coordinator. Agents communicate directly with each other to coordinate work. Each agent has autonomy, specialized tools, and direct communication channels to peers. OpenAI Swarm and CrewAI implement this pattern.
When to use:
- No natural hierarchy or single point of control
- Agents need to dynamically form coalitions based on task requirements
- Resilience to single-agent failure is critical
- Communication overhead is manageable (small number of agents)
Production example:
Distributed sensor network monitoring: Each sensor agent monitors a zone, communicates with neighboring agents, detects anomalies through consensus, triggers alerts when multiple agents agree on threat. No central coordinator means no single point of failure. Agents self-organize around detected events.
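A toy version of that consensus logic, assuming simple threshold votes and a bounded neighbor list (a sketch, not the deployed system):

```python
from dataclasses import dataclass, field

@dataclass
class SensorAgent:
    name: str
    neighbors: list["SensorAgent"] = field(default_factory=list)  # bounded radius
    votes: set[str] = field(default_factory=set)

    def observe(self, reading: float, threshold: float = 0.8) -> None:
        if reading > threshold:          # local anomaly detection
            self.votes.add(self.name)
            for peer in self.neighbors:  # talk to neighbors only, never broadcast
                peer.votes.add(self.name)

    def alert(self, quorum: int = 2) -> bool:
        """Trigger only when enough peers independently agree."""
        return len(self.votes) >= quorum

a, b, c = SensorAgent("a"), SensorAgent("b"), SensorAgent("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
a.observe(0.9)
b.observe(0.95)
print(b.alert())  # True: two agents agree, no central coordinator involved
```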
Challenges:
- Consensus protocols add latency (O(N²) messages for N agents in worst case)
- Debugging is hard (no centralized log of decisions)
- Emergent behavior can be unpredictable
- Requires sophisticated agent reasoning to avoid deadlocks
5. Blackboard pattern (shared knowledge base)
Agents communicate indirectly through a shared data structure (the "blackboard"). Agents read from and write to the blackboard. A control mechanism decides which agent acts next based on blackboard state. Originally from expert systems research, now returning via vector databases and shared context stores.
When to use:
- Problem requires iterative refinement with uncertain solution path
- Multiple agents contribute partial solutions that must be synthesized
- Agent execution order depends on emerging data, not fixed workflow
- Need audit trail of all agent contributions
Production example:
Financial fraud detection: Transaction data written to blackboard. Pattern-matching agent writes anomaly scores. Historical comparison agent adds context. ML model agent generates risk prediction. Rule-based agent checks compliance thresholds. Final decision synthesizes all contributions. Each agent can act when sufficient data is available, no fixed order required.
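A compact sketch of the pattern: a shared board, knowledge sources that declare what they need and what they produce, and a control loop that fires whichever agent can act on the current state. Agent names and scores are illustrative.

```python
class Blackboard:
    def __init__(self, data: dict):
        self.data = dict(data)
        self.log = []                          # audit trail of contributions

    def write(self, agent: str, key: str, value) -> None:
        self.data[key] = value
        self.log.append((agent, key))

def anomaly_scorer(bb):  bb.write("anomaly", "score", 0.7)
def history_agent(bb):   bb.write("history", "context", "3 similar txns this week")
def risk_model(bb):      bb.write("risk", "risk", 0.6 * bb.data["score"])

# (inputs required, outputs produced, agent) -- no fixed execution order.
SOURCES = [
    ({"txn"}, {"score"}, anomaly_scorer),
    ({"txn"}, {"context"}, history_agent),
    ({"score", "context"}, {"risk"}, risk_model),
]

def control_loop(bb: Blackboard) -> None:
    """Fire any agent whose inputs are on the board and outputs are not."""
    progressed = True
    while progressed:
        progressed = False
        for needs, adds, agent in SOURCES:
            if needs <= bb.data.keys() and not adds <= bb.data.keys():
                agent(bb)
                progressed = True

bb = Blackboard({"txn": {"amount": 980}})
control_loop(bb)
print(bb.data["risk"], bb.log)  # 0.42, plus who wrote what, in order
```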
Implementation with Thread Transfer:
Context bundles act as portable blackboards. Agent A writes analysis to bundle. Agent B reads bundle, adds findings. Agent C consumes full bundle for final decision. Bundles provide deterministic snapshots of multi-agent collaboration state, enabling debugging and replay. 40-80% token savings vs. passing full conversation history between agents.
Coordination protocols: what actually works
Patterns define structure. Protocols define how agents coordinate within that structure. Production data from industrial deployments shows which protocols are battle-tested.
Contract net protocol (47% of production systems)
Agent needing work broadcasts task announcement. Other agents bid based on capability and availability. Announcing agent selects best bid, awards contract. This is the most widely deployed coordination mechanism.
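The protocol in miniature (a sketch; real deployments add the timeout and re-auction logic noted in the gotchas below). Bid costs here are random stand-ins for real capability and load estimates.

```python
import random

class Contractor:
    """A worker that bids only on announced tasks it can actually do."""
    def __init__(self, name: str, skill: str):
        self.name, self.skill = name, skill

    def bid(self, task: dict):
        if task["kind"] != self.skill:
            return None                       # out of scope: no bid
        cost = random.uniform(1, 10)          # stand-in for load/capability estimate
        return (cost, self)

    def execute(self, task: dict) -> str:
        return f"{self.name} completed {task['kind']}"

def contract_net(task: dict, contractors: list[Contractor]) -> str:
    bids = [b for c in contractors if (b := c.bid(task))]  # announcement round
    if not bids:
        raise RuntimeError("no capable agent bid; re-announce or escalate")
    _, winner = min(bids, key=lambda b: b[0])              # award the best bid
    return winner.execute(task)                            # contract = commitment

workers = [Contractor("a", "ocr"), Contractor("b", "ocr"), Contractor("c", "nlp")]
print(contract_net({"kind": "ocr"}, workers))
```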
Why it dominates:
- Simple to implement and reason about
- Naturally handles agent heterogeneity and dynamic availability
- Scales to dozens of agents without modification
- Clear accountability (contract = explicit commitment)
Gotchas:
- Bidding adds latency (2-3 message round-trips before work starts)
- No built-in mechanism for complex multi-agent tasks requiring collaboration
- Winning agent might still fail to deliver (need timeout + re-auction logic)
Market-based mechanisms (29% of production systems)
Agents buy and sell resources or task allocations using virtual currency. Prices adjust based on supply and demand. This creates emergent task prioritization and load balancing without central control.
Use cases:
- Resource allocation in shared infrastructure (compute, API quota)
- Dynamic task prioritization when agent capacity fluctuates
- Systems where different agents have different operating costs
Implementation complexity:
- Requires virtual economy design (initial currency distribution, inflation control)
- Agents need bidding strategies (can they game the system?)
- Debugging is hard (price signals are indirect)
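The core mechanism fits in a few lines: prices rise while demand exceeds capacity, and agents drop out as the price passes their task's value. A toy simulation with assumed parameters:

```python
price, capacity = 1.0, 10                # virtual currency per call; calls per tick

def demand_at(price: float, task_values: list[float]) -> int:
    """Each agent requests a resource only while its task value exceeds the price."""
    return sum(value > price for value in task_values)

task_values = [0.5, 1.2, 1.8, 2.5, 3.0] * 4   # 20 agents with varied task values

for _ in range(30):
    excess = demand_at(price, task_values) - capacity
    price *= 1 + 0.1 * excess / capacity       # excess demand raises the price
print(round(price, 2), demand_at(price, task_values))  # oscillates near clearing price
```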
Distributed constraint optimization (18% of production systems)
Agents collaboratively find solutions that satisfy shared constraints while optimizing local and global objectives. Common in logistics, scheduling, and resource allocation problems.
When to use:
- Hard constraints that all agents must respect (regulatory, physical, budget)
- Local optimization by individual agents creates global suboptimality
- Centralized optimization is infeasible due to data privacy or scale
Reality check:
Requires specialized DCOP algorithms (DPOP, Max-Sum). Implementation complexity is high. Most teams should start with simpler protocols and move to DCOP only when constraint satisfaction is the core problem.
Benchmarking multi-agent systems: the 2025 landscape
Production viability requires measurement. Three benchmarks have emerged as standards for evaluating multi-agent coordination.
MultiAgentBench: coordination quality
Evaluates collaboration and competition across diverse scenarios. Measures milestone achievement, not just final task completion. Supports multiple coordination topologies (star, chain, tree, graph).
Key findings:
| Metric | Top performer | Result |
|---|---|---|
| Highest task score | GPT-4o-mini | Beats larger models on coordination |
| Best coordination protocol | Graph structure | +8% vs star topology |
| Cognitive planning impact | All models | +3% milestone achievement |
Surprising result: smaller models with better coordination outperform larger models with worse coordination. Pattern choice matters more than model capability for many multi-agent tasks.
REALM-Bench: real-world planning
14 progressively complex planning and scheduling problems. Tests multi-agent coordination, inter-agent dependencies, dynamic environment disruptions. Compares GPT-4o, Claude-3.7, DeepSeek-R1 across LangGraph, AutoGen, CrewAI, Swarm.
What it reveals:
- Framework choice significantly impacts success rate (20-30% variance)
- Dynamic disruptions (environment changes mid-execution) cause 40-60% failure rates
- Inter-agent dependencies are the hardest coordination challenge
AgentVerse: interaction paradigm testing
Broadest environment coverage: collaborative problem-solving, competitive games, creative tasks, realistic simulations. Tests different agent architectures and communication protocols.
Use for:
- Comparing fundamentally different multi-agent approaches
- Validating that coordination works across diverse task types
- Stress-testing communication protocols under varying agent counts
Production metrics that matter
Research benchmarks are necessary but insufficient. Production systems need operational metrics:
| Metric | What it reveals | Target |
|---|---|---|
| Coordination overhead | % of time spent on agent-to-agent communication vs actual work | <20% |
| Single-agent fallback rate | % of tasks where multi-agent system reverts to single agent | <10% |
| Coordination failure rate | % of tasks failing due to coordination issues, not task difficulty | <5% |
| Token efficiency | Tokens consumed per task vs single-agent baseline | <2x single agent |
| Latency multiplier | P95 latency vs single-agent baseline | <3x single agent |
If your multi-agent system uses 5x more tokens and takes 10x longer than a single agent for 20% better accuracy, the pattern is wrong for the problem.
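These metrics are cheap to compute from logs you should already have. A sketch of the two easiest checks, with made-up numbers for the targets above:

```python
def coordination_overhead(spans: list[tuple[str, float]]) -> float:
    """spans: (kind, seconds) with kind in {'work', 'coord'}."""
    total = sum(s for _, s in spans)
    return sum(s for kind, s in spans if kind == "coord") / total

def token_efficiency(multi_agent_tokens: int, single_agent_tokens: int) -> float:
    return multi_agent_tokens / single_agent_tokens

spans = [("work", 8.0), ("coord", 1.5), ("work", 4.0), ("coord", 0.5)]
assert coordination_overhead(spans) < 0.20       # within the <20% target
assert token_efficiency(14_000, 9_000) < 2.0     # within the <2x target
```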
Production case studies: what ships vs what demos
Warehouse automation: network pattern wins
Problem: coordinate picking, packing, transportation agents in real-time warehouse operations. 200+ robots, dynamic order flow, equipment failures.
Pattern chosen: Network (peer-to-peer)
Results after 6 months:
- +27% order fulfillment rate
- -22% operational costs
- No single point of failure (supervisor pattern would have failed during equipment outages)
- Agents self-organize around bottlenecks
Critical implementation detail:
Each agent has strictly limited communication radius (only talks to 5 nearest neighbors). This prevents O(N²) message explosion. Graph-structured communication proved essential for scaling beyond 50 agents.
Predictive maintenance: supervisor pattern delivers
Problem: sensor agents monitor equipment, maintenance scheduling agents plan interventions, historical analysis agents identify failure patterns.
Pattern chosen: Supervisor (hierarchical orchestration)
Results across industrial deployments:
- 30-40% reduction in unplanned downtime
- 20-25% reduction in maintenance costs
- Centralized supervisor enables compliance auditing (critical for regulated industries)
Why supervisor worked:
Maintenance scheduling requires global optimization (can't have 5 agents all scheduling downtime simultaneously). Supervisor maintains global equipment state and resource allocation. Worker agents provide domain expertise without coordination overhead.
Document processing: sequential pattern is sufficient
Problem: legal document intake pipeline processing 10,000+ documents/day. Extract text, classify, extract entities, validate, route to human reviewers.
Pattern chosen: Sequential (pipeline)
Why not more complex:
- Each step has clear input/output contract
- No dynamic branching needed (classification determines validation rules, not routing)
- Monitoring and debugging are trivial (know exactly which stage failed)
- Horizontal scaling is simple (run more pipeline instances in parallel)
Key lesson:
Many teams over-engineer multi-agent coordination. If your workflow is deterministic and linear, sequential orchestration is sufficient. Complex patterns add latency, token overhead, and debugging complexity without benefit.
Context management in multi-agent systems
Multi-agent coordination creates context explosion. Agent A generates analysis. Agent B needs that analysis plus the original input. Agent C needs A's analysis, B's output, and the original input. Naive approach: concatenate everything. Result: token costs that compound with every hop.
The context bundle approach
Thread Transfer bundles compress multi-agent collaboration context into deterministic snapshots. Instead of passing full conversation histories between agents, a bundle distills decisions, key findings, and the context the next agent actually needs.
Example: three-agent analysis workflow
Naive approach (full history):
- Agent A receives input: 2,000 tokens
- Agent B receives input + Agent A output: 5,500 tokens
- Agent C receives input + Agent A output + Agent B output: 11,200 tokens
- Total: 18,700 tokens
Bundle approach:
- Agent A receives input: 2,000 tokens
- Agent A writes findings to bundle: 800 tokens (compressed)
- Agent B receives bundle: 2,800 tokens
- Agent B writes findings to bundle: 1,200 tokens cumulative (A's 800 plus 400 of new findings)
- Agent C receives bundle: 3,200 tokens
- Total: 8,000 tokens (-57% vs naive)
Bundles also enable audit trails. Each agent's contribution is timestamped and attributed. When multi-agent workflow produces wrong output, you can trace which agent introduced the error.
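A generic sketch of the idea (this is not Thread Transfer's actual API): agents append compressed, attributed findings to a shared bundle, and each downstream agent receives the task spec plus distilled notes instead of the full transcript.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Bundle:
    task_spec: str                      # immutable original task, always preserved
    entries: list[dict] = field(default_factory=list)

    def write(self, agent: str, findings: str) -> None:
        self.entries.append({
            "agent": agent,             # attribution for the audit trail
            "at": datetime.now(timezone.utc).isoformat(),
            "findings": findings,       # compressed summary, not the raw transcript
        })

    def render(self) -> str:
        """What the next agent receives: task spec + distilled findings only."""
        notes = "\n".join(f"[{e['agent']}] {e['findings']}" for e in self.entries)
        return f"TASK: {self.task_spec}\n{notes}"

bundle = Bundle("Assess vendor risk for ACME")
bundle.write("analyst", "3 late shipments in Q2; credit score stable")
bundle.write("compliance", "no sanctions hits; one open audit finding")
print(bundle.render())
```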
When to compress, when to preserve
Not all context should be compressed. Compression rules for multi-agent systems:
| Preserve in full | Compress aggressively | Omit entirely |
|---|---|---|
| Final decisions | Intermediate reasoning | Exploratory dead ends |
| Key findings | Supporting evidence | Redundant confirmations |
| Action items | Background context | Conversational filler |
| Constraint violations | Constraint checks (passed) | Agent-to-agent coordination messages |
Failure modes and how to avoid them
Coordination thrashing
Symptom: Agents spend more time coordinating than working. Token consumption dominated by agent-to-agent messages, not actual task processing.
Causes:
- Too many agents for the problem complexity
- Peer-to-peer communication without locality constraints (O(N²) message explosion)
- Overly chatty coordination protocol (agents re-negotiate repeatedly)
Fix:
- Start with fewer agents, add only when bottlenecks prove it necessary
- Implement communication locality (agents only talk to neighbors)
- Use publish-subscribe instead of point-to-point for broadcasts
- Add coordination overhead budget (if >30% of time is coordination, simplify)
Emergent deadlocks
Symptom: Multi-agent workflow stalls indefinitely. No agent makes progress. No error thrown.
Causes:
- Circular dependencies (Agent A waits for B, B waits for C, C waits for A)
- Resource contention with no timeout or priority mechanism
- Consensus protocols that require unanimity with a failed agent
Fix:
- All inter-agent waits must have timeouts (never infinite waits)
- Implement deadlock detection (if no agent makes progress for N seconds, raise alert)
- Use resource allocation with priority or preemption
- Avoid consensus protocols requiring 100% agreement (use quorum instead)
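The first two fixes are a few lines each, sketched below: a bounded wait that surfaces the stall instead of hanging, and a watchdog that alerts when nothing has progressed for N seconds.

```python
import threading
import time

def bounded_wait(event: threading.Event, timeout_s: float, waiter: str, holder: str):
    """Rule 1 above: no inter-agent wait is ever infinite."""
    if not event.wait(timeout=timeout_s):
        raise TimeoutError(f"{waiter} timed out waiting on {holder}; breaking the cycle")

class ProgressWatchdog:
    """Rule 2: alert when no agent has reported progress for stall_s seconds."""
    def __init__(self, stall_s: float):
        self.stall_s = stall_s
        self._last = time.monotonic()

    def record_progress(self) -> None:
        self._last = time.monotonic()   # any agent calls this after each step

    def check(self) -> None:
        if time.monotonic() - self._last > self.stall_s:
            raise RuntimeError("possible deadlock: no agent progress")

# Usage: a worker waiting on a peer's result never blocks forever.
peer_done = threading.Event()
try:
    bounded_wait(peer_done, timeout_s=0.1, waiter="agent_a", holder="agent_b")
except TimeoutError as exc:
    print(exc)                          # surface the stall instead of hanging
```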
Context drift
Symptom: Agent outputs progressively diverge from original task. Final result doesn't answer the initial question.
Causes:
- Each agent in chain slightly misinterprets previous agent's output
- No agent has access to original task specification
- Compression loses critical details
Fix:
- All agents in workflow must receive original task specification (immutable context)
- Periodic "drift checks" where supervisor validates alignment with original goal
- Structured outputs between agents (JSON schemas, not natural language)
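The third fix in practice: a sketch of a structured handoff where downstream agents reject free-form prose. Field names here are illustrative.

```python
import json

# Hypothetical handoff contract between two agents.
HANDOFF_FIELDS = {"task_id": str, "summary": str, "confidence": float}

def parse_handoff(payload: str) -> dict:
    """Reject free-form prose between agents; require the agreed structure."""
    data = json.loads(payload)
    for field, typ in HANDOFF_FIELDS.items():
        if not isinstance(data.get(field), typ):
            raise ValueError(f"handoff rejected: bad or missing '{field}'")
    return data

upstream = json.dumps({"task_id": "T-17",
                       "summary": "two anomalies in zone 4",
                       "confidence": 0.85})
handoff = parse_handoff(upstream)   # downstream agent consumes fields, not prose
```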
When to use multi-agent vs single-agent
Multi-agent systems are not universally superior. Decision framework:
Use single agent when:
- Problem is well-defined with clear steps
- All required context fits in single agent's context window
- Latency budget is tight (multi-agent coordination adds 2-5x latency)
- Token budget is constrained (multi-agent uses 1.5-3x tokens typically)
- Debugging complexity must be minimized
Use multi-agent when:
- Task naturally decomposes into specialized subtasks
- Different subtasks require different tools, models, or expertise
- Parallelization can reduce wall-clock time (despite higher token cost)
- Fault isolation is critical (agent failures should be contained)
- Solution quality is worth 2-3x cost increase
The incremental migration path
Don't rebuild your entire system as multi-agent overnight. Migration strategy:
1. Start with a single agent doing everything: Establish baseline performance (latency, tokens, accuracy)
2. Identify the clear bottleneck: Which subtask is slowest? Most error-prone? Most token-intensive?
3. Extract that subtask to a specialist agent: Implement the handoff, measure improvement
4. Add a second specialist only if the first proved value: Did latency/accuracy/cost improve?
5. Iterate until marginal benefit stops justifying marginal cost: Most systems stabilize at 3-5 agents
Implementation checklist
Before shipping a multi-agent system to production:
Architecture
- Pattern choice maps to problem structure (not chosen for novelty)
- Coordination protocol is battle-tested (contract net, market-based, or DCOP)
- Single points of failure are eliminated or have fallback
- Communication topology prevents O(N²) message explosion
Observability
- Every agent action is logged with timestamps and attribution
- Coordination overhead is measured as % of total execution time
- Token consumption is tracked per agent and per task
- Deadlock detection alerts if workflow stalls for >N seconds
- Context drift is measured (periodic comparison to original task spec)
Reliability
- All inter-agent waits have timeouts (never infinite waits)
- Failed agents trigger retries or fallback to simpler approach
- Circular dependencies are impossible by construction (or detected and broken)
- Resource contention has priority/preemption mechanism
Cost management
- Token budget per task is enforced (kill runaway workflows)
- Multi-agent overhead is justified by quality improvement (measured, not assumed)
- Single-agent fallback exists for cost-sensitive tasks
- Context compression reduces redundant token consumption
What's next: 2025 and beyond
Multi-agent systems are moving from research demos to production infrastructure. The gap between promise and reality is closing, but slowly. Key trends to watch:
- Smaller models with better coordination outperforming larger models with worse coordination: GPT-4o-mini beats GPT-4 on MultiAgentBench coordination tasks
- Graph topologies replacing star topologies: +8% performance in research scenarios, likely to become standard
- Cognitive planning adding small but consistent gains: +3% milestone achievement across all models
- Context bundles replacing raw history passing: 40-80% token savings with deterministic snapshots
Start simple. Measure relentlessly. Add complexity only when single-agent baselines prove insufficient. Multi-agent systems are powerful tools, not magic bullets.