Multi-Agent Systems: Coordinating AI Agents for Complex Tasks
System Design Deep Dive — #5 of 20 | Part of a 20-post series covering the most critical system design topics.
ChatGPT can write code. But can it research a problem, write the implementation, review its own code for bugs, write tests, fix what fails, and document the result? Not reliably in one shot. This is why companies like Cognition (Devin), Factory AI, and Microsoft (AutoGen) are building multi-agent systems -- specialized AI agents that collaborate like a human engineering team.
TL;DR: Multi-agent systems divide complex tasks among specialized agents with distinct roles, tools, and evaluation criteria. The orchestration layer (how agents coordinate) determines system effectiveness more than individual agent quality. Start with two agents, add cross-validation, and only scale when you've proven the pattern works.
The Problem
Complex tasks overwhelm single agents. When one agent tries to be researcher, coder, reviewer, and tester simultaneously, it loses focus. The context window gets polluted, reasoning quality drops, and errors compound. Each role requires different expertise, different context, and different evaluation criteria.
Multi-agent systems mirror how human teams operate: specialize roles, coordinate handoffs, and cross-check each other's work.
How Multi-Agent Systems Work
Agent Specialization
Each agent is optimized for a specific role with tailored system prompts, tools, and evaluation criteria:
```python
agents = {
    "researcher": Agent(
        role="Research and gather information",
        tools=["web_search", "document_reader"],
        model="gpt-4o",
    ),
    "coder": Agent(
        role="Write and debug code",
        tools=["code_executor", "file_writer"],
        model="gpt-4o",
    ),
    "reviewer": Agent(
        role="Review code for bugs and improvements",
        tools=["code_reader", "linter"],
        model="gpt-4o",
    ),
}
```
Specialized agents consistently outperform generalist agents because they carry less irrelevant context and their prompts are focused on a single competency.
Orchestration Layer
Someone needs to be the project manager. The orchestrator -- often another agent -- coordinates work:
Task decomposition: break a complex task into subtasks with clear inputs/outputs
Dependency graph: determine what runs in parallel vs. sequentially
Routing: match each subtask to the right specialized agent
Conflict resolution: when agents disagree, the orchestrator decides (or escalates)
Here's what a real orchestrator loop looks like:
```python
import asyncio

async def orchestrate(task: str, agents: dict, max_rounds: int = 10):
    """Core orchestration loop: decompose, delegate, validate, repeat."""
    plan = await agents["planner"].run(
        f"Break this into subtasks with dependencies: {task}"
    )
    # plan = [{"id": 1, "agent": "researcher", "input": "...", "depends_on": []},
    #         {"id": 2, "agent": "coder", "input": "...", "depends_on": [1]}, ...]

    results = {}  # task_id -> output
    for _ in range(max_rounds):
        # Find tasks whose dependencies are all satisfied
        ready = [t for t in plan
                 if t["id"] not in results
                 and all(d in results for d in t["depends_on"])]
        if not ready:
            break  # All tasks complete

        # Run independent tasks in parallel
        parallel_results = await asyncio.gather(*[
            agents[t["agent"]].run(
                t["input"],
                context={did: results[did] for did in t["depends_on"]},
            )
            for t in ready
        ])

        for task_spec, result in zip(ready, parallel_results):
            # Validate output before accepting
            validation = await agents["reviewer"].run(
                f"Validate this output for task '{task_spec['input']}': {result}"
            )
            if validation.approved:
                results[task_spec["id"]] = result
            else:
                # Retry with feedback: fold the reviewer's critique into the
                # task input so the next round re-runs it with the critique
                task_spec["input"] += f"\nFeedback: {validation.reason}"

    return results
```
The key insight: the orchestrator doesn't do the work -- it manages the dependency graph and validation loop. Notice how failed validations feed the reviewer's critique back into the task for another attempt, creating a self-correcting cycle. This pattern (plan → execute → validate → retry) is the backbone of every production multi-agent system I've seen work reliably.
Communication Protocols
Agents need structured ways to exchange information. The protocol choice determines how tightly coupled agents are, how you debug failures, and whether the system can scale beyond 3-4 agents.
| Communication Pattern | Latency | Coupling | Best For |
|---|---|---|---|
| Message passing | Medium | Loose | Async workflows, event-driven |
| Shared memory | Low | Tight | Fast iteration, small teams |
| Blackboard | Medium | Medium | Knowledge accumulation |
| Function calling | Low | Tight | Direct delegation |
Message passing is the most common pattern in production. Each agent sends structured messages with a defined schema:
```python
from dataclasses import dataclass
import time

@dataclass
class AgentMessage:
    sender: str           # "researcher"
    recipient: str        # "coder" or "orchestrator"
    msg_type: str         # "result", "error", "clarification_needed"
    content: dict         # The actual payload
    parent_task_id: str   # Links back to the orchestrator's plan
    timestamp: float

# Researcher sends findings to the orchestrator
msg = AgentMessage(
    sender="researcher",
    recipient="orchestrator",
    msg_type="result",
    content={
        "findings": "Redis supports sorted sets for leaderboards...",
        "confidence": 0.92,
        "sources": ["redis.io/docs/data-types/sorted-sets/"],
    },
    parent_task_id="task-001",
    timestamp=time.time(),
)
```
Shared memory (also called a "scratchpad") works better for tight iteration loops where agents need to see each other's work in real time -- think of it as a shared Google Doc. AutoGen and CrewAI both support this pattern. The tradeoff: it creates implicit coupling, and debugging becomes harder because any agent can modify the shared state at any time.
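A minimal sketch of such a scratchpad, assuming a single asyncio event loop (the `Scratchpad` class and its method names are hypothetical, not any framework's actual API):

```python
import asyncio

class Scratchpad:
    """Minimal shared scratchpad: every agent sees every write immediately."""
    def __init__(self):
        self._entries = []           # append-only list of (agent, text)
        self._lock = asyncio.Lock()  # serialize concurrent writes

    async def write(self, agent_id: str, text: str):
        async with self._lock:
            self._entries.append((agent_id, text))

    def read_all(self) -> str:
        # Render the whole pad as context for the next agent's prompt
        return "\n".join(f"[{agent}] {text}" for agent, text in self._entries)
```

Every agent prompt gets `read_all()` prepended -- which is exactly why debugging is hard: any entry can influence any later agent.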
Blackboard architecture is the hybrid -- a central knowledge store that agents read from and write to, but with structured rules about who can update which sections. This is how MetaGPT's SOP-driven approach works: the researcher writes to the "research" section, the coder reads from it and writes to the "code" section, and the reviewer reads both.
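A toy version of those structured write rules might look like this (the `Blackboard` class and its section names are illustrative, not MetaGPT's actual implementation):

```python
class Blackboard:
    """Central knowledge store with per-section write permissions."""
    # Which agent may write to which section (a sketch; adapt to your roles)
    WRITE_RULES = {"research": "researcher", "code": "coder", "review": "reviewer"}

    def __init__(self):
        self.sections = {name: [] for name in self.WRITE_RULES}

    def write(self, agent_id: str, section: str, entry: str):
        if self.WRITE_RULES.get(section) != agent_id:
            raise PermissionError(f"{agent_id} may not write to '{section}'")
        self.sections[section].append(entry)

    def read(self, section: str) -> list:
        # Reads are open: the coder reads "research", the reviewer reads both
        return list(self.sections[section])
```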
Consensus and Validation
One of the most powerful patterns in multi-agent systems is cross-validation. Multiple agents check each other's work:
Debate: two agents argue opposing positions, and a judge agent decides
Voting: multiple agents independently solve the same problem, and the majority answer wins
Hierarchical review: a senior agent reviews and approves junior agent output
This significantly reduces errors compared to a single agent working alone.
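Voting is the simplest of these patterns to sketch: run N agents on the same problem independently, then take the majority answer (this assumes answers are normalized into comparable strings, which in practice takes its own prompt engineering):

```python
from collections import Counter

def majority_vote(answers: list) -> tuple:
    """N agents answer independently; return the majority answer
    and the fraction of agents that agreed with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)
```

With three agents answering `["42", "42", "41"]`, the majority answer is `"42"` with two-thirds agreement -- a low agreement fraction is itself a useful signal to escalate.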
State Management
Tracking the overall state across multiple agents is the hardest operational challenge. You need to know which agent did what, when, and why -- and handle situations where agents produce conflicting results or one agent fails mid-task.
Here's a practical state manager that handles the core problems -- concurrency, conflict detection, and rollback:
```python
import asyncio
import time
from typing import Any

class AgentStateManager:
    def __init__(self):
        self.state = {}    # Current shared state
        self.history = []  # Append-only log of all changes
        self.locks = {}    # Per-key locks for write safety

    async def update(self, agent_id: str, key: str, value: Any):
        """Write to shared state under a per-key lock."""
        async with self.locks.setdefault(key, asyncio.Lock()):
            old_value = self.state.get(key)
            self.history.append({
                "agent": agent_id,
                "key": key,
                "old": old_value,
                "new": value,
                "timestamp": time.time(),
            })
            self.state[key] = value

    def rollback_agent(self, agent_id: str):
        """Undo all changes by a specific agent (reverse order)."""
        agent_changes = [h for h in self.history if h["agent"] == agent_id]
        for change in reversed(agent_changes):
            self.state[change["key"]] = change["old"]
            self.history.remove(change)

    def get_agent_contributions(self, agent_id: str) -> list:
        """Audit trail: what did this agent change and when?"""
        return [h for h in self.history if h["agent"] == agent_id]
```
Three things make or break state management in multi-agent systems:
Append-only history -- Never overwrite without logging. When something goes wrong (and it will), the history log is how you debug which agent produced the bad output and what state they saw when they made that decision.
Per-agent rollback -- If the reviewer rejects the coder's output, you need to undo the coder's state changes without affecting the researcher's contributions. This is why the history tracks agent_id per change.
Token budget tracking -- Multi-agent systems can burn through API credits fast. Track cumulative token usage per agent and set hard limits. A runaway researcher agent doing infinite web searches at $0.01 per call adds up when it runs 500 iterations.
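A minimal budget tracker along these lines -- the limits are hypothetical, and `charge` would be wired into wherever you read token counts off API responses:

```python
class TokenBudget:
    """Hard per-agent token limits; raises instead of letting an agent run away."""
    def __init__(self, limits: dict):
        self.limits = limits                   # agent_id -> max tokens
        self.used = {a: 0 for a in limits}

    def charge(self, agent_id: str, tokens: int):
        self.used[agent_id] += tokens
        if self.used[agent_id] > self.limits[agent_id]:
            raise RuntimeError(
                f"{agent_id} exceeded its {self.limits[agent_id]}-token budget"
            )
```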
When to Use Multi-Agent Systems
Multi-agent systems add real complexity. Use them when:
The task genuinely requires multiple distinct competencies
A single agent's context window can't hold all the needed information
Cross-validation would meaningfully improve output quality
Sub-tasks can be parallelized for speed
Don't use them for tasks a single agent handles well. Start with one agent, identify where it fails, and split only those responsibilities.
5 Hidden Gotchas That Will Bite You in Production
Multi-agent systems are the new frontier — and they multiply every single-agent failure mode by the number of agents. Andrew Ng has called agentic workflows one of the most important trends in AI — but they're also the most operationally complex:
1. Agent Deadlock
Your Researcher agent calls the Coder agent: "Implement the solution from my research." The Coder agent calls back to the Researcher: "I need more details before I can implement." Both agents wait for each other indefinitely, consuming tokens on every "waiting" message. This is the distributed systems deadlock problem — but with LLM calls at $0.01+ each. A 2-hour deadlock loop between two GPT-4 agents costs ~$50 in wasted tokens.
Fix: Implement timeouts on every inter-agent call (30-60 seconds). The orchestrator monitors agent-to-agent call graphs and detects cycles. On timeout, the orchestrator forces resolution: either provides a default response or escalates to a human. Never allow bilateral agent-to-agent calls — route all communication through the orchestrator.
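A sketch of that timeout wrapper, assuming each agent exposes an awaitable `run` method (a stand-in for whatever call signature your framework uses):

```python
import asyncio

async def call_with_timeout(agent, prompt: str, timeout_s: float = 45.0, default=None):
    """Wrap every inter-agent call; on timeout, return a default or escalate."""
    try:
        return await asyncio.wait_for(agent.run(prompt), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Orchestrator forces resolution instead of letting the pair deadlock
        return default if default is not None else {"status": "timeout", "escalate": True}
```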
2. Conflicting Actions
The Researcher agent edits report.md to add findings. The Coder agent simultaneously edits report.md to add code examples. Neither knows about the other's changes. The last write wins — and one agent's work is silently lost. This is the classic concurrent write problem, but with the added complexity that agents can't detect or resolve merge conflicts.
Fix: Resource-level locks: only one agent can hold a write lock on a file at a time. The orchestrator manages the lock table. Or use a turn-based architecture: agents take turns in a defined sequence (Research → Code → Review), passing artifacts forward like a relay race. For shared resources, use append-only semantics — agents add to a shared scratchpad rather than editing each other's work.
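The lock table can be sketched in a few lines (class and method names are illustrative; in production the orchestrator would own this table and agents would request locks through it):

```python
class ResourceLockTable:
    """Orchestrator-managed write locks: one writer per resource at a time."""
    def __init__(self):
        self.holders = {}  # resource -> agent_id currently holding the lock

    def acquire(self, agent_id: str, resource: str) -> bool:
        holder = self.holders.get(resource)
        if holder is None or holder == agent_id:
            self.holders[resource] = agent_id
            return True
        return False  # caller waits, or the orchestrator re-sequences the task

    def release(self, agent_id: str, resource: str):
        if self.holders.get(resource) == agent_id:
            del self.holders[resource]
```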
3. Cost Amplification
The orchestrator routes a task to 3 specialist agents. Each agent makes 4 tool calls (each tool call includes the full conversation context), and each tool call triggers a sub-agent for validation. One user request → 3 agents × 4 tool calls = 12 primary LLM calls, plus 12 validation calls — 24 in total. At 5,000 tokens per call, that's 120,000 tokens for one user request. Now multiply by 10,000 daily users. The cost grows not linearly with the number of agents, but multiplicatively with the depth of agent delegation.
Fix: Budget propagation: the orchestrator allocates a token budget per request. Each agent receives a fraction and must operate within it. Set depth limits (max 2 levels of agent delegation). Cache tool call results so repeated queries don't trigger new LLM calls. Use cheaper models for routine sub-tasks (GPT-4o-mini for validation, GPT-4o for reasoning).
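One way to sketch budget propagation plus a depth limit -- the `run(prompt, max_tokens=...)` signature returning `(output, tokens_spent)` is an assumption about your agent wrapper, not a real library API:

```python
import asyncio

MAX_DEPTH = 2  # hard cap on levels of agent delegation (hypothetical policy)

async def delegate(agents: dict, agent_id: str, prompt: str,
                   budget_tokens: int, depth: int = 0):
    """Each delegation hop spends from the remaining budget and checks depth."""
    if depth > MAX_DEPTH:
        raise RuntimeError("delegation depth limit exceeded")
    if budget_tokens <= 0:
        raise RuntimeError("token budget exhausted before call")
    # Assumed wrapper: caps the call at the remaining budget,
    # returns the output plus what the call actually cost
    output, spent = await agents[agent_id].run(prompt, max_tokens=budget_tokens)
    return output, budget_tokens - spent
```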
4. Blame Attribution
The multi-agent system produces a bug report with an incorrect root cause analysis. The workflow: Researcher gathered logs → Analyzer identified the wrong component → Coder proposed a fix for the wrong component → Reviewer approved it. Who made the mistake? Without structured per-agent logging, debugging requires reading through 50+ LLM interactions across 4 agents.
Fix: Every agent logs its inputs, reasoning, tool calls, and outputs as structured events with a shared trace_id. Use OpenTelemetry-style spans: each agent call is a span with parent-child relationships. Build a trace viewer (Langfuse, Arize Phoenix, or LangSmith) that shows the full decision tree. When output is wrong, trace backward from the output to the first agent that deviated.
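A minimal span format along these lines -- a hand-rolled stand-in for a real OpenTelemetry SDK, with hypothetical field names:

```python
import time
import uuid

def make_span(trace_id: str, agent: str, parent_span, inputs: dict, output: str) -> dict:
    """One structured event per agent call, linked by trace_id and parent_span."""
    return {
        "trace_id": trace_id,              # shared across the whole request
        "span_id": uuid.uuid4().hex[:8],   # unique per agent call
        "parent_span": parent_span,        # None for the root orchestrator span
        "agent": agent,
        "inputs": inputs,
        "output": output,
        "ts": time.time(),
    }
```

To debug a bad output, filter spans by `trace_id` and follow `parent_span` links backward until you hit the first agent whose output deviated.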
5. State Desynchronization
Agent A reads a config file at 10:00:01. Agent B modifies the config file at 10:00:02. Agent A makes a decision based on the old config at 10:00:03. Agent A's decision is logically correct based on what it "saw" — but it's based on stale state. The result is contradictory actions: Agent A acts as if feature X is disabled, Agent B acts as if feature X is enabled.
Fix: Shared state store with version vectors: before acting, each agent reads the latest state version. If the version has changed since the agent last read, it must re-read and re-plan. Use a shared "blackboard" pattern: all agents read and write to a central state store with optimistic concurrency control. The orchestrator validates that agent actions are consistent with the current state before executing them.
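A sketch of per-key version checks -- a simplified form of optimistic concurrency control (real systems would use version vectors or a database's compare-and-set; the `VersionedStore` name is illustrative):

```python
class VersionedStore:
    """Shared state where writers must prove they read the latest version."""
    def __init__(self):
        self.data = {}  # key -> (version, value)

    def read(self, key):
        return self.data.get(key, (0, None))  # (version, value)

    def write(self, key, value, expected_version: int):
        current_version, _ = self.data.get(key, (0, None))
        if expected_version != current_version:
            # Stale read detected: the caller must re-read and re-plan
            raise ValueError(f"version conflict on '{key}': "
                             f"read v{expected_version}, now v{current_version}")
        self.data[key] = (current_version + 1, value)
```

Agent A's 10:00:03 write would fail here, because Agent B's 10:00:02 write bumped the version past what A read at 10:00:01.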
Common Design-Time Mistakes
Those gotchas emerge when agents interact. These design mistakes happen before a single agent call is made — during the architecture phase — and they determine whether your multi-agent system scales or collapses under its own complexity.
Starting with too many agents
The team designs a 6-agent orchestration pipeline before validating that 2 agents outperform 1. Each additional agent adds latency (sequential LLM calls), cost (more tokens), and debugging complexity (more interactions to trace). Start with a single agent. Add a second only when you have evidence (not intuition) that task decomposition improves output quality. Validate with your eval suite before adding agents.
No shared context management
Agent A researches a topic and produces findings. Agent B starts coding — but can't see what Agent A found because there's no shared workspace. Agent B re-researches the same topic, wasting time and tokens. Design a shared scratchpad or context store that all agents can read from and write to. Every agent's output should be immediately visible to every other agent.
No circuit breaker for runaway costs
A multi-agent pipeline processes a user request. Due to a reasoning loop, Agent B calls Agent C 47 times. Total cost for one request: $8. Without a per-request cost ceiling, you discover this from your monthly invoice. Implement hard spending caps per request (max_tokens_per_request), per user (daily_budget_per_user), and per pipeline run. Kill the pipeline and return a graceful fallback when the ceiling is reached.
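A minimal per-request circuit breaker for this (the 50,000-token ceiling is an arbitrary placeholder, and the class name is hypothetical):

```python
class CostCircuitBreaker:
    """Trip when one request exceeds its spend ceiling."""
    def __init__(self, max_tokens_per_request: int = 50_000):
        self.max_tokens = max_tokens_per_request
        self.spent = 0

    def charge(self, tokens: int) -> bool:
        """Record spend; returns False once the ceiling is hit,
        at which point the pipeline should stop and return a graceful fallback."""
        self.spent += tokens
        return self.spent <= self.max_tokens
```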
Evaluating agents individually, not end-to-end
Each agent passes its individual eval: the Researcher finds relevant info 90% of the time, the Coder produces working code 85% of the time, the Reviewer catches 80% of issues. But end-to-end pipeline quality is 90% × 85% × 80% = 61.2%. Individual quality doesn't compound — it degrades multiplicatively. Build end-to-end evals that measure final output quality against human baselines.
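The degradation is easy to compute -- end-to-end quality is the product of stage pass rates (this assumes stage failures are independent, which is itself optimistic):

```python
def pipeline_quality(stage_pass_rates: list) -> float:
    """End-to-end quality of a sequential pipeline: the product of stages."""
    quality = 1.0
    for rate in stage_pass_rates:
        quality *= rate
    return quality

# Researcher 90%, Coder 85%, Reviewer 80%
print(round(pipeline_quality([0.90, 0.85, 0.80]), 3))  # 0.612
```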
Tightly coupled agent interfaces
Agent A's output format changes slightly (adds a new field). Agent B can't parse it. The entire pipeline breaks. Design agent interfaces as explicit contracts (JSON schemas or Protocol Buffers). Version them. Test backward compatibility in CI. Agents should be independently deployable — like microservices, not monolith modules.
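A bare-bones contract check at the message boundary (the field names and version policy are illustrative; JSON Schema or Protocol Buffers give you the same thing with real tooling and codegen):

```python
REQUIRED_FIELDS = {"sender": str, "msg_type": str, "content": dict, "schema_version": int}

def validate_message(msg: dict) -> dict:
    """Reject malformed agent output at the boundary
    instead of letting it propagate downstream."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(msg.get(field), expected_type):
            raise TypeError(f"contract violation: '{field}' must be {expected_type.__name__}")
    if msg["schema_version"] != 1:
        raise ValueError("unsupported schema version")  # version your contracts
    return msg
```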
Multi-Agent Frameworks
| Framework | Architecture | Key Strength | Maturity |
|---|---|---|---|
| LangGraph | Graph-based workflows | Flexible state machines | High |
| AutoGen | Conversational | Multi-turn agent chat | High |
| CrewAI | Role-based teams | Simple mental model | Medium |
| MetaGPT | Software team simulation | SOP-driven coordination | Medium |
| Swarm (OpenAI) | Lightweight handoffs | Minimal orchestration | Experimental/educational |
Key Takeaways
Multi-agent systems specialize, coordinate, and cross-check -- just like human teams
The orchestration layer determines system effectiveness more than individual agent quality
Start with two agents with clear roles before scaling to larger systems
Cross-validation (debate, voting, review) can meaningfully reduce errors compared to single agents -- research on multi-agent debate (e.g. Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate") showed that agent discussion improves accuracy on reasoning benchmarks
State management across agents is the hardest engineering problem -- invest early
Monitor per-agent costs; multi-agent systems can 3-5x your LLM spend if uncontrolled
🎯 Real-World Decision: What Would You Do?
You're building an automated code review system. A PR comes in with 500 lines of changes across 8 files. You need to check for bugs, security vulnerabilities, performance issues, test coverage, and style compliance.
Option A: One agent reviews everything with a comprehensive prompt
Option B: 5 specialized agents (bug hunter, security scanner, perf reviewer, test critic, style checker) running in parallel, orchestrator merges feedback
Option C: 2 agents — one reviews code, the other reviews the first agent's feedback for false positives. Then a human reviews the final output.
Option B sounds impressive but costs 5x more and often produces conflicting feedback. Option C is the sweet spot — cross-validation catches the worst false positives, total cost is 2x not 5x, and the human final review builds trust. Start with 2 agents, prove value, then split roles. What would you build?
Quick Reference Card
Bookmark this — multi-agent system decisions at a glance.
| Component | Start With | Scale To |
|---|---|---|
| Agent count | 2 agents with clear roles | 3-5 after proving 2 > 1 |
| Communication | Shared memory (simple) | Message passing (async) |
| Orchestrator | Hardcoded sequence | LLM-based routing |
| Validation | Agent B reviews Agent A | Debate or voting for critical tasks |
| State | Shared dict/database | Event-driven with conflict resolution |
| Cost tracking | Per-agent token counters | Budget allocation per agent |
| Framework | LangGraph or CrewAI | Custom when framework limits hit |
Warning sign: If your multi-agent system doesn't outperform a single well-prompted agent, you've added complexity without value. Always benchmark.
What's Next?
Multi-agent systems often need access to structured, production-grade data. Feature stores provide the infrastructure to serve consistent, versioned features to both training and inference pipelines — critical for ML-powered agent capabilities.