Most multi-agent systems are a meeting that should have been an email.
That's the contrarian opening, and I'll spend the rest of the post earning it, because the pattern is genuinely powerful in a narrow band and genuinely wasteful everywhere else. The pitch is seductive: instead of one generalist agent struggling across every domain, build a team — a research agent, an analysis agent, a writer agent — each specialized, coordinating toward one goal. A little org chart of LLMs. The collective outperforms any soloist. Synergy.
Sometimes that's exactly right. Often it's a committee you assembled to answer a question one good agent could have handled alone, at fifteen times the cost. Let's separate the two.
Three ways agents work together
When agents collaborate, they do it in one of three shapes, and the survey literature (Multi-Agent Collaboration Mechanisms is the 2025 map) keeps coming back to these:
- Sequential handoffs — agent A finishes, passes to B, B to C. A pipeline of specialists. This is prompt chaining with the steps promoted to full agents.
- Parallel workstreams — independent agents work different strands at once, then results merge. This is parallelization with the branches promoted to full agents.
- Hierarchical delegation — a lead agent plans, spins up specialist sub-agents, and synthesizes their work. This is the dominant production pattern, and it's the one worth building carefully.
The orchestrator-worker structure in code is shorter than its reputation suggests — it's a planner, a parallel fan-out, and a synthesis pass:
async def orchestrate(question):
subtasks = lead_plan(question) # lead decomposes: 3-5 subtasks
findings = await asyncio.gather(*(
worker(task) for task in subtasks # specialists run concurrently
))
report = lead_synthesize(question, findings) # lead writes the answer
return citation_pass(report) # separate verification pass
Frameworks give you this with batteries: CrewAI models role-based "crews" that map to job titles, AutoGen runs conversational group chats where a manager picks the next speaker, and LangGraph wires explicit supervisor/handoff graphs. Pick by how you think about the problem — roles, conversation, or control flow — not by which has the shiniest README.
The one number that should scare you
Anthropic's writeup of their multi-agent research system is the most honest public account, and it comes with two numbers you should tape to your monitor. The good one: their orchestrator-worker setup beat a single agent by 90.2% on an internal research eval. The sobering one: it burned roughly 15× the tokens of a single chat.
Fifteen times. That's not a rounding error you optimize away later — it's the defining economic fact of the pattern. Multi-agent systems are expensive in a way that compounds: every sub-agent has its own context, its own reasoning, its own back-and-forth, and the lead has to coordinate all of it. The same report mentions agents merrily spinning up fifty sub-agents for a simple query when left unchecked. Their main lever for control wasn't a clever architecture — it was prompt engineering, telling agents how much effort to spend and when to stop.
So the question before you build a team isn't "would multiple agents help?" Almost anything is helped by throwing more compute at it. The question is "is this task worth 15×?" For a high-value research report, maybe yes. For summarizing a document, absolutely not — you just built a bureaucracy to do a clerk's job.
When the committee actually beats the soloist
The clean dividing line, and Anthropic states it plainly: multi-agent shines on parallelizable work and struggles on tightly-coupled work.
Research is the poster child. Breadth-first questions — "survey the competitive landscape," "gather evidence from twenty sources" — decompose naturally into independent sub-tasks that don't need to talk to each other while they run. Hand each to a sub-agent, run them in parallel, synthesize. The work was separable, so separating it across agents costs you little and buys you breadth.
Now the opposite. Writing one coherent codebase is tightly coupled — every file makes assumptions about every other file, and three agents writing three modules in isolation produce something that doesn't compile, let alone cohere. Anthropic found exactly this: their system was weak at tasks where pieces depend heavily on each other. The coordination overhead of keeping everyone in sync exceeds the benefit of dividing the labor. Some work is a relay race; some work is a conversation, and you can't split a conversation across a committee without losing the thread.
The 2025 Beyond the Strongest LLM results sharpen this further — multi-turn orchestration can rival or beat the single strongest model on certain benchmarks, but "certain" is load-bearing. It's not a free upgrade. It's a trade that pays off on the right problem shape and loses on the wrong one.
Debate, the other shape
One more variant worth knowing, because it's not orchestrator-worker at all. In multi-agent debate, several agents argue toward an answer — propose, critique each other, revise, converge. The disagreement is the feature; forcing models to defend and attack positions surfaces errors that a single confident pass glides right past. It's reflection turned into a group activity, and for reasoning-heavy problems where one model's blind spot is another's obvious catch, it can pull answer quality up. It also multiplies the cost again, so the same 15× discipline applies.
Build a team when the work genuinely splits into specialist parts that can run apart — research, broad analysis, multi-source synthesis. Keep it a soloist when the pieces depend on each other, or when one capable agent already handles the job. The failure mode isn't using multi-agent systems. It's reaching for a committee out of enthusiasm, paying fifteen times over for the privilege, and ending up with five agents distracting each other on a task that wanted one.
Leave a Reply
Your email address will not be published.