Do the arithmetic on a slow agent and the answer is almost always the same: it's slow because it's waiting in line.
Say an agent answers a research question by checking four sources — news, filings, a price feed, and an internal database — and each call takes about two seconds. Run them one after another and your user waits eight seconds before a single word of synthesis begins. But those four lookups don't depend on each other. None of them needs another's output. There is no reason on earth they should run in series, and yet the naive agent runs them in series because that's how the code reads top to bottom.
Parallelization is the pattern that fixes the dumbest source of latency in agent systems: waiting for things that didn't need to wait.
Sequential time is a sum. Parallel time is a max.
That's the whole idea in one line. Run N independent operations sequentially and total time is the sum of their durations. Fire them concurrently and total time drops to roughly the slowest single one — plus whatever you do to combine the results. Four two-second calls go from eight seconds to a hair over two.
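The sum-versus-max arithmetic can be checked with a small simulation. This is a minimal sketch: `fake_lookup` is a hypothetical stand-in that only sleeps, and the durations are scaled down from two seconds to 0.2 seconds so it runs quickly.

```python
import asyncio
import time

# Hypothetical stand-in for a network call: it only sleeps.
async def fake_lookup(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return name

# Four independent sources, scaled down from 2 s each to 0.2 s each.
SOURCES = {"news": 0.2, "filings": 0.2, "prices": 0.2, "database": 0.2}

async def run_sequential() -> float:
    start = time.perf_counter()
    for name, seconds in SOURCES.items():
        await fake_lookup(name, seconds)  # each call waits for the previous one
    return time.perf_counter() - start

async def run_concurrent() -> float:
    start = time.perf_counter()
    # All four sleeps overlap, so elapsed time tracks the longest one.
    await asyncio.gather(*(fake_lookup(n, s) for n, s in SOURCES.items()))
    return time.perf_counter() - start

async def compare() -> tuple[float, float]:
    return await run_sequential(), await run_concurrent()

if __name__ == "__main__":
    seq, par = asyncio.run(compare())
    print(f"sequential ≈ {seq:.2f}s (sum), concurrent ≈ {par:.2f}s (max)")
```

The sequential figure should land near the sum of the four durations and the concurrent one near the largest, plus a little scheduling overhead.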
It bites hardest when the work is I/O-bound, which agent work almost always is. An LLM call, an API request, a database query — these spend nearly all their time waiting on a network round trip, not burning CPU. Waiting is exactly the thing computers can overlap for free.
Two flavors: sectioning and voting
Anthropic's "Building Effective Agents" splits parallelization into two shapes worth keeping straight.
Sectioning is what the diagram shows: different sub-tasks, all independent, run at once. Summarize four different documents. Check flights and hotels and events for the same trip. Each branch does its own distinct thing, and you stitch the pieces together at the end.
Voting is the other shape: the same task, run several times, then aggregate. Ask three independent passes "is this code safe to merge?" and take the majority. Generate five drafts and pick the best. You're not dividing the work — you're buying redundancy, trading tokens for reliability. Different goal, same concurrency primitive underneath.
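Voting can be sketched with the same primitive. In the example below, `majority_vote`, `Judge`, and `make_stub_judge` are illustrative names, not part of any library, and the stub judge returns canned verdicts where a real one would call a model.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable

# Hypothetical signature: any coroutine function that returns a verdict string.
Judge = Callable[[str], Awaitable[str]]

async def majority_vote(judge: Judge, question: str, n: int = 3) -> tuple[str, float]:
    """Run the same check n times concurrently; return (winner, agreement ratio)."""
    verdicts = await asyncio.gather(*(judge(question) for _ in range(n)))
    winner, count = Counter(verdicts).most_common(1)[0]
    return winner, count / n

# Stub judge for demonstration only; it hands out canned verdicts in order.
def make_stub_judge(answers: list[str]) -> Judge:
    remaining = iter(answers)

    async def judge(question: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return next(remaining)

    return judge

if __name__ == "__main__":
    judge = make_stub_judge(["safe", "unsafe", "safe"])
    print(asyncio.run(majority_vote(judge, "Is this code safe to merge?")))
```

An odd `n` avoids ties on a two-way question, and the agreement ratio gives a rough signal for when a split vote should go to a human instead.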
The code is shorter than the explanation
In Python this is one standard-library call. asyncio.gather launches everything and waits for all of it:
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

async def summarize(doc: str) -> str:
    msg = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=400,
        messages=[{"role": "user", "content": f"Summarize in 3 bullets:\n{doc}"}],
    )
    return msg.content[0].text

async def main(docs: list[str]) -> list[str]:
    # fan-out: every summary is in flight at once
    return await asyncio.gather(*(summarize(d) for d in docs))

docs = ["first document text", "second document text"]  # placeholder inputs
summaries = asyncio.run(main(docs))  # wall-clock ≈ the slowest single call
Frameworks wrap this in nicer clothes. LangGraph's Send API fans out a dynamic number of branches and reduces them back; LlamaIndex Workflows have a fan-in/fan-out primitive with a worker cap; the OpenAI Agents SDK has a parallel-agents cookbook doing gather under the hood. They're all the same move dressed differently: launch, wait, combine.
If you want this pushed to its limit, LLMCompiler plans an entire set of tool calls as a parallel DAG instead of issuing them one at a time, and reports up to 3.7× lower latency than a step-by-step ReAct loop. That's the ceiling of the idea: don't just parallelize the calls you happen to write next to each other — figure out the dependency graph and parallelize everything that can be.
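The dependency-graph idea can be illustrated in a few lines of asyncio. This is a toy scheduler written for this post, not LLMCompiler's implementation: `run_dag`, `Node`, and the demo graph are hypothetical, and the sketch assumes the graph has no cycles.

```python
import asyncio
from typing import Awaitable, Callable

# A node is (dependency names, coroutine function that receives their results).
Node = tuple[list[str], Callable[[dict[str, object]], Awaitable[object]]]

async def run_dag(graph: dict[str, Node]) -> dict[str, object]:
    """Start every node at once; each one waits only on its own dependencies."""
    tasks: dict[str, asyncio.Task] = {}

    async def run_node(name: str) -> object:
        deps, fn = graph[name]
        # Blocks until this node's dependencies finish, not the whole graph.
        dep_results = await asyncio.gather(*(tasks[d] for d in deps))
        return await fn(dict(zip(deps, dep_results)))

    # Tasks do not start running until we await below, so every entry in
    # `tasks` exists before any node looks up its dependencies.
    for name in graph:
        tasks[name] = asyncio.create_task(run_node(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks.keys(), results))

async def demo() -> dict[str, object]:
    async def fetch(label: str, seconds: float) -> str:
        await asyncio.sleep(seconds)  # hypothetical stand-in for a tool call
        return label

    graph: dict[str, Node] = {
        "news": ([], lambda _: fetch("news", 0.05)),
        "prices": ([], lambda _: fetch("prices", 0.05)),
        # The report needs both lookups, so it starts only after they finish.
        "report": (
            ["news", "prices"],
            lambda deps: fetch(f"report({deps['news']}, {deps['prices']})", 0.01),
        ),
    }
    return await run_dag(graph)

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Here the two lookups overlap and the report waits on both, which is the "parallelize everything that can be" move in miniature. A production planner would also need cycle detection and error handling.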
Where the speedup quietly leaks away
Parallelization has a reputation as free speed. It isn't.
The fan-in is a real step. Total time isn't zero — it's the slowest branch plus the synthesis that waits on all of them. One slow source caps your entire speedup. If three branches finish in 200ms and the fourth takes four seconds, congratulations, your "parallel" agent takes four seconds.
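One common response to a straggler, offered here as a sketch and not as the only option, is a per-branch deadline: give each branch a time budget and let the synthesis proceed without whatever misses it. The helper `with_deadline` and the fake sources are hypothetical; `asyncio.wait_for` is the standard-library call doing the work.

```python
import asyncio
from typing import Awaitable, Optional

async def with_deadline(call: Awaitable[str], seconds: float) -> Optional[str]:
    """Return the branch's result, or None if it misses the deadline."""
    try:
        return await asyncio.wait_for(call, timeout=seconds)
    except asyncio.TimeoutError:
        return None  # wait_for cancels the straggler; we carry on without it

async def fake_source(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # hypothetical stand-in for a lookup
    return name

async def gather_with_deadline() -> list[Optional[str]]:
    branches = [
        fake_source("news", 0.02),
        fake_source("filings", 0.02),
        fake_source("prices", 0.02),
        fake_source("database", 0.5),  # the straggler
    ]
    # Every branch gets the same 0.1 s budget; total time is capped near it.
    return await asyncio.gather(*(with_deadline(b, 0.1) for b in branches))

if __name__ == "__main__":
    print(asyncio.run(gather_with_deadline()))
```

The trade is explicit: latency is bounded by the deadline instead of the slowest branch, at the cost of sometimes synthesizing from incomplete inputs.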
N calls at once is N times the spend at once. Sequential work spreads cost over time; parallel work spikes it. Fire fifty concurrent requests and you'll meet your provider's rate limiter, get throttled, and watch your beautiful fan-out collapse into a retry storm. Cap concurrency. Add backoff. A semaphore around gather is not optional at scale. It's a few lines that keep you from DoS-ing your own provider:
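sem = asyncio.Semaphore(8)  # at most 8 calls in flight

async def bounded(doc: str) -> str:
    async with sem:  # waits here while 8 calls are already running
        return await summarize(doc)

# then fan out through the wrapper instead of calling summarize directly:
# await asyncio.gather(*(bounded(d) for d in docs))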
Now a thousand documents fan out eight at a time instead of all at once. You give up a little peak parallelism and get back a system that doesn't fall over.
The voting variant has the same shape and the same cost profile: running the same check three or five times to take a majority is genuinely more reliable, but you're paying 3–5× the tokens for that confidence, so spend it where a wrong answer is expensive (a safety classifier, a merge gate), not on every request out of habit.
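Returning to the "add backoff" half of the advice on rate limits, a retry wrapper might look like the sketch below. `RateLimited` is a hypothetical placeholder for whatever throttling error your client raises, and the retry count and delays are arbitrary defaults, not recommendations from any provider.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

class RateLimited(Exception):
    """Hypothetical placeholder for a client's throttling error."""

async def with_backoff(
    call: Callable[[], Awaitable[T]],
    retries: int = 4,
    base_delay: float = 0.5,
) -> T:
    """Retry `call` on RateLimited, sleeping longer after each failure."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries: surface the error to the caller
            # Exponential backoff with full jitter, so concurrent retries
            # do not all wake up and hit the provider at the same instant.
            await asyncio.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")  # the loop always returns or raises

async def bounded_with_backoff(
    sem: asyncio.Semaphore, call: Callable[[], Awaitable[T]]
) -> T:
    # Holding the slot while backing off keeps pressure off the provider.
    async with sem:
        return await with_backoff(call)
```

Keeping the semaphore slot during the sleep is a deliberate choice in this sketch: a throttled branch does not free capacity for more requests that would likely be throttled too.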
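Partial failure gets weird. When one of four branches throws, what does the whole thing return? Three good results and a hole? A total failure? You have to decide on purpose, because the default is rarely what you want: asyncio.gather raises the first exception to the caller, the results of the branches that succeeded are lost, and the remaining branches keep running in the background with nobody waiting on them.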
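One way to decide on purpose is to pass `return_exceptions=True` to `asyncio.gather`, which hands exceptions back as ordinary values so you can sort successes from failures yourself. In this sketch, `fetch` and `gather_tolerant` are hypothetical names and the failure is simulated.

```python
import asyncio

async def fetch(name: str) -> str:
    await asyncio.sleep(0)  # hypothetical stand-in for a lookup
    if name == "filings":
        raise RuntimeError("filings source is down")  # simulated failure
    return f"{name}: ok"

async def gather_tolerant(
    names: list[str],
) -> tuple[dict[str, str], dict[str, BaseException]]:
    # With return_exceptions=True, gather returns exceptions as results
    # in place of raising the first one.
    outcomes = await asyncio.gather(
        *(fetch(n) for n in names), return_exceptions=True
    )
    results: dict[str, str] = {}
    errors: dict[str, BaseException] = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            errors[name] = outcome
        else:
            results[name] = outcome
    return results, errors

if __name__ == "__main__":
    results, errors = asyncio.run(gather_tolerant(["news", "filings", "prices"]))
    print("succeeded:", results)
    print("failed:", errors)
```

Whether the synthesis step should proceed with a hole, retry the failed branch, or abort is still a product decision; this only makes the choice visible in the code.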
The honest summary: parallelize the independent parts, keep the synthesis sequential, and respect that you've traded a simple linear trace for a concurrent one that's harder to debug. Most real workflows end up as a mix anyway — fan out to gather, fan in to think. The art is knowing which half is which, and not paying network latency for steps that could have shared it.