Agents: Core · Sep 5, 2025 · 5 min · agents, llm, engineering

Prompt Chaining

Give one prompt five jobs and it will quietly do four. Ask a model to "read this earnings report, summarize it, pull the three biggest trends with their numbers, draft an email to the CFO, and keep it under 150 words," and somewhere in that wall of instructions a requirement slides off the table. Usually the one you cared about most — the numbers come back rounded to mush, or the email runs to 240 words.

Prompt chaining is the unglamorous fix. Instead of one prompt hauling the whole load, you split the work into a line of smaller prompts where each step's output becomes the next step's input. Step one extracts. Step two summarizes what step one found. Step three drafts from that summary. An assembly line, not a hero.

Why smaller steps win

A prompt that does exactly one thing is hard to misread. That's the entire argument, and it's a strong one. Each focused step is more accurate because it has less to juggle, easier to debug because you can read its output in isolation, and easier to tune because you can rewrite one stage without disturbing the rest.

There's a second benefit people miss: between two LLM calls you can run ordinary code. Validate a field. Branch on a value. Call a database. Throw away a malformed result and retry. The chain isn't only LLM steps — it's LLM steps with plain, testable software stitched into the gaps, and that software is where most of your reliability actually comes from. Anthropic's Building Effective Agents names chaining as the first workflow pattern for exactly this reason: a task "decomposed into a sequence of steps, where each LLM call processes the output of the previous one."
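A minimal sketch of that glue code, with both model calls replaced by hypothetical stub functions (`extract_step`, `draft_step`) so it runs without an API key. The field names and the 10% threshold are illustrative assumptions, not part of any library:

```python
import json

def extract_step(report: str) -> str:
    """Hypothetical stand-in for an LLM call that returns JSON text."""
    return json.dumps({"revenue_growth_pct": 12.5, "quarter": "Q2"})

def draft_step(facts: dict, tone: str) -> str:
    """Hypothetical stand-in for a second LLM call."""
    return f"[{tone}] {facts['quarter']} revenue grew {facts['revenue_growth_pct']}%."

def run_chain(report: str) -> str:
    raw = extract_step(report)

    # Plain code in the gap: parse, then validate a field before moving on.
    facts = json.loads(raw)
    growth = facts.get("revenue_growth_pct")
    if not isinstance(growth, (int, float)):
        raise ValueError("extract step did not return a numeric growth figure")

    # Branch on a value: choose the tone for the next prompt deterministically.
    tone = "upbeat" if growth >= 10 else "cautious"
    return draft_step(facts, tone)

result = run_chain("...report text...")
print(result)
```

Everything between the two stubs is ordinary Python, so it can be unit-tested without calling a model at all.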

Figure: Prompt chain. A raw report flows through extract facts, a valid-JSON gate, summarize, and draft email to the final output.
Each arrow is a typed hand-off; the gate loops a bad extraction back before it can poison the rest of the chain.

The hand-off is the whole game

Here's the part that decides whether your chain works: the thing that travels between steps. Pass vague prose from one prompt to the next and you've built a game of telephone — step three interprets step two's flowery paragraph however it likes, and you're back to dropped requirements, just spread across three calls instead of one.

So don't pass prose. Pass a typed object. Force each step to emit JSON that matches a schema, validate it, and only then hand it on. This single discipline is the difference between a chain that's more reliable than a monolithic prompt and one that's merely more expensive.

from pydantic import BaseModel
import instructor
from anthropic import Anthropic

# Anthropic() reads the API key from the ANTHROPIC_API_KEY environment variable
client = instructor.from_anthropic(Anthropic())

report = "..."  # placeholder: the raw earnings-report text goes here

class Trends(BaseModel):
    trends: list[str]
    metrics: dict[str, float]

# Step 1: pull structured facts out of a messy report
facts = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    response_model=Trends,        # validated before it leaves the step
    max_retries=2,                # bound the re-prompts; raises if still invalid
    messages=[{"role": "user",
               "content": f"Find 3 trends and their numbers:\n{report}"}],
)

# Step 2: the next call only ever sees clean, typed data
email = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user",
               "content": f"Write a 5-line CFO update from: {facts.model_dump_json()}"}],
)

# With no response_model, the call should return the raw Anthropic Message
print(email.content[0].text)

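response_model makes instructor validate the reply against Trends and, when validation fails, re-prompt the model with the error, up to a retry limit; past that limit it raises instead of returning bad data. Step two never sees a half-formed sentence; it sees JSON that has already passed schema validation. That guarantees the shape of the data, not its truth: a well-formed but wrong number still passes, so check values in code where it matters. If you're on LangGraph, the same idea shows up as nodes and edges over a shared state object, and the official workflow guide walks a generate → gate → improve → polish chain end to end. Different machinery, identical principle: typed hand-offs, validated at the seam.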
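To make the gate concrete, here is a rough sketch of a validate-and-retry loop in plain Python. It is not instructor's implementation; `parse_trends`, `extract_with_retries`, and the fake model are hypothetical names invented for illustration, and the schema check is hand-rolled with the standard library instead of pydantic:

```python
import json
from dataclasses import dataclass

@dataclass
class Trends:
    trends: list[str]
    metrics: dict[str, float]

def parse_trends(raw: str) -> Trends:
    """Raise ValueError unless raw is JSON matching the Trends shape."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    trends, metrics = data.get("trends"), data.get("metrics")
    if not (isinstance(trends, list) and all(isinstance(t, str) for t in trends)):
        raise ValueError("trends must be a list of strings")
    if not (isinstance(metrics, dict) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in metrics.values())):
        raise ValueError("metrics must map names to numbers")
    return Trends(trends=trends, metrics={k: float(v) for k, v in metrics.items()})

def extract_with_retries(call_model, prompt: str, max_retries: int = 2) -> Trends:
    """Call the model, validate, and re-prompt with the error on failure."""
    last_error = None
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            return parse_trends(raw)
        except ValueError as err:
            last_error = err
            prompt = f"{prompt}\n\nYour last reply was invalid ({err}). Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_retries + 1} attempts") from last_error

# Fake model for the demo: the first reply is prose, the second is valid JSON.
replies = iter([
    "Sure! Here are the trends...",
    '{"trends": ["cloud growth"], "metrics": {"cloud_growth_pct": 18}}',
])
facts = extract_with_retries(lambda prompt: next(replies), "Find trends")
print(facts)
```

The retry limit matters: a gate that loops forever on a model that cannot produce valid output turns one bad extraction into an unbounded bill.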

One more thing those gaps buy you: somewhere to look when things go wrong. Because each step is a discrete call, each is a natural trace span. You can see precisely which stage got slow, which one started returning junk, which one your last prompt tweak actually moved. A monolithic prompt is a single black box — when its output degrades you're left squinting at the whole thing, guessing which of five jammed-together instructions broke. A chain is a row of labeled boxes, and the one that turned red is your bug. In production, that difference is the investigation; it's the reason chains are easier to keep alive than the giant prompt they replaced.
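A sketch of per-step spans using only the standard library. A real deployment would more likely use a tracing library; the `span` helper and the two stub steps here are hypothetical, and the second stub fails on purpose to show how the failing stage gets labeled:

```python
import time
from contextlib import contextmanager

spans: list[dict] = []

@contextmanager
def span(name: str):
    """Record wall-clock duration and ok/error status for one chain step."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        spans.append({"step": name, "status": status,
                      "seconds": time.perf_counter() - start})

def extract(report: str) -> dict:
    """Hypothetical stand-in for the extraction call."""
    return {"trends": ["cloud growth"]}

def summarize(facts: dict) -> str:
    """Hypothetical stand-in for a step that starts returning junk."""
    raise ValueError("model returned junk")

try:
    with span("extract"):
        facts = extract("...report text...")
    with span("summarize"):
        summary = summarize(facts)
except ValueError:
    pass  # the failure is already recorded on the span

for s in spans:
    print(f"{s['step']:<10} {s['status']:<6} {s['seconds']:.4f}s")
```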

Where it stops paying off

Chaining is not free, and the costs are easy to ignore until the bill lands.

Every step is another round trip. A five-step chain is five LLM calls, and because the steps run one after another, their latencies and costs add up: roughly five times the latency and spend of one call of similar size. If one capable model can do the task reliably in a single call, chaining it into pieces is pure overhead: you've added four network hops to feel organized. The October 2025 CompactPrompt work pushes on this from the cost side, trimming token usage up to ~60% on financial-QA chains; worth a read once your chain is long enough that the invoice gets your attention.
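The arithmetic is simple enough to sketch. Every number below (token counts, per-step latencies, and the per-million-token prices) is a made-up placeholder for illustration; substitute your own measurements and your provider's current pricing:

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    input_tokens: int
    output_tokens: int
    latency_s: float

# Hypothetical prices, in dollars per million tokens.
PRICE_IN_PER_MTOK = 3.00
PRICE_OUT_PER_MTOK = 15.00

def cost(step: Step) -> float:
    """Dollar cost of one call under the assumed prices."""
    return (step.input_tokens * PRICE_IN_PER_MTOK
            + step.output_tokens * PRICE_OUT_PER_MTOK) / 1_000_000

# Hypothetical three-step chain.
chain = [
    Step("extract", 4000, 300, 2.0),
    Step("summarize", 500, 200, 1.5),
    Step("draft", 400, 150, 1.5),
]

total_cost = sum(cost(s) for s in chain)
total_latency = sum(s.latency_s for s in chain)  # sequential steps add up
print(f"{len(chain)} calls, ${total_cost:.4f}, {total_latency:.1f}s end to end")
```

Note that later steps in a typed chain often read far fewer tokens than the first, so the total is not always a clean multiple of the single-call cost; latency, though, always accumulates.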

A bad early step poisons everything after it. Garbage out of step one is garbage into steps two through five. The validation gates aren't decoration; they're the thing that stops one bad extraction from quietly corrupting the rest of the pipeline.

And only chain steps that genuinely depend on each other. This is the most common mistake. People chain steps that have no business being sequential — "search source A, then source B, then source C" — when those searches don't need each other at all. That's not a chain, it's a queue you built by hand, and it should run concurrently. (That pattern gets its own post.) Chaining is for dependent steps. The moment two steps don't read each other's output, the chain is the wrong tool.
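A small sketch of the distinction, using `asyncio.gather` for the independent lookups and an ordinary `await` for the step that depends on them. The `search` and `summarize` functions are hypothetical stubs standing in for real calls:

```python
import asyncio

async def search(source: str, query: str) -> str:
    """Hypothetical stand-in for an independent lookup or LLM call."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"{source}: results for {query!r}"

async def summarize(results: list[str]) -> str:
    """Hypothetical dependent step: it needs every search result."""
    return f"summary of {len(results)} sources"

async def main() -> str:
    # Independent steps: none reads another's output, so run them together.
    results = await asyncio.gather(
        search("A", "q3 revenue"),
        search("B", "q3 revenue"),
        search("C", "q3 revenue"),
    )
    # Dependent step: this is the part that belongs in a chain.
    return await summarize(list(results))

summary = asyncio.run(main())
print(summary)
```

The three searches overlap in time, so the wait is roughly that of the slowest one instead of the sum of all three; only the summarize step has to wait for its inputs.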

Chaining is what you reach for when one call keeps fumbling the hand-off between sub-tasks, and what you rip out the week a single model grows capable enough to hold the whole job in its head. Both moves are correct. Knowing which week you're in is the actual skill.
