The agent booked the flight. Then the hotel API timed out. Now you own a non-refundable ticket to a city you have no room in.
This is the failure mode that separates a demo from a system. A single LLM call that errors is trivial — you retry it. But an agent strings together actions that change the world: it charges cards, books rooms, sends emails, files tickets. When step four fails, steps one through three already happened. You can't just throw an exception and unwind the stack, because the stack includes a credit card charge that the finally block can't un-charge.
Agent reliability is less about preventing errors and more about what you do when something irreversible is already done.
Three kinds of wrong
Lumping all failures together is the first mistake. They want different responses.
Transient. The network blipped, the API returned 503, you hit a rate limit. Nothing is wrong with your logic — the world was briefly busy. Response: retry, with backoff. These resolve themselves if you just ask again in a moment.
Permanent. The tool returned a 400 because your arguments are malformed, or the record doesn't exist, or you lack permission. Retrying is pointless and sometimes harmful — you'll hammer an endpoint that will never say yes. Response: don't retry. Either fix the input (an agent can sometimes self-correct here) or fail loudly.
Semantic. The call succeeded, returned 200, and the result is wrong. The weather tool returned yesterday's forecast. The search came back empty. No exception fired — the failure is in the content, not the status code. Response: validate outputs, not just check for thrown errors. This is the one that slips through every naive try/except, because nothing was thrown.
Treat all three as "an error occurred, retry" and you'll retry the unretryable, accept the nonsense, and hammer dead endpoints. The classification has to come first.
Retry like you mean it
For the transient case, naive retry is its own bug. Retry instantly and you just add load to a system already struggling. Retry forever and a permanent failure becomes an infinite loop.
The grown-up version is exponential backoff with jitter — wait 1s, then 2s, then 4s, with a little randomness so a thousand agents retrying in lockstep don't form a thundering herd that re-knocks-over the service the instant it recovers. And cap it. Three to five attempts, then give up and escalate.
import random, time
def with_retry(fn, attempts=4, base=1.0):
for i in range(attempts):
try:
return fn()
except Transient as e:
if i == attempts - 1:
raise # out of retries → escalate
time.sleep(base * 2**i + random.random()) # backoff + jitter
except Permanent:
raise # don't retry; fix or fail
Wrap the whole tool, not just the network call, in a circuit breaker too: if a dependency has failed its last N calls, stop trying for a cooldown window and fail fast instead of making every request wait out the full timeout. A flailing dependency shouldn't get to slow down your entire agent.
The saga: undoing what can't be undone
Back to the flight and the missing hotel. Retries don't help — the hotel genuinely has no rooms. So you're holding a flight you no longer want. What now?
The pattern is forty years old and predates all of this: the saga, from a 1987 database paper. The idea is that when you can't get a true transaction across multiple independent services — and you never can, across separate APIs — you define a compensating action for each step. Not a rollback (you can't roll back a charge), but a semantic undo: the action that makes up for it. Charged the card? Compensation is "issue a refund." Booked the flight? Compensation is "cancel within the free window." Sent the confirmation email? Compensation is "send a correction."
When a step fails, you walk backward through the steps that already succeeded, running each one's compensation in reverse order.
The agent ends in a consistent state — no flight, no hotel, no charge — instead of the torn, half-committed mess that naive error handling leaves behind. The work to make this real is unglamorous: every world-changing action needs a defined compensation written before you let the agent take it, and your framework needs to remember which steps committed so it knows which to undo. LangGraph's durable execution model and orchestrators like AWS Step Functions exist largely to hold that state for you, so a crash mid-saga doesn't lose track of what's been done.
There's a sharp edge hiding in the retry layer here, and it's worth naming: idempotency. Suppose "book flight" actually succeeded but the response got lost on the way back, so your retry logic fires it again. Now you've booked two flights and the saga doesn't even know it. The fix is to send an idempotency key — a unique ID per logical action — so the flight API recognizes the repeat and returns the original result instead of booking twice. Retries and compensations both assume each action happens exactly once; without idempotency keys, "retry" quietly becomes "do it again," and your careful recovery logic is now cleaning up a mess it created.
Recovery is a state, not a catch block
The deeper shift: in an agent, recovery isn't a line of code, it's a place the system can be. A resilient agent has an explicit error state with somewhere to go from it — re-plan around the obstacle, try a different tool, compensate and abort cleanly, or stop and ask a human. The worst outcome isn't failing. It's failing silently and continuing as if the failed step worked, building three more steps on a foundation that isn't there.
So the rule I'd carve over the door: before you let an agent take an action it can't take back, you must already know how it'll clean up if the next step falls over. No compensation defined, no irreversible action allowed. An agent that can do things to the real world without a plan for undoing them isn't autonomous. It's just unsupervised.
Leave a Reply
Your email address will not be published.