The first month, the agent cost $40 to run. It was great, so you shipped it. The third month, the invoice said $11,000 and someone in finance learned your name.
Nothing went wrong, exactly. The agent worked. It just worked expensively, on every request, with no notion that some questions are easy and some are hard and you're paying frontier-model prices for both. Cost, latency, and tokens aren't accounting details you reconcile after the fact — they're constraints the agent should optimize against while it runs, the same way it optimizes for being correct. An agent that ignores its own budget is a correct agent you can't afford to keep on.
The three budgets, and how they fight
You're spending three resources at once, and they don't trade off cleanly.
Tokens are the raw fuel: every token in the prompt and every token generated has a price, and agents are token gluttons because they loop, re-reading their whole history each step. Latency is the wall-clock wait, which compounds brutally: a 10-step agent whose LLM calls each take 3 seconds and run one after another keeps the user waiting 30 seconds, and nobody tolerates that from a chatbot. Cost is dollars, mostly downstream of tokens and model choice.
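To see how the loop inflates both numbers, here is a rough back-of-the-envelope sketch. It assumes every step re-sends a fixed base prompt plus everything accumulated so far, and that calls run strictly one after another. The function name and the token figures are made up for illustration.

```python
def loop_cost(steps, base_prompt_tokens, tokens_added_per_step, seconds_per_call):
    """Total prompt tokens and wall-clock seconds for a sequential agent loop.

    Assumption: each step re-reads the base prompt plus the full history so far.
    """
    total_prompt_tokens = 0
    history_tokens = 0
    for _ in range(steps):
        total_prompt_tokens += base_prompt_tokens + history_tokens
        history_tokens += tokens_added_per_step  # this step's output joins the history
    return total_prompt_tokens, steps * seconds_per_call


tokens, seconds = loop_cost(
    steps=10, base_prompt_tokens=2_000, tokens_added_per_step=500, seconds_per_call=3
)
print(tokens, seconds)  # 42500 30
```

Ten steps over a 2,000-token prompt bill 42,500 prompt tokens, not 20,000, because the history term grows with the square of the step count.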
The fights are real. Want lower latency? Run steps in parallel — but now you might do work you didn't need, spending more tokens. Want lower cost? Use a smaller model — but it might need more retries to get it right, eating the savings and adding latency. There's no free lunch. There's only being deliberate about which resource you're willing to spend to save another.
The single biggest win: don't use the big model for easy work
Here's the one that pays for itself fastest. Most requests are easy. You're paying frontier prices for all of them. Stop.
The pattern is a model cascade, and it predates the current agent wave — FrugalGPT laid it out in 2023. Try to answer with a small, cheap, fast model first. Check whether the answer is good enough. Only if it isn't do you escalate to the expensive model. Most queries never reach the top tier, and your bill collapses toward the cost of the cheap model while your quality stays near the expensive one's — because the hard queries, the ones that actually need the big model, still get it.
The catch is the "good enough?" check: escalation is only as smart as your ability to tell that the cheap answer fell short. Sometimes that's structural (the small model failed to produce valid JSON, or the tool call was malformed). Sometimes you need a confidence signal or a lightweight grader. The routing version (RouteLLM and friends) flips it: classify the query's difficulty up front and send it to the right tier directly, skipping the "try small, then retry big" double-spend. Either way the principle holds: match the model to the job, not to your anxiety about quality.
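A minimal sketch of a cascade with a purely structural check might look like the following. The two model functions are hypothetical stand-ins (no real API is called), and the rule that a good answer is a JSON object with an `answer` key is an assumption chosen only to keep the example checkable.

```python
import json
from typing import Callable


def is_good_enough(answer: str) -> bool:
    """Structural check only: the reply must be a JSON object with an 'answer' key."""
    try:
        parsed = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and "answer" in parsed


def cascade(query: str, tiers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try tiers cheapest-first; return (tier_name, answer) from the first that passes.

    If no tier passes, the last tier's answer is returned unchanged.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, call_model in tiers:
        answer = call_model(query)
        if is_good_enough(answer):
            break
    return name, answer


# Hypothetical stand-ins for real model calls.
def small_model(query: str) -> str:
    if len(query) > 40:  # pretend the small model falls apart on long inputs
        return "I think the answer is"
    return json.dumps({"answer": f"short reply to {query!r}"})


def large_model(query: str) -> str:
    return json.dumps({"answer": f"careful reply to {query!r}"})


TIERS = [("small", small_model), ("large", large_model)]
print(cascade("reset my password", TIERS)[0])  # small
print(cascade("x" * 200, TIERS)[0])            # large
```

Swapping `is_good_enough` for a confidence score or a grader call changes the signal but not the shape of the loop.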
The cheapest token is the one you don't spend twice
Caching is the other big lever, and there are two kinds people confuse.
Prompt caching caches the processing of a long, stable prefix. If every request starts with the same 8,000-token system prompt and tool definitions, you're paying to process those 8,000 tokens every single call. Providers now let you cache that prefix so it's processed once and reused — Anthropic shipped this in 2024 — cutting both cost and latency on the repeated part. For agents, whose prompts are enormous and mostly static turn to turn, this is close to free money.
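The size of the saving is easy to estimate. The sketch below assumes a provider that charges a premium to write the prefix into the cache and a discount to read it back; the multipliers and the per-million-token price are placeholder assumptions, not quoted rates, and cache expiry is ignored.

```python
def prefix_cost(calls, prefix_tokens, price_per_mtok,
                write_multiplier=1.25, read_multiplier=0.10):
    """Input cost of a shared prefix over `calls` requests: (uncached, cached).

    Assumptions: one cache write on the first call, cache hits on every later
    call, and the cache never expires in between. Multipliers are illustrative.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    uncached = calls * per_call
    cached = per_call * write_multiplier + (calls - 1) * per_call * read_multiplier
    return uncached, cached


uncached, cached = prefix_cost(calls=1_000, prefix_tokens=8_000, price_per_mtok=3.0)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
```

Under these assumed numbers the repeated prefix drops to roughly a tenth of its uncached cost; check your provider's actual pricing and cache lifetime before relying on the figure.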
Semantic caching caches the answer. If a user asks something semantically identical to a past question — "what's your refund policy" vs "how do refunds work" — you embed the query, find the near-duplicate, and return the cached answer with no model call at all. Tools like GPTCache do this. The risk is obvious and worth respecting: "near-duplicate" isn't "identical," and a too-loose similarity threshold serves a confidently wrong cached answer to a question that only looked the same. Tune the threshold like it can hurt you, because it can.
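Here is a toy sketch of the lookup and of the threshold risk. The `embed` function is a bag-of-words stand-in, not a real embedding model (so it will not match true paraphrases), and the `SemanticCache` class is hypothetical rather than GPTCache's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[word] * b[word] for word in a)  # Counter yields 0 for missing words
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def put(self, query, answer):
        self.entries.append((embed(query), answer))

    def get(self, query):
        """Return the best-matching cached answer, or None below the threshold."""
        q = embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


tight = SemanticCache(threshold=0.9)
tight.put("what is your refund policy", "Refunds within 30 days.")
print(tight.get("What is your refund policy"))    # hit
print(tight.get("what is your shipping policy"))  # None: similarity is about 0.8

loose = SemanticCache(threshold=0.7)
loose.put("what is your refund policy", "Refunds within 30 days.")
print(loose.get("what is your shipping policy"))  # wrong answer served from cache
```

The last line is the failure mode in miniature: a shipping question gets the refund answer because four of its five words matched.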
Give the agent a budget it can see
The subtle shift that separates a tuned system from a tuned agent: make the constraints part of the agent's own decision-making, not just an outer wrapper.
A budget-aware agent knows it has, say, a 10-step ceiling and 50,000 tokens to finish a task, and it plans accordingly — it doesn't explore six branches when it can afford two, doesn't re-read the whole history when a summary fits, doesn't call the expensive tool when a cheap one answers the question. You can hand it that budget explicitly and check it in the loop:
```python
def step(state):
    if state["tokens_used"] > state["budget"] * 0.8:
        return summarize_and_finish(state)  # running low → wrap up, don't sprawl
    return continue_reasoning(state)
```
This matters most for the failure that quietly destroys budgets: the runaway loop. An agent that doesn't converge will happily burn ten thousand steps and a fortune in tokens chasing a goal it can't reach. A hard step-and-token budget isn't just cost control — it's the backstop that turns "infinite expensive loop" into "gave up cleanly after a bounded spend."
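One way to enforce that backstop is a small budget object that the loop consults before every step. This is a sketch under assumptions: `Budget`, `run_agent`, and the `take_step` callback (which returns whether the task is done and how many tokens the step used) are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens):
        self.steps_used += 1
        self.tokens_used += tokens

    def exhausted(self):
        return self.steps_used >= self.max_steps or self.tokens_used >= self.max_tokens


def run_agent(take_step, budget):
    """Run until the task finishes or the budget runs out, whichever comes first."""
    while not budget.exhausted():
        done, tokens = take_step()
        budget.charge(tokens)
        if done:
            return "finished"
    return "gave_up"  # bounded spend instead of an endless loop


# An agent that never converges stops at the step cap.
budget = Budget(max_steps=10, max_tokens=50_000)
print(run_agent(lambda: (False, 1_000), budget), budget.steps_used, budget.tokens_used)
# gave_up 10 10000
```

The token cap is checked between steps, so a single step can overshoot it slightly; if that matters, cap the per-call output length as well.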
The rules I'd run on
After enough surprise invoices, the playbook shrinks to a few lines.
- Cascade by default. Cheap model first, escalate on a real signal, not on reflex. Most traffic never needs the top tier.
- Cache both layers. Prompt-cache the static prefix, semantic-cache the repeat answers — and keep the similarity threshold tight.
- Budget the loop. Hard caps on steps and tokens. The runaway agent is the expensive one, and it never announces itself.
- Measure per-request cost, not just the total. An $11k bill is a number. "$0.40 per resolved ticket, up from $0.04" is a diagnosis.
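For the last rule, a per-request ledger can be very small. The sketch below is illustrative only: `CostLedger` is a hypothetical helper, and the prices per million tokens are placeholder values, not any provider's rates.

```python
from collections import defaultdict


class CostLedger:
    """Accumulates model spend per request so it can be divided by outcomes."""

    def __init__(self, price_in_per_mtok, price_out_per_mtok):
        self.price_in = price_in_per_mtok
        self.price_out = price_out_per_mtok
        self.by_request = defaultdict(float)

    def record(self, request_id, input_tokens, output_tokens):
        cost = (input_tokens * self.price_in + output_tokens * self.price_out) / 1_000_000
        self.by_request[request_id] += cost

    def cost_per_resolved(self, resolved_ids):
        """Total spend (including unresolved requests) divided by resolved count."""
        if not resolved_ids:
            return None
        return sum(self.by_request.values()) / len(resolved_ids)


ledger = CostLedger(price_in_per_mtok=3.0, price_out_per_mtok=15.0)  # assumed prices
ledger.record("ticket-1", input_tokens=100_000, output_tokens=10_000)
ledger.record("ticket-1", input_tokens=100_000, output_tokens=0)  # a second agent step
ledger.record("ticket-2", input_tokens=50_000, output_tokens=0)
print(ledger.cost_per_resolved({"ticket-1"}))
```

Dividing total spend by resolved requests, rather than by all requests, means failed attempts raise the metric instead of hiding in the average.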
The goal was never "spend the least." A useless free agent saves you nothing. The goal is to spend on the requests that need it and stop spending on the ones that don't — and to build the agent so it knows the difference itself, instead of finding out from finance.