Gen AI Foundations · Jun 10, 2026 · 6 min · llm, deeplearning

Cost and Latency Engineering

The cheapest, fastest token is the one you never generate. Strip the jargon off every production LLM optimization and most of them are that one sentence applied from a different angle. So before reaching for a fancier technique, it's worth getting the cost model straight, because the model tells you which lever is actually pulling your bill and your p99 — and they're usually not the same lever.

The cost model, honestly

Two facts drive almost everything.

You pay per token, and output tokens are pricier than input tokens — typically several times more. So a verbose model that pads every answer costs you on both axes: more output tokens directly, and the user waiting for all of them.
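The per-token arithmetic can be made concrete with a small sketch. The prices below are hypothetical placeholders (dollars per million tokens, with output at five times input), not any provider's actual rates, and `Pricing` and `request_cost` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in dollars per million tokens; substitute real ones.
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request under simple per-token pricing."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


pricing = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)

# Same 1,000-token prompt, terse answer versus padded answer.
terse = request_cost(pricing, input_tokens=1_000, output_tokens=200)
verbose = request_cost(pricing, input_tokens=1_000, output_tokens=1_000)
print(f"terse: ${terse:.4f}  verbose: ${verbose:.4f}")
```

Under these assumed prices the padded answer costs three times as much for the same prompt, and all of the difference is output tokens.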

Latency is dominated by output, not input. Recall how generation works: the input prompt is processed in a single parallel pass (prefill), but output tokens come out one at a time, each needing a full forward pass (decode). That gives you two numbers that matter more than "total time": time to first token (TTFT), which is set mostly by prefill, and time per output token (TPOT), which is set by the decode loop.

A 2,000-token answer isn't slow because the prompt was long. It's slow because the model ran the decode loop 2,000 times. Internalize that and the levers sort themselves into "make the prompt cheaper" (TTFT) and "generate fewer/faster tokens" (TPOT and cost).
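A rough way to get both numbers from a streamed response is to timestamp each token as it arrives. The sketch below assumes you already have those timestamps in seconds; `ttft_and_tpot` and `estimated_latency` are hypothetical helper names, and the linear model ignores effects like batching and network jitter.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_sent: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (time to first token, mean time per output token) in seconds."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


def estimated_latency(ttft: float, tpot: float, output_tokens: int) -> float:
    """Simple linear model: prefill once, then one decode step per extra token."""
    return ttft + tpot * max(output_tokens - 1, 0)


# Made-up timestamps: request sent at t=0, first token at 0.5 s, then 20 ms apart.
ttft, tpot = ttft_and_tpot(0.0, [0.50, 0.52, 0.54, 0.56])
long_answer = estimated_latency(ttft, tpot, output_tokens=2_000)
print(f"TTFT={ttft:.2f}s  TPOT={tpot * 1000:.0f}ms  2,000 tokens ~ {long_answer:.1f}s")
```

With these made-up numbers the decode loop accounts for roughly 40 of the 40.5 seconds, which is the point of the paragraph above.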

Lever 1: cache the prompt you keep resending

Most production prompts have a large static prefix — a system prompt, tool definitions, a few-shot block, a long document — followed by a small variable part. Without caching, the model re-processes that entire prefix on every single call. That's prefill you're paying for over and over to compute the identical thing.

Prompt caching stores the model's internal state (the KV cache) for a prefix so subsequent calls reuse it instead of recomputing. Providers exposed this directly — Anthropic's prompt caching, for one — and the win is lopsided: cached input tokens cost a fraction of fresh ones, and TTFT drops because prefill is mostly skipped. The rule is structural: put everything stable at the front of the prompt, everything variable at the back. Caches key on a prefix, so a single early token that changes per request busts the whole thing. This is often the single highest-ROI change available, and it's a prompt-ordering edit, not an architecture project.
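The ordering rule can be illustrated without any provider API. This toy measures how many leading tokens two consecutive requests share, which is the most a prefix cache could reuse. Whole words stand in for tokenizer tokens, and `STATIC`, `stable_first`, and `variable_first` are hypothetical names; real caches work on tokenizer output and may have minimum lengths or block granularity.

```python
from typing import List


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Stand-in for the stable part: system prompt, tool definitions, few-shot block.
STATIC = ["<system>", "You", "are", "a", "support", "agent", "</system>",
          "<tools>", "lookup_order", "</tools>"]


def stable_first(user_id: str, question: List[str]) -> List[str]:
    # Stable content at the front, per-request content at the back.
    return STATIC + [f"user:{user_id}"] + question


def variable_first(user_id: str, question: List[str]) -> List[str]:
    # A per-request token at the very start changes the prefix every time.
    return [f"user:{user_id}"] + STATIC + question


q1 = ["Where", "is", "my", "order?"]
q2 = ["Cancel", "my", "order"]

good = shared_prefix_len(stable_first("u1", q1), stable_first("u2", q2))
bad = shared_prefix_len(variable_first("u1", q1), variable_first("u2", q2))
print(f"reusable prefix tokens: stable-first={good}, variable-first={bad}")
```

Both orderings send the same content; only the stable-first one leaves a shared prefix for the cache to reuse.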

Lever 2: don't send every request to the biggest model

Your traffic is not uniformly hard. Some requests are "classify this as spam or not"; some are "write a migration plan for this codebase." Sending both to a frontier model means overpaying enormously on the easy majority.

A model cascade (the idea FrugalGPT, 2023, made concrete) runs the cheap, small model first and only escalates to the expensive one when needed. The trick is the escalation signal: a confidence score, a self-check, a validator, or a cheap classifier that decides "this one's hard." Tune the threshold and you can route the easy 70% to a model that costs a fraction as much, reserving the frontier model for the cases that earn it.
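A minimal sketch of the cascade, assuming the cheap model can return a confidence score alongside its answer. The two toy functions are hypothetical stand-ins for real model calls, and the 0.8 threshold is an arbitrary starting point you would tune on your own traffic.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical interfaces: the cheap model returns (answer, confidence in [0, 1]).
CheapModel = Callable[[str], Tuple[str, float]]
StrongModel = Callable[[str], str]


@dataclass(frozen=True)
class CascadeResult:
    answer: str
    model_used: str


def cascade(prompt: str, cheap: CheapModel, strong: StrongModel,
            threshold: float = 0.8) -> CascadeResult:
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return CascadeResult(answer, "cheap")
    return CascadeResult(strong(prompt), "strong")


def toy_cheap(prompt: str) -> Tuple[str, float]:
    # Toy stand-in: confident only on the spam-classification task.
    if "spam" in prompt.lower():
        return "spam", 0.95
    return "unsure", 0.30


def toy_strong(prompt: str) -> str:
    # Toy stand-in for the expensive frontier model.
    return "detailed answer from the large model"


easy = cascade("Is this spam? 'WIN A PRIZE NOW'", toy_cheap, toy_strong)
hard = cascade("Write a migration plan for this codebase", toy_cheap, toy_strong)
print(easy.model_used, hard.model_used)
```

The confidence score could equally come from a validator or a separate classifier; the control flow stays the same.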

Figure: a request flows through a cache and a cheap model before escalating to a frontier model. Cache the prefix, answer easy cases with a cheap model, escalate only when needed.

Lever 3: a small task-tuned model often beats a frontier one

For a narrow, well-defined task — a specific classification, a specific extraction, a specific format — a small model fine-tuned on that task frequently matches or beats a giant general model, at a tiny fraction of the cost and latency. The frontier model is paying for breadth you aren't using. If 80% of your volume is one repetitive task, that task is a candidate for a small dedicated model, with the big model as fallback. (This is the production payoff of the fine-tuning ladder — earlier posts cover how to build that adapter.)

Lever 4: speculative decoding for latency

This one attacks TPOT directly and it's clever. A small, fast draft model proposes several tokens ahead; the big target model then verifies them all in a single parallel forward pass, accepting the longest prefix consistent with what it would have generated anyway and discarding the rest. Because verifying several tokens in parallel costs about the same as generating one, you get the big model's exact output distribution at materially lower latency whenever the draft guesses well. Leviathan and colleagues, and a parallel DeepMind result, both landed this in 2022–2023, and it's now standard in serving stacks. Critically, it's lossless: the output follows the same distribution the target model would have produced on its own (and is token-for-token identical under greedy decoding), so it's free quality-wise. You usually get it by configuration, not by writing it yourself.
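The draft-then-verify loop can be sketched for the greedy case, where "matches what the target would have generated" is a plain equality check. This is a toy under stated assumptions: `target` and `draft` are hypothetical next-token functions standing in for real models, verification is simulated with a Python loop rather than one batched forward pass, and sampling-based serving stacks use a probabilistic accept/reject rule instead of equality.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # hypothetical greedy next-token function


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, simulated target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real system does this
        #    in one batched forward pass, so it is counted as a single pass.
        target_passes += 1
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(out + proposal[:i]) != tok:
                break
            accepted += 1
        out.extend(proposal[:accepted])
        # 3. The same pass also yields the target's own next token: the fix at
        #    the first mismatch, or a bonus token if everything was accepted.
        if len(out) < goal:
            out.append(target(out))
    return out, target_passes


def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 7  # toy deterministic "large model"


def draft(ctx: List[int]) -> int:
    # Toy "small model": agrees with the target except at every fifth position.
    return 0 if len(ctx) % 5 == 0 else target(ctx)


baseline = greedy_decode(target, [1, 2], n_new=20)
spec, passes = speculative_decode(target, draft, [1, 2], n_new=20, k=4)
print(spec == baseline, passes)
```

The output matches the target-only baseline exactly, while the number of simulated target passes is well below the 20 sequential steps the baseline needs. How large that gap is in practice depends on how often the draft agrees with the target.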

Lever 5: just generate less

The least glamorous lever and often the biggest:

Measure before you optimize

The trap is cargo-culting a technique that doesn't fit your workload. Speculative decoding does nothing for a cost problem dominated by re-sent prompts; caching does nothing for a latency problem dominated by a 4,000-token output. So measure first, and measure the right things:
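One way to make that concrete is a per-request breakdown of where the money and the time go. The sketch below assumes you log token counts, TTFT, and total latency per request; `RequestStats` and `breakdown` are hypothetical names, and the prices are relative units rather than real rates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RequestStats:
    input_tokens: int
    output_tokens: int
    ttft_s: float    # time to first token, in seconds
    total_s: float   # total request latency, in seconds


def breakdown(stats: RequestStats, input_price: float, output_price: float) -> Dict[str, float]:
    """Share of cost due to input tokens and share of latency spent decoding."""
    input_cost = stats.input_tokens * input_price
    output_cost = stats.output_tokens * output_price
    total_cost = input_cost + output_cost
    if total_cost <= 0 or stats.total_s <= 0:
        raise ValueError("need positive cost and latency")
    return {
        "input_cost_share": input_cost / total_cost,
        "decode_latency_share": (stats.total_s - stats.ttft_s) / stats.total_s,
    }


# Two made-up workloads with hypothetical relative prices (output 5x input).
long_prompt = breakdown(RequestStats(20_000, 100, ttft_s=1.5, total_s=3.0), 1.0, 5.0)
long_output = breakdown(RequestStats(500, 4_000, ttft_s=0.2, total_s=80.2), 1.0, 5.0)
print(long_prompt)
print(long_output)
```

In the first workload almost all of the cost is input, so prompt caching is the lever worth testing. In the second, nearly all of the time is decode, so generating fewer or faster tokens is.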

There's no universal answer, and anyone who hands you one without asking about your traffic is guessing. Profile your own workload, find which lever your bottleneck responds to, pull that one, and re-measure. The techniques are all in service of the same blunt idea from the top: the best token is the one you never had to generate — so spend your engineering on not generating the ones you don't need, and on never recomputing the ones you already have.
