A language model does not write a sentence. It writes one token, throws away the fact that it was ever unsure, and writes the next one. Then it does that a few hundred more times. The fluency you read back is an illusion stitched together one guess at a time — and once you see the seam, a lot of model behavior stops being mysterious.
Let me walk through a single step, then the loop around it.
The atom: next-token prediction
Feed a model the string "The capital of France is". Internally it doesn't see letters or even words. It sees tokens: sub-word chunks produced by a tokenizer. "France" might be one token; "tokenization" might split into "token" + "ization". This matters more than people expect. It's why models miscount the letters in "strawberry" and why rare words cost more to generate: they're more tokens.
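To make the sub-word idea concrete, here is a toy greedy longest-match tokenizer. The vocabulary is invented for illustration, and the function name toy_tokenize is hypothetical. Real tokenizers learn their vocabulary from data and may split these words differently.

```python
# Toy vocabulary, invented for illustration; real tokenizers learn theirs from data.
TOY_VOCAB = {"token", "ization", "France", "straw", "berry"}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: emit one character.
            pieces.append(word[i])
            i += 1
    return pieces


print(toy_tokenize("France"))        # ['France']
print(toy_tokenize("tokenization"))  # ['token', 'ization']
print(toy_tokenize("strawberry"))    # ['straw', 'berry']
```

In this sketch the model would receive "straw" and "berry" as two opaque units, which is roughly why counting the letters inside them is harder than it looks.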
The model runs that token sequence through its stack and produces, for the next position, a vector of logits — one raw score per token in the vocabulary. A vocabulary of ~100k–200k tokens means ~100k–200k scores. Logits aren't probabilities yet; they're unbounded real numbers. A softmax turns them into a distribution:
p_i = exp(logit_i / T) / Σ_j exp(logit_j / T)
That T is temperature, and we'll get to it. After softmax you have a probability for every possible next token. Paris is high. London is low but nonzero. banana is microscopic but, crucially, not zero. The model now has to pick one. How it picks is decoding, and it's a choice you control — not a fixed property of the model.
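The softmax formula above fits in a few lines of plain Python. This is a minimal sketch: the three tokens and their logits are made-up numbers, not output from any real model.

```python
import math


def softmax(logits, temperature=1.0):
    """p_i = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
tokens = ["Paris", "London", "banana"]
logits = [9.0, 4.0, -3.0]

probs = softmax(logits)
for t, p in zip(tokens, probs):
    print(f"{t:>7}: {p:.6f}")

# The same logits at a lower and a higher temperature.
print("T=0.5:", softmax(logits, temperature=0.5))
print("T=2.0:", softmax(logits, temperature=2.0))
```

Every token ends up with a strictly positive probability, and the share going to the top token shrinks as T grows.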
The loop
Pick a token, append it to the input, run the whole thing again. This is autoregression: each output becomes part of the next input. The model has no plan, no draft, no lookahead beyond what its weights encode about likely continuations.
The loop ends when the model samples a special end-of-sequence token, or when you hit a max-token cap. Generation latency scales with the number of output tokens because every single one requires a full forward pass. Input tokens get processed in parallel; output tokens come out one at a time. That asymmetry is the root of half of all latency engineering.
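Stripped of the neural network, the loop looks something like the sketch below. TOY_MODEL is a hypothetical lookup table that conditions only on the last token; a real model conditions on the whole sequence and runs a forward pass at that step.

```python
import random

EOS = "<eos>"

# Hypothetical stand-in for a model: next-token probabilities keyed on the last token.
TOY_MODEL = {
    "is": {"Paris": 0.9, "London": 0.1},
    "Paris": {".": 0.8, ",": 0.2},
    "London": {".": 1.0},
    ",": {"obviously": 1.0},
    "obviously": {".": 1.0},
    ".": {EOS: 1.0},
}


def next_token_probs(tokens):
    """Placeholder for the forward pass: returns a distribution over next tokens."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):          # stop condition 1: max-token cap
        probs = next_token_probs(tokens)     # one "forward pass" per output token
        choices, weights = zip(*probs.items())
        token = rng.choices(choices, weights=weights, k=1)[0]
        if token == EOS:                     # stop condition 2: end-of-sequence
            break
        tokens.append(token)                 # the output becomes part of the next input
    return tokens


prompt = ["The", "capital", "of", "France", "is"]
print(generate(prompt))
```

Note that the loop body runs once per output token, which is the sequential cost described above.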
Decoding: greedy is not "the truth"
The simplest decoder is greedy: always take the highest-probability token. Sounds safe. It isn't, for open-ended text. Greedy decoding gets stuck in loops and produces flat, repetitive prose — "I think that I think that I think that." Holtzman and colleagues showed in 2019 that maximizing probability at every step does not produce human-like text; human language is full of locally surprising choices that a greedy decoder would never make.
A common misconception is that temperature 0 (effectively greedy) gives you the model's "real" answer and everything else is noise. Not quite. Temperature 0 gives you the most probable continuation, which for factual lookups is usually what you want, and for anything creative is usually a dead end. There is no setting that reveals ground truth. There is only the distribution and how you sample from it.
The four knobs
Temperature (T). Divides the logits before softmax. T < 1 sharpens the distribution: the rich get richer and output gets more deterministic. T > 1 flattens it: long-shot tokens get a real chance, and output gets wilder and, past a point, incoherent. As T approaches 0 the distribution collapses to greedy; T = 0 itself would divide by zero, so implementations that accept it treat it as a special case meaning "take the top token". For extraction and classification I run near 0. For brainstorming, 0.7–1.0.
Top-k. Before sampling, keep only the k highest-probability tokens, zero out the rest, renormalize. k = 40 is a classic. It caps the worst-case weirdness but is blunt: a fixed k is too permissive when the model is confident (one token deserves 99% of the mass) and too restrictive when it's genuinely torn across many options.
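A minimal sketch of top-k filtering over a dictionary of probabilities. The two example distributions are invented to show the confident case and the torn case.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, drop the rest, renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Made-up distributions: one confident, one genuinely torn.
confident = {"Paris": 0.97, "London": 0.01, "Lyon": 0.01, "banana": 0.01}
torn = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}

print(top_k_filter(confident, k=2))
print(top_k_filter(torn, k=2))
```

With k = 2, the confident case still admits a 1% token, while the torn case throws away two options that together held about half the probability mass.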
Top-p (nucleus). The fix for top-k's bluntness, from the same 2019 Holtzman paper. Instead of a fixed count, keep the smallest set of tokens whose cumulative probability exceeds p (say 0.9), then sample from that nucleus. When the model is sure, the nucleus is one or two tokens. When it's uncertain, the nucleus widens automatically. This adaptivity is why top-p became the default sampler across most APIs.
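The nucleus rule can be sketched the same way. This version keeps tokens until the running total reaches p; the example distributions are made up, and production samplers work on tensors of logits, not dictionaries.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Made-up distributions: one confident, one genuinely torn.
confident = {"Paris": 0.97, "London": 0.01, "Lyon": 0.01, "banana": 0.01}
torn = {"red": 0.26, "blue": 0.25, "green": 0.25, "gold": 0.24}

print(top_p_filter(confident, p=0.9))  # nucleus of one token
print(top_p_filter(torn, p=0.9))       # nucleus of all four tokens
```

The same p yields a nucleus of one token in the first case and four in the second, which is the adaptivity a fixed k lacks.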
Repetition penalties. A separate family — penalize tokens (or n-grams) that already appeared, to break the loops greedy decoding falls into.
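One simple member of that family, sketched under assumptions: subtract a fixed amount from a token's logit for each time it has already been generated. The function name, the logits, and the penalty value are all illustrative; libraries and APIs implement several different variants.

```python
from collections import Counter


def apply_frequency_penalty(logits, generated, penalty=0.5):
    """Subtract penalty * (times already generated) from each token's logit."""
    counts = Counter(generated)
    return {tok: score - penalty * counts[tok] for tok, score in logits.items()}


# Made-up logits for a model stuck in "I think that I think that".
logits = {"I": 2.0, "think": 1.8, "that": 1.7, "maybe": 1.5}
history = ["I", "think", "that", "I", "think", "that"]

penalized = apply_frequency_penalty(logits, history)
print(max(logits, key=logits.get))        # I
print(max(penalized, key=penalized.get))  # maybe
```

Even a greedy decoder escapes the loop here, because the penalty reorders the candidates before the top one is picked.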
Here's the loop made concrete with the transformers library, which has exposed these as plain arguments for years:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("model-name")
model = AutoModelForCausalLM.from_pretrained("model-name")

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,     # off = greedy
    temperature=0.7,
    top_p=0.9,
    top_k=0,            # disabled; let top_p do the work
    max_new_tokens=64,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
Same model, same prompt, four different temperature/top_p settings: four different paragraphs, all "from the model." None is more authentic than the others. They're different walks through the same probability landscape.
Why this mental model pays off
Once you internalize "one token at a time, sampled from a distribution you tune," a pile of practical things click into place:
- Streaming works because tokens are produced sequentially — you can show them as they come.
- JSON breaks because a single low-probability token mid-structure derails the whole object, which is exactly why constrained decoding exists (a later post).
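- Long outputs cost more and run slower roughly in proportion to output token count. Input length still matters, but the prompt is processed in parallel, so an extra output token usually adds far more latency than an extra input token.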
- "Make it less repetitive" is often a decoding problem (raise temperature, add a penalty) before it is a prompting problem.
- Determinism is a setting, not a guarantee: even at T=0, floating-point arithmetic and batching can introduce tiny nondeterminism on some hardware.
The transformer architecture underneath gets the headlines, and it earns them. But the thing that turns a trained network into something that talks is this unglamorous loop: score every token, squash to a distribution, roll the weighted dice, append, repeat. Generation isn't authorship. It's a very long sequence of small, tunable bets — and you're holding the dials.