Why LLMs Hallucinate

Hallucination is not the model malfunctioning. It's the model performing — flawlessly — the exact task we trained and graded it to perform. That framing sounds like a dodge, the kind of thing you say to excuse a flaw. It isn't. It's the most useful thing to understand about the problem, because it tells you that you can't prompt your way out of something that's baked into the objective. You have to change the incentives.

Let me build the argument from the bottom.

First, a definition worth being picky about

A hallucination is output that is fluent, confident, and wrong — a fabricated citation, a plausible-but-fake API method, a birth date for a person who has none on record. The fluency and the confidence are the dangerous part. A model that said "I'm not sure, but maybe..." before every shaky claim would be annoying, not dangerous. The failure isn't being wrong. It's being wrong in the exact register the model uses when it's right.

People reach for easy explanations: not enough training data, bad retrieval, too small a model. Each is sometimes a contributing factor. None is the root. Hallucination shows up in the biggest models trained on the most data, and it shows up on questions where the right answer was definitely in the training set. So the cause has to be more fundamental than "didn't see enough examples."

Root cause one: the objective rewards plausibility, not truth

A base model is trained to predict the next token. Nothing in that objective contains a notion of true. It contains a notion of likely. The model learns the statistical shape of language — what tends to follow what — and an invented-but-typical continuation often scores beautifully on that metric.

Ask for a citation. In the training data, citations have a shape: an author, a year, a plausible-sounding title, a venue. The model has learned that shape cold. So when you ask about a paper it doesn't actually "know," producing a citation-shaped string is the highest-likelihood move. Smith et al., 2019, "Attention-Guided Retrieval for Open-Domain QA," EMNLP: perfectly formed, completely invented. The model didn't lie. It pattern-matched, which is the only thing the objective ever asked of it. There's even a formal result here: Kalai and Vempala (2023), in "Calibrated Language Models Must Hallucinate," argued that a language model that is well calibrated on its training distribution must hallucinate at some nonzero rate on arbitrary facts, the kind that can't be inferred from patterns and that show up rarely in the training data. The very statistical fidelity we want creates fabrication as a side effect.
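To make the "citation-shaped string" point concrete, here is a deliberately crude sketch. It is not a language model: it assumes each slot of a citation (author, year, venue) is chosen independently by frequency, and the training records are made-up placeholders, not real papers. Even so, it shows how the most likely output can be a record that was never observed.

```python
from collections import Counter

# Hypothetical toy "training set" of citation records (author, year, venue).
# These are made-up placeholders, not real papers.
TRAINING_CITATIONS = [
    ("Smith", 2017, "ACL"),
    ("Smith", 2018, "EMNLP"),
    ("Lee", 2019, "EMNLP"),
    ("Chen", 2019, "EMNLP"),
    ("Lee", 2019, "ACL"),
    ("Smith", 2019, "ACL"),
    ("Chen", 2018, "EMNLP"),
]


def most_likely_citation(records):
    """Pick the most frequent value for each slot independently.

    Assumption: slots are modeled independently, a crude stand-in for
    "what tends to follow what". A real language model is far richer.
    """
    slots = zip(*records)  # one tuple of values per slot
    return tuple(Counter(values).most_common(1)[0][0] for values in slots)


generated = most_likely_citation(TRAINING_CITATIONS)
seen_in_training = generated in TRAINING_CITATIONS

print(generated)         # the highest-frequency value in each slot
print(seen_in_training)  # whether that exact record was ever observed
```

Every slot in the generated record is the "right shape" and individually the most common choice, yet the combination appears nowhere in the toy data. Likelihood was maximized and truth was never consulted.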

Root cause two: we grade like a multiple-choice exam

This is the part the 2025 OpenAI paper Why Language Models Hallucinate (Kalai, Nachum, Vempala, and Zhang) drove home, and it reframed the whole conversation. Their argument is almost embarrassingly simple once you hear it.

Think about how a student treats a multiple-choice test with no penalty for wrong answers. Leaving a question blank scores zero. Guessing scores zero at worst and, on a four-option question, a quarter of a point on average. So the rational test-taker, when unsure, always guesses. Abstaining can never score higher and, in expectation, scores lower.

Now look at how we evaluate language models. Most benchmarks are graded binary: the answer matches the key (1 point) or it doesn't (0 points). Crucially, "I don't know" scores the same zero as a wrong answer. Under that grading scheme, a model that abstains when uncertain will, in expectation, lose to a model that guesses confidently, because guessing recovers some points and honesty recovers none. We then take the model that scores highest, the confident guesser, and crown it state of the art.
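The arithmetic behind that is short enough to write down. The sketch below assumes the simplest possible setup: one question, 1 point for a match, 0 for anything else, and a hypothetical `p_correct` standing in for the model's chance of guessing right. The function name is illustrative, not taken from any benchmark harness.

```python
def expected_binary_score(p_correct: float, abstains: bool) -> float:
    """Expected points on one question under 1/0 grading.

    p_correct is the chance a guess matches the key. Abstaining is graded
    the same as a wrong answer, so it always earns 0.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return 0.0 if abstains else p_correct


# Compare guessing and abstaining at a few illustrative confidence levels.
for p in (0.05, 0.25, 0.50):
    guess = expected_binary_score(p, abstains=False)
    blank = expected_binary_score(p, abstains=True)
    print(f"p={p:.2f}  guess={guess:.2f}  abstain={blank:.2f}")
```

However small `p_correct` gets, guessing never scores below abstaining, so a model tuned against this scoreboard has no reason to ever say "I don't know."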

[Figure: Why hallucination happens. An uncertain query with no reward for abstaining yields a confident guess. When abstention scores the same as a wrong answer, the model is trained to guess.]

So hallucination isn't only a pretraining artifact. It's an incentive we install during evaluation and reinforce during fine-tuning. We optimize models to be excellent test-takers, and an excellent test-taker bluffs when the blank pays nothing. The model is being honest about its training signal. We're the ones who set the grading curve.

Why retrieval helps but doesn't cure

RAG gets a lot of credit for "fixing" hallucination, and it does bring the hallucination rate down. If the answer is sitting in the retrieved context, the highest-likelihood continuation is to quote it, so grounding raises the floor.

But it caps out below 100%, for reasons that follow directly from the two root causes. Retrieval can fail to surface the right passage, and then the model is back to guessing — only now it's guessing while appearing grounded, which is worse. And even with the right passage in context, a model still trained to prefer a confident answer over an admission of uncertainty will sometimes override the source or fill a gap the source left open. Grounding changes what the model conditions on. It doesn't change what the model was rewarded for.
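The ceiling described above can be put into rough arithmetic. This is a back-of-the-envelope sketch, assuming a grounded system's accuracy splits cleanly into two cases: retrieval found the right passage, or it didn't. All three inputs and the numbers plugged in are hypothetical, chosen only to show the shape of the calculation.

```python
def grounded_accuracy(hit_rate: float,
                      faithful_given_hit: float,
                      guess_accuracy: float) -> float:
    """Overall accuracy via the law of total probability.

    hit_rate:           chance retrieval surfaces the right passage
    faithful_given_hit: chance the model sticks to that passage
    guess_accuracy:     chance an ungrounded guess happens to be right
    """
    for name, value in (("hit_rate", hit_rate),
                        ("faithful_given_hit", faithful_given_hit),
                        ("guess_accuracy", guess_accuracy)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1]")
    return hit_rate * faithful_given_hit + (1.0 - hit_rate) * guess_accuracy


# Illustrative numbers only: good retrieval, mostly faithful model.
print(round(grounded_accuracy(0.90, 0.95, 0.30), 3))
```

The only way the result reaches 1.0 is if retrieval never misses and the model never departs from the source. Improving retrieval raises `hit_rate`, but the other two terms are set by what the model was rewarded for.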

What actually moves the needle

Because the cause is incentives, the cures are about incentives and calibration, not clever prompts:

Grade abstention as better than a confident error. Change the eval and the fine-tuning reward so "I don't know" outscores a wrong guess. This is the lever the 2025 paper points at directly: fix the scoreboard and you change the behavior the model optimizes toward.
Ask for calibrated confidence. Have the model express uncertainty and threshold on it. A well-calibrated "60% sure" is far more useful than a flat assertion, and you can route low-confidence answers to retrieval or a human.
Ground, then constrain to the source. RAG plus instructions to answer only from provided context, with an explicit "if it's not here, say so."
Verify after the fact. A second pass — another model or a tool — that checks claims against sources catches the confident fabrications the first pass produced.
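The first two levers fit together, and a small sketch shows how. It assumes a grading rule of the kind the 2025 paper discusses: a correct answer earns 1 point, "I don't know" earns 0, and a wrong answer costs t/(1-t) points for a stated confidence threshold t. The function names and the idea that the model hands us a usable `confidence` number are assumptions for illustration.

```python
def wrong_answer_penalty(threshold: float) -> float:
    """Points deducted for a wrong answer at confidence threshold t."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    return threshold / (1.0 - threshold)


def expected_score_if_answering(confidence: float, threshold: float) -> float:
    """Expected points for answering: +1 if right, minus the penalty if wrong."""
    penalty = wrong_answer_penalty(threshold)
    return confidence - (1.0 - confidence) * penalty


def decide(confidence: float, threshold: float) -> str:
    """Answer only when answering beats the 0 points earned by abstaining."""
    if expected_score_if_answering(confidence, threshold) > 0.0:
        return "answer"
    return "abstain"


# Illustrative confidences at a threshold of 0.75 (penalty of 3 points).
for confidence in (0.60, 0.90):
    print(confidence, decide(confidence, threshold=0.75))
```

Under this rule the expected score of answering is positive only when confidence exceeds the threshold, so abstaining becomes the winning move on shaky questions. Setting the threshold to 0 removes the penalty and brings back the binary scoreboard. The "abstain" branch is also the natural place to route a question to retrieval or a human.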

None of this makes hallucination go to zero, and you should be suspicious of anyone selling that. A system that generalizes beyond its training data will, per the Kalai and Vempala result, sometimes generate plausible falsehoods. There is no free lunch in the trade-off between coverage and fabrication.

The mindset shift is the whole payoff. Stop treating hallucination as a defect to be patched and start treating it as the predictable output of an objective that rewards confident plausibility and a scoreboard that punishes honesty. Change the objective, change the scoreboard, and the behavior follows. Leave them alone, and no prompt in the world will talk the model out of doing exactly what you paid it to do.
