Using one language model to grade another feels like asking the fox to audit the henhouse and then trusting its report. The strange thing is that it works often enough to be one of the most useful tools in the eval kit, and it fails in ways subtle enough to quietly corrupt every number you report. Both are true. The skill is knowing which situation you're in, and the only way to know is to treat the judge as a model you must also evaluate.
Why anyone does this
Human evaluation is the gold standard and it doesn't scale. You can't put a person on every output of a system serving thousands of requests an hour, and you certainly can't do it on every commit. So the appeal of an LLM judge is obvious: it's fast, it's cheap relative to humans, and — this is the part that makes it viable — on many tasks it agrees with human raters at a rate that's genuinely useful.
The 2023 MT-Bench / Chatbot Arena work by Zheng and colleagues put numbers on this. A strong judge model, asked to compare two responses, agreed with human preferences at roughly the rate that two humans agree with each other — around 80%. That's the result that launched a thousand eval pipelines. If the judge agrees with people about as often as people agree with people, then for triage, ranking, and regression-catching, it's good enough to lean on.
"Good enough to lean on" is not "trustworthy by default," though. The same body of work catalogued the ways judges go wrong, and they're systematic, not random.
The biases, and they're real
A random error you can average away. A bias skews every score in one direction, and LLM judges have several documented ones:
- Position bias. Show a judge response A then response B, and it favors one slot (often the first) regardless of content. Wang and colleagues (2023) titled their paper "Large Language Models are not Fair Evaluators" and showed you can flip the verdict just by swapping the order. If your eval always lists the candidate before the reference, you've put a thumb on the scale.
- Verbosity bias. Judges tend to rate longer answers as better, even when the extra length is padding. A model that rambles confidently can out-score a tighter, more correct response.
- Self-preference. A judge tends to favor outputs that resemble its own style — sometimes literally rating text from its own model family higher. Grading your model's outputs with the same model is a conflict of interest worth naming.
- Leniency / sycophancy. Absent a sharp rubric, judges drift toward generous scores. Everything is a 7 or 8. The scale compresses and stops discriminating.
None of these mean the judge is useless. They mean a naive judge — "here are two answers, which is better?" — is measuring its own biases as much as the responses.
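A quick way to see whether verbosity bias might be leaking into your own results is to count how often the longer response wins. This is a minimal sketch, assuming verdicts are stored as (response_a, response_b, winner) tuples; that record layout and the function name are illustrative, not taken from any library.

```python
def longer_wins_rate(verdicts):
    """Fraction of decisive verdicts won by the longer response.

    `verdicts` is an iterable of (response_a, response_b, winner) tuples,
    where winner is "A", "B", or "tie" (a hypothetical record layout).
    Returns None if there are no decisive, unequal-length comparisons.
    """
    longer_wins = 0
    counted = 0
    for a, b, winner in verdicts:
        if winner not in ("A", "B") or len(a) == len(b):
            continue  # skip ties and equal-length pairs
        counted += 1
        longer = "A" if len(a) > len(b) else "B"
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else None
```

A rate well above 0.5 doesn't prove bias on its own, since longer answers are sometimes better, but it tells you where to spot-check with a human.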
How to make a judge you can trust
The fixes are mostly about removing the judge's freedom to be sloppy. In rough order of impact:
Prefer pairwise over pointwise. Asking "is A better than B?" is more reliable than asking "rate A from 1 to 10," because relative judgments are more stable than absolute ones. A judge struggles to consistently anchor "this is a 7," but it can usually tell which of two answers is stronger.
Swap positions and require consistency. Run every comparison twice, A-then-B and B-then-A, and count a win only if both orderings give the same verdict. Disagreement between the two orderings is position bias caught in the act; treat those cases as ties or escalate them. This one trick neutralizes position bias at the cost of doubling the judge calls.
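The two-ordering check can be sketched in a few lines. This assumes a `judge(prompt, first, second)` callable that returns "first" or "second"; that interface is hypothetical, so wrap whatever model call you actually use to match it.

```python
from typing import Callable


def swapped_verdict(judge: Callable[[str, str, str], str],
                    prompt: str, resp_x: str, resp_y: str) -> str:
    """Run the judge in both orders and return "X", "Y", or "tie".

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first" or "second" for whichever response it prefers. Any other
    return value raises KeyError rather than being silently counted.
    """
    forward = judge(prompt, resp_x, resp_y)   # X shown first
    backward = judge(prompt, resp_y, resp_x)  # Y shown first

    # Translate slot-based answers back to the underlying responses.
    winner_forward = {"first": "X", "second": "Y"}[forward]
    winner_backward = {"first": "Y", "second": "X"}[backward]

    # Only a verdict that survives the swap counts as a win.
    return winner_forward if winner_forward == winner_backward else "tie"
```

A judge that always picks the first slot comes out as a tie on every pair, which is exactly the signal you want to see rather than have hidden in a win rate.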
Give a rubric, not a vibe. "Rate helpfulness 1–5" invites leniency. "Score 1 if it answers the question with no factual errors, 0 otherwise; a factual error is X, Y, Z" gives the judge something to hold onto. The more the rubric spells out what each score means with concrete criteria, the less the judge falls back on length and style.
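One way to keep the rubric explicit is to store it as data and render it into the prompt, so every score has its criterion written out. The rubric wording, the prompt layout, and the `build_judge_prompt` name below are all illustrative assumptions, not a recommended template.

```python
# Hypothetical binary rubric: each score maps to a concrete criterion.
RUBRIC = {
    1: "Answers the question asked and contains no factual errors.",
    0: "Does not answer the question, or contains at least one factual error.",
}


def build_judge_prompt(question: str, answer: str, rubric: dict[int, str]) -> str:
    """Render a grading prompt that spells out what each score means."""
    lines = ["Grade the answer using ONLY the rubric below.", "", "Rubric:"]
    # List the highest score first so the passing bar is stated up front.
    for score in sorted(rubric, reverse=True):
        lines.append(f"- Score {score}: {rubric[score]}")
    allowed = ", ".join(str(s) for s in sorted(rubric))
    lines += [
        "",
        f"Question: {question}",
        f"Answer: {answer}",
        "",
        f"Reply with exactly one of: {allowed}",
    ]
    return "\n".join(lines)
```

Keeping the rubric in one place also means the human labelers and the judge can be handed the same criteria.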
Make it reason first. G-Eval (2023) showed that having the judge produce a chain of reasoning before its verdict improves alignment with humans — the judge that explains itself judges better than the judge that blurts a number. Structure the output so the rationale comes before the score.
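Reasoning-first output is easy to enforce at parse time: reject any judgment where the verdict arrives without a rationale ahead of it. A minimal sketch, assuming a plain-text format with a "Reasoning:" section followed by a "Verdict:" line; the format and the `parse_judgment` name are assumptions made for illustration.

```python
import re

# Instruction text to append to the judge prompt (hypothetical format).
OUTPUT_FORMAT = (
    "First write 'Reasoning:' followed by your analysis.\n"
    "Then, on the final line, write 'Verdict: A' or 'Verdict: B'."
)

_VERDICT_RE = re.compile(r"^Verdict:\s*([AB])\s*$", re.MULTILINE)


def parse_judgment(text: str):
    """Return (reasoning, verdict), or None if the format was not followed."""
    match = _VERDICT_RE.search(text)
    if match is None:
        return None  # no recognizable verdict line
    # Everything before the verdict line is treated as the rationale.
    reasoning = text[:match.start()].strip()
    if reasoning.startswith("Reasoning:"):
        reasoning = reasoning[len("Reasoning:"):].strip()
    if not reasoning:
        return None  # verdict came with no rationale before it
    return reasoning, match.group(1)
```

Treating a malformed reply as "no judgment" rather than guessing keeps format failures visible instead of folding them into the scores.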
Anchor to humans. This is the step people skip and shouldn't. Hand-label a few dozen cases, run your judge on them, and measure agreement. If the judge and your humans agree 85% of the time, you can treat the judge roughly like a second human rater, bearing in mind that an estimate from a few dozen cases is coarse, so label more if a decision hinges on it. If they agree 55% of the time on a two-way choice, your judge is a coin flip in a lab coat and every downstream number it produced is noise. You can't know which without the human anchor set.
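Measuring agreement on the anchor set takes only a few lines. The sketch below reports raw agreement plus Cohen's kappa, which corrects for the agreement two raters would reach by chance; kappa is an addition here rather than something the steps above call for, and the function name is illustrative.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[lab] * j_counts[lab] for lab in h_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

For example, with human labels `["A", "A", "B", "B"]` and judge labels `["A", "A", "B", "A"]`, raw agreement is 0.75 but kappa is 0.5, because half of that agreement would be expected by chance.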
When it lies
Even a well-built judge has a competence ceiling, and it's exactly where you most want help. A judge can only reliably evaluate what it could do itself. So:
- Niche factuality. If the correct answer depends on domain knowledge the judge doesn't have — your internal systems, a specialized field, fresh facts past its training — it will grade fluent-and-wrong as correct, because it can't tell the difference any better than the model it's judging. The fox and the hen took the same exam.
- Subtle reasoning errors. A judge often rewards an answer that sounds like sound reasoning over one that is sound, especially in long derivations it won't actually re-derive.
- Adversarial or borderline cases. Right at the decision boundary — where you most need a reliable signal — judges are least consistent and biases bite hardest.
For these, the judge is a filter, not a verdict. Use it to triage the easy 80%, route the hard cases and the low-confidence ones to a human, and never let a judge be the final word on something it couldn't have produced.
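The filter-not-verdict idea can be written down as a small routing rule. This is a sketch under stated assumptions: the `Judgment` fields and the `route` function are hypothetical, and how you decide whether a case falls inside the judge's competence (`in_judge_domain`) is left to you.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    verdict: str           # "A", "B", or "tie"
    orders_agree: bool     # did both A/B orderings give the same verdict?
    in_judge_domain: bool  # hypothetical flag: could the judge answer this itself?


def route(j: Judgment) -> str:
    """Return "auto" to accept the judge's verdict, else "human"."""
    if not j.in_judge_domain:
        return "human"  # niche knowledge: fluent-and-wrong may pass unnoticed
    if not j.orders_agree or j.verdict == "tie":
        return "human"  # inconsistent or borderline: low-confidence signal
    return "auto"
```

The point of making the rule explicit is that you can count how many cases land in each bucket and check that the human queue stays small enough to actually get reviewed.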
The honest summary is that LLM-as-judge is a measurement instrument, and you don't trust an instrument you haven't calibrated. Build it carefully — pairwise, position-swapped, rubric-driven, reasoning-first — and check it against human labels before you believe a single score it gives you. Do that and it'll carry most of your eval load cheaply. Skip the calibration and you've built a confident, biased, scalable way to be wrong about your own product. The judge is a model. Evaluate it like one.