Gen AI Foundations · May 30, 2026 · 5 min · llm, deeplearning

LLM-as-Judge

Using one language model to grade another feels like asking the fox to audit the henhouse and trust the report. The strange thing is that it works often enough to be one of the most useful tools in the eval kit — and it fails in ways subtle enough to quietly corrupt every number you report. Both are true. The skill is knowing which situation you're in, and the only way to know is to treat the judge as a model you must also evaluate.

Why anyone does this

Human evaluation is the gold standard and it doesn't scale. You can't put a person on every output of a system serving thousands of requests an hour, and you certainly can't do it on every commit. So the appeal of an LLM judge is obvious: it's fast, it's cheap relative to humans, and — this is the part that makes it viable — on many tasks it agrees with human raters at a rate that's genuinely useful.

The 2023 MT-Bench / Chatbot Arena work by Zheng and colleagues put numbers on this. A strong judge model, asked to compare two responses, agreed with human preferences at roughly the rate that two humans agree with each other — around 80%. That's the result that launched a thousand eval pipelines. If the judge agrees with people about as often as people agree with people, then for triage, ranking, and regression-catching, it's good enough to lean on.

"Good enough to lean on" is not "trustworthy by default," though. The same body of work catalogued the ways judges go wrong, and they're systematic, not random.

The biases, and they're real

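A random error you can average away. A bias skews every score in one direction, and LLM judges have several documented ones. The MT-Bench work names position bias (favoring an answer because of where it sits in the prompt), verbosity bias (favoring longer answers), and self-enhancement bias (favoring answers the judge model itself produced).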

None of these mean the judge is useless. They mean a naive judge — "here are two answers, which is better?" — is measuring its own biases as much as the responses.

How to make a judge you can trust

The fixes are mostly about removing the judge's freedom to be sloppy. In rough order of impact:

Prefer pairwise over pointwise. Asking "is A better than B?" is more reliable than asking "rate A from 1 to 10," because relative judgments are more stable than absolute ones. A judge struggles to consistently anchor "this is a 7," but it can usually tell which of two answers is stronger.

Swap positions and require agreement. Run every comparison twice, A-then-B and B-then-A, and count a verdict only if it is the same in both orderings. Disagreement between the two orderings is position bias caught in the act; treat those cases as ties or escalate them. This one check neutralizes position bias, the failure mode most likely to flip a pairwise verdict.
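The swap-and-check step fits in a few lines. This is a minimal sketch, not a reference implementation: the `Judge` callable is a hypothetical interface (it takes the prompt and two responses and answers "first", "second", or "tie"), and you would wire it to whatever model client you actually use.

```python
from typing import Callable

# Hypothetical judge interface (an assumption, not a real library API):
# judge(prompt, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_with_swap(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> str:
    """Judge A vs. B in both orderings; return "A", "B", or "tie"."""
    forward = judge(prompt, resp_a, resp_b)   # A is shown first
    backward = judge(prompt, resp_b, resp_a)  # B is shown first

    # Translate positional answers back to the underlying responses.
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")

    if forward_pick == backward_pick:
        return forward_pick
    # The verdict changed when only the order changed: position bias.
    # Here we record a tie; escalating to a human is the other option.
    return "tie"
```

A judge that always prefers whatever it sees first comes out as "tie" on every pair, which is the point: its preference carried no information about the responses.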

Give a rubric, not a vibe. "Rate helpfulness 1–5" invites leniency. "Score 1 if it answers the question with no factual errors, 0 otherwise; a factual error is X, Y, Z" gives the judge something to hold onto. The more the rubric spells out what each score means with concrete criteria, the less the judge falls back on length and style.
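One way to keep a rubric concrete is to store it as data and render the judge prompt from it, so the criteria cannot drift between runs. The sketch below is an assumption about how you might structure this; `BinaryRubric`, its fields, and the prompt wording are all illustrative, and the error definitions are placeholders you would replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryRubric:
    """A 0/1 rubric: one pass criterion plus explicit error definitions."""
    criterion: str                       # what earns a score of 1
    error_definitions: tuple[str, ...]   # what counts as a failure

    def to_prompt(self, question: str, answer: str) -> str:
        # Spell every error definition out so the judge has concrete
        # criteria instead of an open-ended "rate helpfulness".
        errors = "\n".join(f"- {e}" for e in self.error_definitions)
        return (
            f"Score 1 if {self.criterion}; otherwise score 0.\n"
            f"Count any of the following as a factual error:\n{errors}\n\n"
            f"Question:\n{question}\n\n"
            f"Answer:\n{answer}\n"
        )


# Illustrative placeholder definitions, not a recommended rubric.
rubric = BinaryRubric(
    criterion="it answers the question with no factual errors",
    error_definitions=(
        "a stated number that contradicts the reference material",
        "a claim the reference material does not support",
    ),
)
prompt_text = rubric.to_prompt("When was the service launched?", "It launched in 2019.")
```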

Make it reason first. G-Eval (2023) showed that having the judge produce a chain of reasoning before its verdict improves alignment with humans — the judge that explains itself judges better than the judge that blurts a number. Structure the output so the rationale comes before the score.
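Reasoning-first only helps if you enforce it. A minimal sketch, assuming a plain-text output format of my own invention (`Reasoning:` followed by `Verdict:`): the parser accepts a response only when the rationale comes before the verdict, and returns `None` otherwise so the caller can retry or escalate.

```python
import re
from typing import NamedTuple, Optional

# Assumed output format to include in the judge prompt (illustrative only).
OUTPUT_FORMAT = (
    "Reasoning: <step-by-step comparison against the rubric>\n"
    "Verdict: <A, B, or tie>"
)


class Judgment(NamedTuple):
    reasoning: str
    verdict: str  # "A", "B", or "tie"


# The verdict must be the last thing in the response, after the reasoning.
_PATTERN = re.compile(
    r"Reasoning:\s*(?P<reasoning>.+?)\s*Verdict:\s*(?P<verdict>A|B|tie)\s*$",
    re.DOTALL | re.IGNORECASE,
)


def parse_judgment(raw: str) -> Optional[Judgment]:
    """Return the parsed judgment, or None if the format was not followed."""
    match = _PATTERN.search(raw.strip())
    if match is None:
        return None  # missing reasoning, verdict first, or unknown verdict
    verdict = match.group("verdict")
    verdict = "tie" if verdict.lower() == "tie" else verdict.upper()
    return Judgment(match.group("reasoning"), verdict)
```

Rejecting malformed output rather than guessing at it keeps a blurted verdict from slipping into your scores.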

Anchor to humans. This is the step people skip and shouldn't. Hand-label a few dozen cases, run your judge on them, and measure agreement. If the judge and your humans agree 85% of the time, you can trust the judge's verdicts at roughly that confidence. If they agree 55% of the time on a two-way choice, where guessing alone gets 50%, your judge is a coin flip in a lab coat and every downstream number it produced is noise. You can't know which without the human anchor set.
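Measuring the anchor set takes very little code. The sketch below computes raw agreement and, as an addition the article does not itself prescribe, Cohen's kappa, which subtracts the agreement you would expect by chance. That makes the "coin flip" case visible as a kappa near zero. Function names and label values are illustrative.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of cases where the judge label equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: probability both raters pick the same label
    # if each labelled independently at their own base rates.
    expected = sum(
        judge_counts[label] * human_counts[label] for label in judge_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label everywhere; kappa is
        # undefined, so report 0.0 (no evidence beyond chance).
        return 0.0
    return (observed - expected) / (1 - expected)


judge_labels = ["A", "A", "B", "B"]
human_labels = ["A", "B", "B", "B"]
print(agreement_rate(judge_labels, human_labels))  # 0.75
print(cohens_kappa(judge_labels, human_labels))    # 0.5
```

With only a few dozen labels, treat either number as a rough estimate, and grow the anchor set if a decision hinges on it.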

Figure: Calibrating an LLM judge with position-swapped runs and a human anchor set. Run each comparison in both orderings, keep consistent verdicts, and calibrate against human labels.

When it lies

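Even a well-built judge has a competence ceiling, and it's exactly where you most want help. A judge can only reliably evaluate what it could do itself. So the cases at or beyond the judge's own ability are the ones where its verdicts deserve the least trust.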

For these, the judge is a filter, not a verdict. Use it to triage the easy 80%, route the hard cases and the low-confidence ones to a human, and never let a judge be the final word on something it couldn't have produced.
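Treating the judge as a filter can be as simple as a routing function. This sketch makes two assumptions of its own: "low confidence" means the judge returned a tie or changed its verdict when the answer order was swapped, and "hard" is a flag your pipeline sets upstream (the `hard_case` field is hypothetical).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedItem:
    item_id: str
    forward_verdict: str     # "A", "B", or "tie" with A shown first
    backward_verdict: str    # "A", "B", or "tie" with B shown first
    hard_case: bool = False  # hypothetical flag: beyond the judge's competence


def route(item: JudgedItem) -> str:
    """Return "auto" to accept the judge's verdict, "human" to escalate."""
    if item.hard_case:
        return "human"  # never let the judge be the final word here
    if item.forward_verdict != item.backward_verdict:
        return "human"  # verdict flipped with the ordering
    if item.forward_verdict == "tie":
        return "human"  # judge could not separate the two answers
    return "auto"


def split_queue(items: list[JudgedItem]) -> tuple[list[JudgedItem], list[JudgedItem]]:
    """Partition judged items into (auto_accepted, needs_human)."""
    auto = [i for i in items if route(i) == "auto"]
    human = [i for i in items if route(i) == "human"]
    return auto, human
```

The fraction that lands in the human queue is itself worth tracking: if it climbs well past what you budgeted for, the judge is being asked to do more than it can.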

The honest summary is that LLM-as-judge is a measurement instrument, and you don't trust an instrument you haven't calibrated. Build it carefully — pairwise, position-swapped, rubric-driven, reasoning-first — and check it against human labels before you believe a single score it gives you. Do that and it'll carry most of your eval load cheaply. Skip the calibration and you've built a confident, biased, scalable way to be wrong about your own product. The judge is a model. Evaluate it like one.
