Your model scores 89 on MMLU. Your users are still filing angry tickets. Those two facts are not in tension — they're barely related. The public benchmark measured something. It just wasn't your product working on your traffic, and the sooner you stop conflating the two, the sooner you build the one asset competitors can't copy: an eval suite that encodes what "good" means for your specific problem.
This is the meta-trend worth betting on. The industry is moving from "how does this model rank on the leaderboard?" to "does this change break my regression suite?" The leaderboard is a launch comparison. The regression suite is how you ship on a Tuesday without fear.
Why public benchmarks stopped being enough
Public benchmarks were never useless. HELM (2022) and MMLU (2020) gave the field a shared yardstick when there was none, and they still help when you're choosing a base model to start from. But three forces hollow them out as a basis for your decisions.
Saturation. When every frontier model scores in the high 80s or 90s on a benchmark, the benchmark has stopped discriminating. The gap between the top entries is inside the noise, and a one-point difference tells you nothing about which model handles your tax-form extraction better.
Contamination. Benchmarks are public, which means their questions leak into training data, on purpose or not. A model can score well by having effectively memorized the test. Researchers have warned explicitly about benchmark cheating (Zhou and colleagues, 2023, put a fine point on it), and once a test set is on the internet, you can never fully trust a high score on it again.
Distribution mismatch. This is the big one. MMLU asks academic multiple-choice questions. Your users paste in half-formed support requests, malformed CSVs, and prompts that switch between three languages mid-sentence. A model's score on the former is weak evidence about its behavior on the latter. The benchmark and your workload are different distributions, and the model is only ever as good as it is on yours.
What a custom eval actually is
Strip away the tooling and a custom eval is three things:
- A dataset of cases drawn from your real problem — inputs that look like your traffic, each paired with a notion of what a good output is.
- A grader that scores a model's output on each case.
- A harness that runs all of it on demand and reports a number you trust.
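To make those three pieces concrete, here is a minimal sketch in Python. Everything in it is illustrative: `Case`, `exact_match`, `run_eval`, and `toy_model` are hypothetical names, and `toy_model` stands in for whatever call actually reaches your model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    input: str     # what the model sees, shaped like real traffic
    expected: str  # what a good output looks like for this input


# A grader takes (model_output, expected) and returns True on a pass.
Grader = Callable[[str, str], bool]


def exact_match(output: str, expected: str) -> bool:
    """Cheapest possible grader: strings must match after trimming whitespace."""
    return output.strip() == expected.strip()


def run_eval(model: Callable[[str], str], cases: list[Case], grader: Grader) -> float:
    """The harness: run every case through the model and return the pass rate."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(grader(model(c.input), c.expected) for c in cases)
    return passed / len(cases)


# Hypothetical stand-in for a real model call (assumption, not a real API).
def toy_model(prompt: str) -> str:
    return "42.00" if "invoice" in prompt else "unknown"


cases = [
    Case("invoice total for order A?", "42.00"),
    Case("refund status for order B?", "pending"),
]
score = run_eval(toy_model, cases, exact_match)
print(f"pass rate: {score:.2f}")
```

The harness is deliberately boring. Swapping the model or the grader is a one-argument change, which is what lets the same cases run against every future change.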
The grader is where people overthink it. You don't need an LLM judge for everything. Match the grader to the task:
- Exact match / regex for extraction, classification, structured fields. Did it pull the right invoice total? Boolean. Cheap. Deterministic.
- Embedding similarity for "is this answer semantically close to the reference," when wording can vary.
- Programmatic checks for things with rules — valid JSON, code that compiles, a SQL query that returns the right rows.
- LLM-as-judge only for the genuinely subjective — tone, helpfulness, "did this summary capture the key point" — and even then with a rubric and calibration (its own topic).
Start with the cheapest grader that captures the failure you care about. A pile of exact-match cases on real inputs beats an elaborate judge on synthetic ones.
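The cheaper graders in that list fit in a few lines each. The sketch below uses only the Python standard library; the function names and the 0.8 threshold are assumptions for illustration, and the similarity grader expects vectors you have already computed with whatever embedding model you use.

```python
import json
import math
import re


def regex_grader(output: str, pattern: str) -> bool:
    """Pass if the output contains a match for the pattern."""
    return re.search(pattern, output) is not None


def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Programmatic check: output parses as a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_grader(out_vec: list[float], ref_vec: list[float],
                      threshold: float = 0.8) -> bool:
    """Pass if the output embedding is close enough to the reference embedding.

    The threshold is an assumption; tune it against cases you have labeled by hand.
    """
    return cosine_similarity(out_vec, ref_vec) >= threshold


print(regex_grader("Total due: $1,204.50", r"\$1,204\.50"))
print(valid_json_with_keys('{"total": 42, "currency": "USD"}', {"total", "currency"}))
print(valid_json_with_keys('{"total": 42', {"total"}))
```

All three are deterministic, so a failing case fails the same way on every run, which is most of what you want from a regression test.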
The flywheel that makes it a moat
Here's the part that compounds. Every production failure is a free eval case. A user reports a bad answer; you capture the input, write down what the right output should have been, and drop it into the suite. Now that exact failure can never silently come back — the next model, the next prompt tweak, the next dependency bump all have to pass it.
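Mechanically, the capture step can be as small as appending a line to a file. The sketch below assumes the suite lives in a JSONL file, one case per line; the file name, the field names, and the helpers `add_regression_case` and `load_suite` are all hypothetical.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def add_regression_case(suite_path: Path, user_input: str, expected: str, note: str) -> dict:
    """Append one production failure to the suite as a permanent test case."""
    case = {
        "input": user_input,
        "expected": expected,
        "note": note,                        # why this case exists (ticket, incident)
        "added": date.today().isoformat(),
    }
    with suite_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case, ensure_ascii=False) + "\n")
    return case


def load_suite(suite_path: Path) -> list[dict]:
    """Read every case back, skipping blank lines."""
    with suite_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "regressions.jsonl"  # hypothetical file name
    add_regression_case(suite, "Total on invoice #A-17?", "1,204.50",
                        "user-reported wrong total")
    add_regression_case(suite, "¿Cuál es el estado de mi reembolso?", "pending",
                        "answered in the wrong language")
    loaded = load_suite(suite)

print(len(loaded), "cases in suite")
```

An append-only file kept in version control gives you a history of when each lesson was learned, and the `note` field keeps the reason attached to the case long after the ticket is closed.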
Run that loop for six months and you have a few hundred cases that encode, in executable form, every hard-won lesson about your domain — the weird inputs, the failure modes specific to your users, the edge cases nobody anticipated. That set is the moat. A competitor can read your prompts (they'll leak eventually) and copy your model choice (it's public). They cannot copy the accumulated judgment baked into your eval set, because it came from your traffic and your users' complaints. The prompt is replaceable. The eval suite is the institutional memory.
Frameworks like OpenAI's open-source Evals (2023) gave a template for the harness, but the harness was never the hard part. The dataset is. The grader is. The discipline of turning every incident into a permanent test is.
The traps
A custom eval done badly gives false confidence, which is worse than no eval:
- Too small. Twelve cases will pass or fail on noise. You want enough that a real regression is visible above the variance — dozens at minimum per behavior you care about, more for anything subtle.
- Leakage into prompts. If your eval cases end up as few-shot examples in the prompt, you're testing memorization, not capability. Keep the eval set walled off from what the system sees at runtime.
- Overfitting to the suite. If you tune prompts until the eval is perfect, you've started gaming your own benchmark — the same contamination problem, in miniature. Refresh the set with new production cases regularly so it keeps measuring reality, not your last optimization.
- Grading the wrong thing. An exact-match grader on a task where wording legitimately varies will fail correct answers and tell you nothing useful. Pick the grader that matches how the output is actually judged in production.
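For the "too small" trap, a back-of-the-envelope calculation shows how much a pass rate wobbles from sampling alone. This sketch treats each case as an independent pass/fail trial and uses the normal approximation to the binomial, which is rough at small sizes; the 0.9 pass rate and the suite sizes are illustrative assumptions, not measurements.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_cases: int) -> float:
    """Standard error of a pass rate measured on n independent pass/fail cases."""
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


# Approximate 95% margin (1.96 standard errors) around a 0.90 pass rate.
for n in (12, 50, 200):
    margin = 1.96 * pass_rate_standard_error(0.9, n)
    print(f"n={n:>3}: 0.90 +/- {margin:.3f}")
```

At twelve cases the margin is on the order of 17 percentage points, so a drop from 0.90 to 0.83 is indistinguishable from luck. At a few hundred cases the same drop starts to stand out.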
The reframe is the whole point. Stop asking which model is best in the abstract — that question has a leaderboard and the leaderboard doesn't know your users. Start asking whether this specific change makes your specific product better or worse on cases you've decided are the ones that matter. The teams that win at AI products in 2026 aren't the ones with the cleverest prompts. They're the ones who turned their failures into a test suite nobody else can run.