Gen AI Foundations · May 26, 2026 · 5 min · llm deeplearning eval

Custom Evals Are the Moat

Your model scores 89 on MMLU. Your users are still filing angry tickets. Those two facts are not in tension — they're barely related. The public benchmark measured something. It just wasn't your product working on your traffic, and the sooner you stop conflating the two, the sooner you build the one asset competitors can't copy: an eval suite that encodes what "good" means for your specific problem.

This is the meta-trend worth betting on. The industry is moving from "how does this model rank on the leaderboard?" to "does this change break my regression suite?" The leaderboard is a launch comparison. The regression suite is how you ship on a Tuesday without fear.

Why public benchmarks stopped being enough

Public benchmarks were never useless. HELM (2022) and MMLU (2020) gave the field a shared yardstick when there was none, and they still help when you're choosing a base model to start from. But three forces hollow them out as a basis for your decisions.

Saturation. When every frontier model scores in the high 80s or 90s on a benchmark, the benchmark has stopped discriminating. The gap between the top entries is inside the noise, and a one-point difference tells you nothing about which model handles your tax-form extraction better.
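To put a rough number on "inside the noise", here is a minimal sketch using the normal approximation for a measured accuracy. The 1,000-question benchmark size and the 90% score are hypothetical figures chosen for illustration, not properties of any particular benchmark.

```python
import math

def accuracy_margin(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for an accuracy on n_items.

    Uses the normal approximation to the binomial, which is a simplification.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)

# Hypothetical benchmark: 1,000 questions, one model scoring 90%.
margin = accuracy_margin(0.90, 1000)
print(f"+/- {margin * 100:.1f} points")  # prints "+/- 1.9 points"
```

Under those assumptions, a one-point gap between two models sits well within the sampling error of either score.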

Contamination. Benchmarks are public, which means their questions leak into training data, on purpose or not. A model can score well by having effectively memorized the test. Researchers have warned explicitly about benchmark cheating (Zhou and colleagues, 2023, put a fine point on it), and once a test set is on the internet, you can never fully trust a high score on it again.

Distribution mismatch. This is the big one. MMLU asks academic multiple-choice questions. Your users paste in half-formed support requests, malformed CSVs, and prompts in three languages mid-sentence. A model's score on the former predicts almost nothing about its behavior on the latter. The benchmark and your workload are different distributions, and the model is only ever as good as it is on yours.

What a custom eval actually is

Strip away the tooling and a custom eval is three things:

  1. A dataset of cases drawn from your real problem — inputs that look like your traffic, each paired with a notion of what a good output is.
  2. A grader that scores a model's output on each case.
  3. A harness that runs all of it on demand and reports a number you trust.
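The three pieces above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `Case`, `run_suite`, `fake_model`, and the two sample cases are all hypothetical names and data, and `fake_model` stands in for whatever call reaches your real model.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    case_id: str
    prompt: str    # an input that looks like real traffic
    expected: str  # what a good output is for this input

def run_suite(
    cases: list[Case],
    model: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> tuple[float, dict[str, bool]]:
    """Run every case through the model, grade it, and return the pass rate."""
    if not cases:
        raise ValueError("eval suite is empty")
    results = {c.case_id: grader(model(c.prompt), c.expected) for c in cases}
    return sum(results.values()) / len(results), results

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption for the sketch).
    return "42" if "answer" in prompt else "unknown"

cases = [
    Case("c1", "What is the answer?", "42"),
    Case("c2", "Total due on invoice 17?", "118.00"),
]
pass_rate, results = run_suite(cases, fake_model, exact_match)
print(pass_rate)  # 0.5
```

Keeping the per-case results alongside the aggregate matters: the pass rate tells you whether to worry, the per-case map tells you where.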

The grader is where people overthink it. You don't need an LLM judge for everything. Match the grader to the task:

Start with the cheapest grader that captures the failure you care about. A pile of exact-match cases on real inputs beats an elaborate judge on synthetic ones.
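As a sketch of what "cheap" can look like, here are three deterministic graders that need no model call. The function names and the sample strings are illustrative assumptions; which grader fits which task is a judgment you make per case.

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    """Pass when the output equals the expected answer, ignoring case and padding."""
    return output.strip().lower() == expected.strip().lower()

def matches_pattern(output: str, pattern: str) -> bool:
    """Pass when the output contains something shaped like the pattern."""
    return re.search(pattern, output) is not None

def json_field_equals(output: str, field: str, expected: object) -> bool:
    """Pass when the output is a JSON object whose field has the expected value."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # unparseable output is a failure, not a crash
    return isinstance(data, dict) and data.get(field) == expected

print(exact_match(" Yes ", "yes"))                                   # True
print(matches_pattern("Refund issued: RF-20931", r"RF-\d{5}"))       # True
print(json_field_equals('{"total": 118.0}', "total", 118.0))         # True
print(json_field_equals("Sure! The total is 118.", "total", 118.0))  # False
```

The last line is the kind of failure a structured-output product cares about: the answer is right, but the format would break whatever consumes it.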

The flywheel that makes it a moat

Here's the part that compounds. Every production failure is a free eval case. A user reports a bad answer; you capture the input, write down what the right output should have been, and drop it into the suite. Now that exact failure can never silently come back — the next model, the next prompt tweak, the next dependency bump all have to pass it.
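Capturing a failure can be as small as appending one line to a file. The sketch below assumes the suite is stored as JSON Lines and that each case carries a stable id such as a ticket number; the file layout, field names, and `add_regression_case` helper are all hypothetical.

```python
import json
from pathlib import Path

def load_cases(suite_path: str) -> list[dict]:
    """Read every case from a JSON Lines suite file (empty list if missing)."""
    path = Path(suite_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def add_regression_case(
    suite_path: str, case_id: str, user_input: str, expected_output: str, note: str = ""
) -> bool:
    """Append a production failure to the suite. Returns False if already recorded."""
    if any(case["case_id"] == case_id for case in load_cases(suite_path)):
        return False  # same incident reported twice: keep one copy
    record = {
        "case_id": case_id,
        "input": user_input,          # the exact input the user sent
        "expected": expected_output,  # what the right answer should have been
        "note": note,                 # why it failed, for whoever reads it later
    }
    with Path(suite_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return True
```

A call like `add_regression_case("suite.jsonl", "ticket-4812", raw_input, correct_answer)` at the end of an incident review is enough to make the case permanent; the ticket id and variable names here are placeholders.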

[Figure: The eval flywheel. Production failures become a CI gate on every change; every production failure becomes a permanent regression case competitors cannot copy.]
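One way to turn the suite into a CI gate is to compare per-case results against the last known-good run and fail the build on any case that used to pass. This is a sketch under assumptions: results are stored as a map from case id to pass/fail, and `find_regressions` and `gate` are hypothetical helpers rather than part of any framework.

```python
def find_regressions(baseline: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Case ids that passed on the baseline but fail, or are missing, in the current run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )

def gate(baseline: dict[str, bool], current: dict[str, bool]) -> int:
    """Return a process exit code: 0 if nothing regressed, 1 otherwise."""
    regressions = find_regressions(baseline, current)
    for case_id in regressions:
        print(f"REGRESSION: {case_id}")
    return 1 if regressions else 0

# Hypothetical results: the change fixes c3 but breaks c2.
baseline = {"c1": True, "c2": True, "c3": False}
current = {"c1": True, "c2": False, "c3": True}
exit_code = gate(baseline, current)  # prints "REGRESSION: c2"
print(exit_code)  # 1
```

In a pipeline the script would end with `sys.exit(gate(baseline, current))`. Note the design choice: the aggregate pass rate is unchanged in this example (two of three before and after), so a gate on the average alone would have let the broken case through.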

Run that loop for six months and you have a few hundred cases that encode, in executable form, every hard-won lesson about your domain — the weird inputs, the failure modes specific to your users, the edge cases nobody anticipated. That set is the moat. A competitor can read your prompts (they'll leak eventually) and copy your model choice (it's public). They cannot copy the accumulated judgment baked into your eval set, because it came from your traffic and your users' complaints. The prompt is replaceable. The eval suite is the institutional memory.

Frameworks like OpenAI's open-source Evals (2023) gave a template for the harness, but the harness was never the hard part. The dataset is. The grader is. The discipline of turning every incident into a permanent test is.

The traps

A custom eval done badly gives false confidence, which is worse than no eval.

The reframe is the whole point. Stop asking which model is best in the abstract — that question has a leaderboard and the leaderboard doesn't know your users. Start asking whether this specific change makes your specific product better or worse on cases you've decided are the ones that matter. The teams that win at AI products in 2026 aren't the ones with the cleverest prompts. They're the ones who turned their failures into a test suite nobody else can run.
