Evaluation and Monitoring

"It works." On whose machine, against which inputs, measured how?

Ask those three questions about most agents and the answers are: mine, the five I happened to try, and "it looked good." That's not evaluation. That's a vibe with a deploy button. And it holds up exactly until real traffic hits inputs you never imagined, at which point "it works" quietly becomes "it worked, sometimes, and nobody was counting."

Evaluating an agent is genuinely harder than evaluating a model, and the reason is specific: an agent doesn't produce an answer, it produces a path. There are usually many right paths and many ways to fail along a right-looking one. You have to grade the journey, not just the destination — and that breaks most of the eval instincts people bring from regular software.

Two activities people keep merging

Evaluation and monitoring sound like one thing. They're two, separated by when.

Evaluation is before. Offline, against a fixed dataset, in development. You have a set of inputs and a notion of what good looks like, you run the agent, you score it. The point is to catch regressions before they ship — change a prompt, re-run the eval, see if the number moved the wrong way.
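
The "change a prompt, re-run the eval, see if the number moved" loop might look something like this. It's a minimal sketch, not a framework: `run_eval`, `regressed`, the exact-match scoring, and the 0.02 tolerance are all illustrative assumptions, and the two toy agents stand in for two versions of a prompt.

```python
from typing import Callable

# Hypothetical eval case: (input, expected answer).
EvalCase = tuple[str, str]

def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases where the agent's answer matches exactly."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)

def regressed(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True if the candidate scored more than `tolerance` below the baseline."""
    return candidate < baseline - tolerance

# Toy stand-ins for the agent before and after a prompt change.
def agent_v1(prompt: str) -> str:
    return prompt.upper()

def agent_v2(prompt: str) -> str:
    # The "improved" version silently breaks on longer inputs.
    return prompt.upper() if len(prompt) < 4 else prompt

cases = [("abc", "ABC"), ("hello", "HELLO"), ("hi", "HI"), ("world", "WORLD")]
baseline = run_eval(agent_v1, cases)
candidate = run_eval(agent_v2, cases)

if regressed(baseline, candidate):
    print(f"blocked: score fell from {baseline:.2f} to {candidate:.2f}")
```

Exact match only works when there's a single right string. Swap in whatever scorer fits the task; the gate stays the same.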

Monitoring is after. Online, against live production traffic, with no answer key. You can't score correctness directly (nobody labeled the user's real question), so you watch proxies — latency, cost, failure rates, how often a human had to step in, thumbs-down clicks. The point is to catch the live system degrading.
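
Watching proxies can start as simply as summarizing a window of recent runs and comparing against limits. A rough sketch, assuming a hypothetical `RunRecord` shape and made-up thresholds; real limits would come from your own baseline, not from this example.

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class RunRecord:
    """One production run, reduced to the proxies we can observe (hypothetical shape)."""
    latency_s: float
    cost_usd: float
    errored: bool
    human_handoff: bool
    thumbs_down: bool

def proxy_metrics(runs: list[RunRecord]) -> dict[str, float]:
    """Summarize a window of runs into proxy metrics."""
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs to compute percentiles")
    latencies = [r.latency_s for r in runs]
    return {
        # Last of the 19 cut points that split the data into 20 groups = p95.
        "p95_latency_s": quantiles(latencies, n=20, method="inclusive")[-1],
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        "error_rate": sum(r.errored for r in runs) / n,
        "handoff_rate": sum(r.human_handoff for r in runs) / n,
        "thumbs_down_rate": sum(r.thumbs_down for r in runs) / n,
    }

def breaches(metrics: dict[str, float], limits: dict[str, float]) -> list[str]:
    """Names of metrics that exceed their limit, sorted for stable output."""
    return sorted(name for name, limit in limits.items() if metrics[name] > limit)

window = [
    RunRecord(1.0, 0.02, False, False, False),
    RunRecord(1.2, 0.03, False, True, False),
    RunRecord(0.9, 0.02, False, False, True),
    RunRecord(6.0, 0.09, True, True, True),
]
metrics = proxy_metrics(window)
limits = {"p95_latency_s": 5.0, "error_rate": 0.05, "handoff_rate": 0.6}  # assumed values
alerts = breaches(metrics, limits)
print(alerts)
```

None of these numbers says the agent was right. They only say something changed, which is the most a proxy can do.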

You need both, and they fail differently. An agent that aces your offline eval can still crater in production because real users do things your dataset didn't. An agent that looks healthy in monitoring (fast, cheap, no errors) can be confidently wrong in ways no latency graph reveals. One without the other is half a safety net.

Grade the trajectory, not just the answer

The hard, agent-specific part. Say the task was "find the cheapest flight and book it." The agent returns "Booked! $340." Correct?

You can't tell from the answer. Maybe it checked three sites and genuinely found the cheapest. Maybe it checked one, grabbed the first result, and got lucky. Same output, completely different quality — and the second agent will betray you the moment luck runs out. So you evaluate at multiple levels:

Final output. Did it get the right answer? The thing regular evals check.
Trajectory. Did it take a sensible path? Right tools, in a reasonable order, without flailing? An agent that reaches the answer in 4 steps and one that thrashes for 40 are not equally good even when both arrive.
Per-step. Did each tool call have valid arguments? Did each step actually move toward the goal, or in a circle?
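
Those three levels can be scored separately for the flight example. What follows is a sketch under assumptions: the tool names (`search_flights`, `book_flight`), their required arguments, and the step budget are invented for illustration, and a real rubric would have more checks than this.

```python
import json
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict

@dataclass
class Trajectory:
    steps: list[Step]
    final_output: str

# Hypothetical tool schemas: tool name -> argument names that must be present.
REQUIRED_ARGS = {
    "search_flights": {"site", "origin", "destination"},
    "book_flight": {"flight_id"},
}

def grade(traj: Trajectory, expected_output: str, max_steps: int = 10) -> dict[str, bool]:
    """Score one run at the output, trajectory, and per-step levels."""
    tools = [s.tool for s in traj.steps]
    # Serialize args so identical calls compare equal regardless of key order.
    calls = [(s.tool, json.dumps(s.args, sort_keys=True)) for s in traj.steps]
    first_booking = tools.index("book_flight") if "book_flight" in tools else len(tools)
    sites_before_booking = {
        s.args["site"]
        for s in traj.steps[:first_booking]
        if s.tool == "search_flights" and "site" in s.args
    }
    return {
        # Final output: did it get the right answer?
        "output_correct": traj.final_output == expected_output,
        # Trajectory: did it compare options, stay within budget, avoid loops?
        "compared_sites": len(sites_before_booking) >= 2,
        "within_budget": len(traj.steps) <= max_steps,
        "no_repeated_calls": len(set(calls)) == len(calls),
        # Per-step: known tool, required arguments present.
        "all_args_valid": all(
            s.tool in REQUIRED_ARGS and REQUIRED_ARGS[s.tool].issubset(s.args)
            for s in traj.steps
        ),
    }

def search(site: str) -> Step:
    return Step("search_flights", {"site": site, "origin": "SFO", "destination": "JFK"})

careful = Trajectory(
    [search("a"), search("b"), search("c"), Step("book_flight", {"flight_id": "b-17"})],
    "Booked! $340.",
)
lucky = Trajectory([search("a"), Step("book_flight", {"flight_id": "a-3"})], "Booked! $340.")

careful_grade = grade(careful, "Booked! $340.")
lucky_grade = grade(lucky, "Booked! $340.")
```

Both runs pass `output_correct`. Only the trajectory check separates the agent that compared three sites from the one that got lucky.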

[Figure: a trace feeding offline eval and live monitoring. One trace, two jobs: score trajectories offline to block regressions, watch proxies in production.]

The thing that makes all of this possible is the same on both sides: tracing. You can't evaluate or monitor what you can't see. Capture the full step tree of every run — inputs, each thought, each tool call and its result, the final output. Tools like LangSmith are built for this, and the OpenTelemetry GenAI conventions give it a vendor-neutral schema so agent traces sit in the same observability stack as everything else. No traces, no evaluation worth the name — you're grading a black box by its cover.
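
To make "the full step tree" concrete, here is a toy recorder built on nothing but the standard library. It is deliberately not the LangSmith or OpenTelemetry API; `Span`, `Tracer`, and the `kind` labels are hypothetical names, and in practice a tracing SDK would do this job.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field

@dataclass
class Span:
    """One node in the step tree: a run, a thought, or a tool call."""
    name: str
    kind: str
    inputs: dict
    output: object = None
    duration_s: float = 0.0
    children: list["Span"] = field(default_factory=list)

class Tracer:
    """Records nested spans as a tree; the first span opened becomes the root."""

    def __init__(self) -> None:
        self.root: Span | None = None
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str, **inputs):
        s = Span(name, kind, inputs)
        if self._stack:
            self._stack[-1].children.append(s)  # nest under the currently open span
        else:
            self.root = s
        self._stack.append(s)
        start = time.perf_counter()
        try:
            yield s
        finally:
            # Record duration even if the step raised.
            s.duration_s = time.perf_counter() - start
            self._stack.pop()

tracer = Tracer()
with tracer.span("book_cheapest_flight", "run", task="find the cheapest flight and book it") as run:
    with tracer.span("plan", "thought") as thought:
        thought.output = "compare prices across sites, then book"
    with tracer.span("search_flights", "tool", site="site_a") as call:
        call.output = {"price": 340}
    run.output = "Booked! $340."
```

The resulting tree under `tracer.root` is the raw material for both jobs: trajectory graders read the children, monitoring reads the durations.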

Who grades the open-ended stuff

For "find the cheapest flight" you might have a checkable answer. For "write a helpful, friendly support reply" you don't: there's no single correct string. This is where LLM-as-judge comes in: use a capable model to score outputs against a rubric (helpful? accurate? on-brand?). The MT-Bench work reported that a strong judge such as GPT-4 agreed with human preferences more than 80% of the time, roughly the rate at which human raters agreed with each other. That is what makes it practical at scale: you can't put a human on every output, but you can put a model on every output and a human on a sample.

Useful, and not to be trusted blindly. Judge models have known tilts — they can favor longer answers, or whatever sounds confident, or outputs that resemble their own style. So you calibrate: check the judge against human labels on a sample, and if they diverge, fix the rubric before you believe the judge. (There's a whole post coming on where LLM-as-judge works and where it quietly lies — this is the overview.) The rule for now: a judge is an instrument, and an uncalibrated instrument is just a confident guess.
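
One way to run that calibration check is to compare judge and human labels on the same sample, looking at raw agreement and at Cohen's kappa, which discounts the agreement you'd get by chance. A sketch with invented labels; the "good"/"bad" scheme is an assumption, and what counts as close enough is your call.

```python
from collections import Counter

def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of items where judge and human gave the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    p_o = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both pick the same label independently.
    p_e = sum(h_counts[label] * j_counts[label] for label in h_counts.keys() | j_counts.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented sample: eight outputs labeled by a human and by the judge model.
human = ["good", "good", "bad", "bad", "good", "bad", "good", "bad"]
judge = ["good", "good", "bad", "good", "good", "good", "good", "bad"]

rate = agreement_rate(human, judge)
kappa = cohens_kappa(human, judge)
print(f"agreement={rate:.2f} kappa={kappa:.2f}")
```

In this sample every disagreement is the judge saying "good" where the human said "bad". That's the kind of tilt a raw agreement number hides and a look at the disagreements exposes.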

The dataset is the actual asset

Here's the part that decides whether any of this matters: a public benchmark tells you how the agent does on someone else's problem. It says nothing about whether it handles your users' messy, specific, weird inputs.

The teams whose agents actually hold up in production have one thing in common — a curated eval set drawn from their own traffic, especially the failures. Every time the agent screws up in production, that case gets added to the offline set, and now the agent is tested against it forever. The eval set grows, becoming a sharper and sharper picture of how your particular agent fails. That accumulating set, not any leaderboard score, is the moat. It's the difference between "passes a benchmark" and "we know exactly the 200 ways this thing has broken and it can't break that way again."
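
The "every failure becomes a test case" habit is mostly bookkeeping. A minimal sketch, assuming an in-memory eval set keyed by a hash of the normalized input; the field names and the `add_failure` helper are hypothetical, and a real setup would persist this somewhere durable.

```python
import hashlib

def case_id(user_input: str) -> str:
    """Stable ID from the input, ignoring case and extra whitespace."""
    normalized = " ".join(user_input.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

def add_failure(eval_set: dict[str, dict], user_input: str, bad_output: str,
                expected: str, note: str) -> bool:
    """Add a production failure to the eval set. Returns False if already present."""
    cid = case_id(user_input)
    if cid in eval_set:
        return False
    eval_set[cid] = {
        "input": user_input,
        "expected": expected,      # what a human decided the right answer was
        "bad_output": bad_output,  # what the agent actually did, kept for context
        "note": note,
    }
    return True

eval_set: dict[str, dict] = {}
added = add_failure(
    eval_set,
    "Cheapest flight SFO to JFK next Friday",
    "Booked! $340.",
    "Booked! $295.",
    "only checked one site",
)
# The same complaint arriving again, formatted differently, is not a new case.
duplicate = add_failure(
    eval_set,
    "  cheapest flight sfo to jfk   next friday",
    "Booked! $340.",
    "Booked! $295.",
    "duplicate report",
)
```

The `expected` field is the expensive part: someone has to decide what the right answer was. That human judgment is what turns a bug report into an asset.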

Which leaves the question worth sitting with, the one to ask before the next deploy: if your agent started getting subtly worse tomorrow — not crashing, just worse — how long until you noticed, and what exactly would tell you? If the honest answer is "an angry user, eventually," you don't have evaluation. You have hope with good uptime.
