A team swaps in a new reranker and declares the RAG system "better." How do they know? They tried four queries and the answers looked good. That's not evaluation — that's a vibe with a sample size of four. And it's how most RAG systems are "measured," right up until one ships a regression nobody catches because nobody was counting.
The reason RAG evaluation is genuinely hard, harder than evaluating a classifier, is that a wrong answer has two possible culprits and the symptom looks identical. Either retrieval failed — the right evidence never made it into the context — or generation failed — the evidence was right there and the model ignored it, misread it, or made something up anyway. From the outside both produce "bad answer." If your evaluation can't tell them apart, it can't tell you what to fix, and you'll spend a week tuning the retriever when the prompt was the problem.
So the first principle of RAG evaluation: measure the two stages separately before you measure the whole.
Retrieval metrics: did the right evidence show up?
This half you can borrow wholesale from decades of information-retrieval work, and you should, because it's cheap and it doesn't need an LLM. Given a query and a set of documents you've labeled as relevant, you ask of the retrieved list:
- Context recall: of all the relevant chunks that exist, how many did we retrieve? This is the one that catches the silent killer. If average recall is 60%, roughly 40% of the evidence your answers needed never reached the model, so a large share of queries were unwinnable before it wrote a word, and no amount of prompt engineering saves you. Low recall is a retrieval problem, full stop.
- Context precision: of what we retrieved, how much was actually relevant, and did it rank near the top? Low precision means you're padding the context with junk, which costs tokens and (remember lost-in-the-middle) actively degrades generation by burying the good chunk among distractors.
- MRR and NDCG: rank-aware scores that reward putting the right chunk high, not just somewhere. MRR (mean reciprocal rank) averages 1/rank of the first relevant chunk; NDCG (normalized discounted cumulative gain) discounts each relevant chunk by how far down the list it sits. These are what move when you add a reranker, which is exactly why you measure them before and after the reranker.
The unlock here is that these need no LLM and no generated answer. They isolate retrieval completely. If recall is bad, stop reading — fix retrieval first, because everything downstream is built on sand. Most teams skip straight to judging final answers and never compute recall, which is why they keep "fixing" the generator for a retrieval bug.
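The retrieval metrics above fit in a few lines of plain Python. This is a minimal sketch, assuming binary relevance labels and chunk IDs stored as strings; the function names and the sample IDs are illustrative and not taken from any particular library.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the labeled-relevant chunks that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top k retrieved chunks that are labeled relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for chunk_id in top if chunk_id in relevant) / len(top)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain divided by the best possible."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, chunk_id in enumerate(retrieved[:k], start=1)
        if chunk_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: three labeled-relevant chunks, five retrieved.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}

print("recall@5   ", recall_at_k(retrieved, relevant, 5))
print("precision@5", precision_at_k(retrieved, relevant, 5))
print("RR         ", reciprocal_rank(retrieved, relevant))
print("NDCG@5     ", ndcg_at_k(retrieved, relevant, 5))
```

MRR is the mean of `reciprocal_rank` over the whole question set. Running these on the retriever output and again on the reranker output, with the same labels, shows what the reranker changed.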
Generation metrics: did the model use the evidence honestly?
This half is harder because there's no clean label for "good answer," so a tool like RAGAS — the framework that popularized automated RAG eval in 2023 — uses an LLM as the judge and decomposes "good" into measurable pieces:
- Faithfulness: is every claim in the answer supported by the retrieved context? This is the hallucination detector. The judge breaks the answer into individual claims and checks each against the context; the score is the fraction of claims the context supports, so an answer that asserts things the context never said scores low even if those things happen to be true. Faithfulness is arguably the single most important RAG metric, because an unfaithful answer is the exact failure RAG was supposed to prevent.
- Answer relevancy: does the answer actually address the question, or does it wander off into related-but-unasked territory?
- Context relevancy / utilization: of the context we retrieved, how much was relevant to the question, and how much did the answer actually use? Low utilization with a good answer is a hint you're over-retrieving: passing ten chunks when two did the work.
Notice these can be computed without a human-written golden answer, because they check internal consistency between question, context, and answer rather than against a reference. That's what made RAGAS practical: reference-free evaluation you can run on every change.
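To make the shape of a faithfulness score concrete, here is a rough sketch: split the answer into claims, ask a judge whether each claim is supported by the context, and report the supported fraction. It is not how RAGAS implements the metric. The sentence splitter and `toy_judge` are deliberately crude stand-ins (assumptions for illustration) for the LLM calls a real pipeline would make.

```python
import re
from typing import Callable, List

# A judge takes (claim, context) and says whether the context supports the claim.
SupportJudge = Callable[[str, str], bool]


def split_into_claims(answer: str) -> List[str]:
    """Naive claim extraction: one claim per sentence.
    A real pipeline would ask an LLM to extract atomic claims."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [part for part in parts if part]


def faithfulness(answer: str, context: str, is_supported: SupportJudge) -> float:
    """Fraction of the answer's claims that the judge finds supported."""
    claims = split_into_claims(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    """Stand-in for an LLM judge: a claim counts as supported only if
    every word longer than three letters also appears in the context."""
    claim_words = {w for w in re.findall(r"[a-z0-9]+", claim.lower()) if len(w) > 3}
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    return bool(claim_words) and claim_words <= context_words


# Hypothetical example: the second sentence is not in the context.
context = "The warranty covers parts for two years. Labor is covered for ninety days."
answer = "The warranty covers parts for two years. Shipping is free worldwide."

print(faithfulness(answer, context, toy_judge))
```

Note that nothing here needs a reference answer: the score depends only on the answer and the context, which is why it can run on every change.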
You still need a dataset, and that's the real work
Every metric above runs against a set of evaluation questions, and assembling that set is the part that actually decides whether your evaluation is worth anything. Tools are easy; the question set is the moat.
You need questions that look like your real traffic, including the messy, vague, multi-part ones, not just the clean ones you'd demo. For retrieval recall you need relevance labels (which chunks are correct for each question), and that's human work, though you can bootstrap it: have an LLM generate candidate question-answer pairs from your documents, then have a human prune the garbage. Frameworks like RAGAS will synthesize a starter test set for you; treat that as a draft, not a deliverable. A hundred carefully chosen, human-reviewed questions that mirror real usage beat two thousand auto-generated ones that all share the same easy shape, because the auto-generated ones tend to be answerable by exactly the kind of pipeline that generated them, which is the definition of a rigged test.
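One way to keep the draft-versus-deliverable distinction honest is to record it in the data. The sketch below assumes a simple record per question; the field names, tags, and sample rows are hypothetical, not a schema from RAGAS or any other tool.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class EvalExample:
    question: str
    relevant_chunk_ids: Set[str]      # relevance labels used for recall
    origin: str = "human"             # "human" or "synthetic" (assumed values)
    human_reviewed: bool = False      # synthetic drafts start unreviewed
    tags: List[str] = field(default_factory=list)  # e.g. "vague", "multi-part"


def usable_examples(examples: List[EvalExample]) -> List[EvalExample]:
    """Keep only examples a human has reviewed and that carry labels."""
    return [ex for ex in examples if ex.human_reviewed and ex.relevant_chunk_ids]


def tag_coverage(examples: List[EvalExample]) -> Dict[str, int]:
    """Count examples per tag to spot a set that is all one easy shape."""
    return dict(Counter(tag for ex in examples for tag in ex.tags))


# Hypothetical starter set: one reviewed human question, two synthetic drafts.
dataset = [
    EvalExample("how do i reset it and also change plan??", {"c12", "c40"},
                origin="human", human_reviewed=True, tags=["vague", "multi-part"]),
    EvalExample("What is the refund window?", {"c3"},
                origin="synthetic", human_reviewed=True, tags=["clean"]),
    EvalExample("What is the warranty period?", {"c5"},
                origin="synthetic", human_reviewed=False, tags=["clean"]),
]

kept = usable_examples(dataset)
print(len(kept), tag_coverage(kept))
```

Only the reviewed examples feed the metrics; unreviewed synthetic drafts stay in the file until someone prunes or approves them.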
This is also where published work like ARAGOG (a comparative study of RAG techniques) and ARES (an automated RAG evaluation framework) fits: it's useful for sanity-checking which techniques tend to help, but the only evaluation that decides what you ship is the one built on your corpus and your questions. A public leaderboard can't know what your users ask.
The judge is also a model, and it lies sometimes
Here's the honest caveat I won't paper over: the whole generation half of your evaluation leans on an LLM judge, and that judge has the same failure modes as the model it's grading.
It can be inconsistent — score the same answer differently across runs unless you pin temperature low and prompt it tightly. It has biases — a documented tendency to prefer longer, more confident, more verbose answers regardless of correctness, and sometimes to favor outputs from its own model family. It can be fooled by an answer that's fluent and authoritative and wrong, which is precisely the kind of answer you most need it to catch. An LLM judge grading LLM output is, structurally, the fox auditing the henhouse.
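Inconsistency, at least, is easy to measure: score the same example several times and look at the spread. In the sketch below, `judge` is a hypothetical callable that wraps whatever LLM call you use, and `make_noisy_judge` is a simulated stand-in so the example runs without a model.

```python
import random
import statistics
from typing import Callable, Dict

# A judge takes (question, context, answer) and returns a score in [0, 1].
Judge = Callable[[str, str, str], float]


def score_spread(judge: Judge, question: str, context: str, answer: str,
                 runs: int = 5) -> Dict[str, float]:
    """Score one example several times and summarize the variation."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [judge(question, context, answer) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.pstdev(scores),
        "range": max(scores) - min(scores),
    }


def make_noisy_judge(base_score: float, noise: float, seed: int = 0) -> Judge:
    """Simulated judge for illustration: a fixed score plus uniform noise."""
    rng = random.Random(seed)

    def judge(question: str, context: str, answer: str) -> float:
        score = base_score + rng.uniform(-noise, noise)
        return min(1.0, max(0.0, score))

    return judge


spread = score_spread(make_noisy_judge(0.75, 0.1), "q", "ctx", "ans", runs=10)
print(spread)
```

If the spread on identical input is as large as the improvement you're claiming, the improvement is noise until you tighten the judge.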
This doesn't make LLM-as-judge useless; it makes it a measurement instrument that needs calibration, like any other. Calibrate it: have humans score a sample of, say, 50–100 examples, then check that the judge's scores correlate with the humans'. If they don't, fix the judge's prompt before you trust a single number it produces. Watch deltas more than absolutes: as long as the judge model and prompt stay fixed, its bias is roughly constant, so "faithfulness went from 0.71 to 0.78 after this change" is far more trustworthy than "faithfulness is 0.78, which is good." And keep a small human-reviewed set as ground truth that no automated metric gets to overrule.
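The calibration check can be as simple as a rank correlation between human and judge scores on the same examples. The sketch below uses Spearman correlation, which is one reasonable choice rather than the only one, implemented in plain Python; the score lists are made-up illustrations, not real measurements.

```python
import math
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either side has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for six answers: human labels (0/1) and judge scores.
human_scores = [1, 0, 1, 1, 0, 1]
judge_scores = [0.9, 0.4, 0.8, 0.7, 0.2, 0.6]

print("judge vs. human rank correlation:", spearman(human_scores, judge_scores))
```

What counts as "correlated enough" is a judgment call for your team; the point is to compute the number on a human-scored sample before trusting the judge on everything else.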
That's the discipline. Not "RAGAS said 0.8, ship it," but: split retrieval from generation so you know what broke, build a question set that looks like reality, and treat your automated judge as a useful liar you keep honest with periodic human checks. The goal isn't a pretty dashboard. It's being able to answer "is this change actually better?" with something sturdier than four queries and a good feeling — which is the only thing that lets you keep climbing the twenty-point gap this series opened with, instead of wandering it blind.