RAG Evaluation: Measure & Improve Retrieval

RAG evaluation is how you find out whether your retrieval-augmented generation system is actually answering questions well — or just answering them confidently. If your RAG assistant returns wrong, vague, or half-right answers, the fix almost never starts with a better prompt. It starts with measurement. This guide shows how to evaluate a RAG pipeline end to end: the retrieval metrics that tell you whether the right context was even found, the generation metrics that tell you whether the model used it faithfully, and a practical way to build the test set that makes all of it repeatable.

The core idea to hold onto is this: a RAG system has two engines, and a bad answer can come from either one. Retrieval can fail to fetch the right passage, or generation can ignore, misread, or embellish a passage that was retrieved perfectly. Lump them together and you will spend weeks tuning the wrong half. Good RAG evaluation keeps them separate, measures each, and tells you exactly where the answer went wrong.

Why a RAG system fails without telling you

Traditional software fails loudly — an exception, a stack trace, a 500. A RAG system fails quietly. It returns a fluent, plausible, well-formatted paragraph that happens to be wrong. Nobody sees a red error. The user just gets bad information delivered with total confidence, and unless someone happens to know the correct answer, the failure is invisible. That is what makes RAG dangerous in production and why "it seems to work" is not a quality bar you can trust.

The only defence is systematic measurement. You cannot eyeball a hundred answers and call it evaluation, and you certainly cannot ship on the strength of a demo where you cherry-picked the three questions it got right. You need numbers that move when quality moves — and you need to know which number points at which part of the pipeline. This is the same discipline we bring to every LLM optimisation engagement: measure first, tune second.

Split the problem: retrieval versus generation

Before you measure anything, internalise the split. A RAG answer is only as good as two things in sequence. First, did the retriever pull the passages that actually contain the answer? Second, given those passages, did the language model produce an answer that is faithful to them and relevant to the question? These are different failures with different fixes.

If retrieval is broken, no amount of prompt engineering will save you — the model is being asked to answer from the wrong pages. If generation is broken, a perfect retriever is wasted because the model ignores what it was handed. Every metric below belongs to one side of that line, and the first job of evaluation is to tell you which side is bleeding.

Measuring retrieval quality

Retrieval is, at its heart, a ranking problem: given a query, put the right chunks near the top. The metrics that matter are the classic information-retrieval ones, computed against a set of queries where you know which documents are relevant.

Recall@k

Of all the passages that should have been retrieved for a query, what fraction showed up in your top k results? Recall@k is the single most important retrieval metric for RAG, because if the answer-bearing chunk is not in the top k you feed the model, the model cannot possibly ground its answer in it. Low recall is the most common — and most overlooked — cause of bad RAG answers.
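As a minimal sketch of the definition above, recall@k can be computed from two lists of passage IDs. The function name, the ID strings, and the example data are illustrative assumptions, not part of any particular library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labelled-relevant passages that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when no passage is labelled relevant")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical query: two relevant chunks, only one of them in the top 3.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4"]
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(recall_at_k(retrieved, relevant, k=4))  # 1.0
```

Averaging this value over every query in the test set gives the headline recall@k figure.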

Precision@k

Of the k passages you retrieved, what fraction were actually relevant? Precision matters because irrelevant chunks are not harmless filler — they dilute the context, distract the model, and can pull the answer toward a plausible but wrong tangent. High recall with low precision means you are drowning the right passage in noise.
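A matching sketch for precision@k, under the same assumptions (hypothetical passage IDs, a function name chosen for illustration). It divides by k, which is the usual convention; if your retriever can return fewer than k results you may prefer to divide by the number actually returned.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved passages that are labelled relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


# Hypothetical query: two of the four retrieved chunks are relevant.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = ["c2", "c4"]
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
print(precision_at_k(retrieved, relevant, k=1))  # 0.0
```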

MRR and nDCG

Rank position matters, not just presence. Mean Reciprocal Rank (MRR) rewards putting the first relevant result as high as possible — it cares where the first correct hit lands. Normalised Discounted Cumulative Gain (nDCG) goes further, rewarding systems that rank all relevant passages highly and penalising relevant results buried near the bottom. Together they tell you whether your retriever merely contains the answer or actually surfaces it.
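Both rank-aware metrics can be sketched in a few lines. This version assumes binary relevance labels (a passage is relevant or it is not) and unique passage IDs in each result list; graded relevance would need a gain value per passage. All names and data here are illustrative.

```python
import math


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical runs: first hit at rank 2 for one query, rank 1 for the other.
runs = [(["a", "b", "c"], ["b"]), (["x", "y", "z"], ["x"])]
print(mean_reciprocal_rank(runs))             # 0.75
print(ndcg_at_k(["b", "a", "c"], ["b"], k=3))  # 1.0
```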

Read these as a set. Strong recall but weak MRR means the right chunk is in there but ranked too low, and a re-ranker will help. Strong precision but weak recall means your retriever is cautious and missing relevant material, and you need better chunking or hybrid search. The pattern points at the fix.

Measuring generation quality

Assume, for a moment, that retrieval did its job. The model has the right passages in front of it. Now you have to measure whether the answer it wrote actually deserves trust. This is where reference-free frameworks like RAGAS have become standard. RAGAS, introduced in a 2023 paper by Es and colleagues, scores RAG answers without needing a human-written gold answer for every question, which is what makes routine evaluation affordable. (Some metrics in the wider RAGAS toolkit, notably context recall, do compare against a reference answer.) Its published metric definitions map cleanly onto the questions you actually care about.

Faithfulness (groundedness)

Does every claim in the answer trace back to the retrieved context, or did the model invent something? Faithfulness is measured by breaking the answer into individual claims and checking each against the supplied passages. This is your hallucination detector. A confident answer with 70% faithfulness is a confident answer that made up nearly a third of itself — and in a legal, medical, or financial setting that is a liability, not a feature.
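The claim-by-claim check described above reduces to a simple ratio once the claims have been extracted and verified. The sketch below assumes the claims are already split out, and it uses a toy substring check as a stand-in for the real verifier, which in practice would be an LLM judge or an entailment model. Every name here is hypothetical.

```python
def faithfulness_score(claims, contexts, is_supported):
    """Share of claims that the verifier finds supported by the retrieved contexts."""
    if not claims:
        raise ValueError("no claims to score")
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def naive_substring_check(claim, contexts):
    """Toy verifier for illustration only: real systems use an LLM or NLI model."""
    return any(claim.lower() in passage.lower() for passage in contexts)


contexts = ["The warranty lasts 24 months. Returns are accepted within 30 days."]
claims = [
    "the warranty lasts 24 months",
    "returns are accepted within 30 days",
    "shipping is free",  # not in the context, so it counts as unsupported
]
score = faithfulness_score(claims, contexts, naive_substring_check)
print(round(score, 2))  # 0.67
```

Keeping the verifier as a parameter makes it easy to swap the judge without touching the scoring logic.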

Answer relevance

Is the answer actually addressing the question that was asked, or is it a well-grounded reply to a slightly different question? A model can be perfectly faithful to the context and still miss the point. Answer relevance catches the "technically true but unhelpful" response.

Context precision and context recall

These evaluate the retrieved context against the ideal, from the generation side. Context recall asks whether the passages contained everything needed to answer; context precision asks whether the useful passages were ranked above the useless ones. They are the bridge between your retrieval metrics and your answer quality — and when a system scores well on faithfulness but badly here, you have found your bottleneck.
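One way to turn those two questions into numbers is sketched below. It is written in the spirit of the RAGAS-style definitions, but the exact formulas in any given library version may differ, so treat the functions, their names, and the per-item true/false judgements as assumptions for illustration.

```python
def context_precision(relevance_flags):
    """Average of precision@rank taken at each rank where a useful passage appears.

    relevance_flags: one bool per retrieved passage, in ranked order.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def context_recall(attributable_flags):
    """Share of reference-answer statements that the retrieved context supports.

    attributable_flags: one bool per statement in the reference answer.
    """
    if not attributable_flags:
        raise ValueError("no reference statements to check")
    return sum(attributable_flags) / len(attributable_flags)


print(context_precision([True, True, False]))   # 1.0, useful passages ranked first
print(context_precision([False, False, True]))  # about 0.33, useful passage buried
print(context_recall([True, True, False, True]))  # 0.75
```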

LLM-as-judge: powerful, and quietly fallible

Most of these generation metrics are computed by asking another language model to judge the answer — the "LLM-as-judge" pattern. It is genuinely useful: it scales to thousands of examples, it is fast, and it correlates reasonably well with human judgement on many tasks. Without it, reference-free evaluation would not be practical at all.

But treat the judge as an instrument that needs calibrating, not an oracle. LLM judges carry known biases: they tend to prefer longer answers, they can favour the style of the model family they belong to, and their scores drift when you change the judge model or even the prompt. The honest practice is to periodically check the judge against human ratings on a sample, report scores as trends rather than absolute truths, and never let a single automated number gate a release on its own. A judge that is never audited is just a second model you have chosen to trust blindly.
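Checking the judge against human ratings can be as simple as measuring agreement on a labelled sample. The sketch below uses Cohen's kappa, which corrects raw agreement for chance; choosing kappa is an assumption on my part, and the labels and sample are invented for illustration.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[lbl] * human_counts[lbl] for lbl in judge_counts) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical audit sample: the judge disagrees with the human on one of six answers.
judge = ["pass", "pass", "fail", "pass", "fail", "pass"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(judge, human), 2))  # 0.67
```

Tracking this figure over time, and rerunning it whenever the judge model or prompt changes, is one way to notice drift before it distorts your scores.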

"A RAG system has two engines, and a bad answer can come from either; evaluation tells you which one." (Crux Digits)

The unglamorous core: a golden test set

Here is the part teams skip, and it is the part that matters most. The retrieval metrics above need questions paired with the passages that should be retrieved, and any reference-based check needs the right answers as well. That pairing is a golden test set. Without it you are not evaluating; you are guessing. This is the real work of RAG evaluation, and it is where the leverage is.

A good test set is drawn from real usage, not invented at a desk. Pull actual questions your users ask, including the messy, ambiguous, and out-of-scope ones. Cover the hard cases deliberately: questions whose answer spans multiple documents, questions with no answer in the corpus (the system should refuse, not fabricate), near-duplicate questions with different correct answers, and time-sensitive questions where the right passage changed. Fifty carefully chosen, well-labelled questions beat five hundred lazy ones. Building and maintaining that set is a data problem as much as an ML one, and it is closely tied to the data engineering that feeds the pipeline in the first place.

Keep the set versioned and stable. The whole point is to run the same questions after every change — a new chunking strategy, a new embedding model, a new prompt — and watch the numbers move. If the test set drifts every week, you lose your baseline and with it your ability to say whether anything actually improved.
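A golden set can be a very small data structure. The sketch below is one possible shape, with a content fingerprint so that every evaluation run can record exactly which version of the set it used. The field names, tags, and example questions are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class GoldenExample:
    question: str
    reference_answer: Optional[str]             # None means "no answer in the corpus"
    relevant_passage_ids: Tuple[str, ...] = ()  # passages the retriever should find
    tags: Tuple[str, ...] = ()                  # e.g. "multi-document", "out-of-scope"


def fingerprint(examples: Sequence[GoldenExample]) -> str:
    """Short, stable hash of the set's content, suitable for logging with each run."""
    payload = json.dumps([asdict(e) for e in examples], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


golden_set = [
    GoldenExample("How long is the warranty?", "24 months", ("doc-12#3",), ("simple",)),
    GoldenExample("What is the CEO's shoe size?", None, (), ("out-of-scope",)),
]
print(len(golden_set), fingerprint(golden_set))
```

If two runs report different fingerprints, their scores are not directly comparable.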

Diagnose before you fix

Once you have metrics on both halves, diagnosis becomes almost mechanical. Take a failing answer and ask two questions in order. Was the answer-bearing passage in the retrieved context? If no, it is a retrieval problem — stop thinking about the prompt. If yes, did the model use it correctly? If it ignored or distorted a passage that was right there, it is a generation problem.
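The two-question rule translates almost directly into code. This sketch assumes you already have the labelled passage IDs for the question and a faithfulness verdict for the answer; the function name and return labels are illustrative.

```python
def diagnose(relevant_passage_ids, retrieved_ids, answer_is_faithful):
    """Classify a failing answer as a retrieval problem, a generation problem, or other."""
    # Question 1: did any answer-bearing passage reach the context?
    if not set(relevant_passage_ids) & set(retrieved_ids):
        return "retrieval"
    # Question 2: given the passage was there, did the model use it correctly?
    if not answer_is_faithful:
        return "generation"
    # Passage present and answer faithful: look elsewhere (relevance, labels, tone).
    return "other"


print(diagnose(["p1"], ["p2", "p3"], answer_is_faithful=True))   # retrieval
print(diagnose(["p1"], ["p1", "p3"], answer_is_faithful=False))  # generation
```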

This simple decision rule saves enormous amounts of wasted effort. The most common mistake we see is teams rewriting prompts for weeks to fix answers that were doomed at the retrieval step, because the correct chunk never made it into the context window. Measure recall first, and you will know within an afternoon whether you are tuning the retriever or the generator.

Improving retrieval

When the numbers point at retrieval, a handful of levers move recall and ranking the most:

Chunking. Chunks that are too large bury the relevant sentence in noise and hurt precision; chunks that are too small lose the context that makes a passage meaningful. Chunking on semantic or structural boundaries (sections, clauses, headings) usually beats fixed-size splitting.
Hybrid search. Pure vector search misses exact terms — product codes, names, acronyms — that a keyword search catches instantly. Combining dense (embedding) and sparse (keyword/BM25) retrieval lifts recall on the queries semantic search quietly fails.
Re-ranking. A cross-encoder re-ranker re-scores the top candidates with far more precision than the first-pass retriever. If your recall is high but MRR is low, this is often the single highest-return change.
Metadata and filtering. Filtering by document type, date, or department before ranking removes whole classes of wrong-but-similar passages, and it is cheap to add.

These changes compound, but only measurement tells you which one earned its keep. Add a re-ranker, rerun the golden set, and keep it only if MRR actually moved.
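For the hybrid search lever, one common way to combine a dense ranking with a keyword ranking is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so RRF is a choice made here for illustration, as are the constant k=60 (a frequently used default) and the passage IDs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked ID lists; each appearance adds 1 / (k + rank) to the score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the output deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from a vector index and a BM25 index.
dense = ["c1", "c2", "c3"]
sparse = ["c3", "c1", "c4"]
print(reciprocal_rank_fusion([dense, sparse]))  # ['c1', 'c3', 'c2', 'c4']
```

Passages that both retrievers agree on rise to the top, which is why fusion tends to help on queries containing exact terms that embeddings miss.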

Improving generation

When retrieval is solid but answers are still weak, the fixes live on the generation side. Tighten the prompt so the model is explicitly instructed to answer only from the provided context and to say "I don't know" when the context does not contain the answer — refusal is a feature, not a failure. Make the grounding visible by asking for citations to the passages used, which both improves faithfulness and gives users a way to verify. And test the refusal path deliberately: a RAG system that confidently answers questions it has no source for is worse than one that admits the gap. If you are building an assistant over an organisation's knowledge base, our write-up on RAG assistants over a firm's knowledge base covers how these grounding choices play out in practice.
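A grounded prompt with an explicit refusal path and citation request might look like the sketch below. The wording is only one possible phrasing and should be tuned and tested against your own golden set; the function name and passage text are hypothetical.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that restricts the model to the passages and asks for citations."""
    numbered = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below.\n"
        "Cite the passages you used, for example [1].\n"
        'If the passages do not contain the answer, reply exactly: "I don\'t know."\n\n'
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long is the warranty?",
    ["The warranty lasts 24 months.", "Returns are accepted within 30 days."],
)
print(prompt)
```

Numbering the passages gives the model something concrete to cite and gives you something concrete to check when scoring faithfulness.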

From offline test set to production monitoring

A golden test set tells you how the system performs on the questions you thought to include. Production tells you how it performs on the questions you never imagined. Both matter, and mature RAG evaluation runs on two clocks: offline evaluation on the fixed test set before every release, and online evaluation on live traffic after it.

Offline evaluation is your regression gate. Wire the golden set into your pipeline so that any change — a new embedding model, a re-chunk, a prompt tweak — automatically reruns the benchmark and blocks a release if faithfulness or recall drops below an agreed threshold. Treat a quality regression exactly like a failing unit test: something to fix before shipping, not a surprise you discover from user complaints two weeks later.
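A regression gate can be a small function that compares the latest scores with agreed minimums. The metric names and threshold values below are hypothetical placeholders; the right numbers depend on your risk profile.

```python
def check_regression(metrics, thresholds):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from evaluation results")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} is below threshold {minimum:.3f}")
    return failures


# Hypothetical thresholds and a run where recall has slipped.
thresholds = {"faithfulness": 0.90, "recall_at_5": 0.85}
metrics = {"faithfulness": 0.93, "recall_at_5": 0.81}
failures = check_regression(metrics, thresholds)
for line in failures:
    print("FAIL", line)  # FAIL recall_at_5: 0.810 is below threshold 0.850
```

In a CI job, a non-empty failure list would translate into a non-zero exit code so the release is blocked.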

Online evaluation is different in kind. Here you cannot label every answer, so you lean on cheaper signals: thumbs-up and thumbs-down feedback, whether users rephrase the same question (a sign the first answer missed), how often the system refuses, and periodic LLM-as-judge scoring on a sample of real conversations. These signals surface the failure modes your test set does not cover — new topics, shifting user intent, and the slow drift that creeps in as your underlying documents change. When the live signals dip, you feed the newly discovered hard questions back into the golden set, and the two clocks reinforce each other. A test set that never grows from production is a test set slowly going stale.
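The cheaper online signals can be aggregated from interaction logs. The sketch below assumes a simple log record with three fields; the record shape, field names, and sample data are all hypothetical, and real logs will need their own rule for detecting a rephrased question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    feedback: Optional[str]  # "up", "down", or None when the user gave no rating
    refused: bool            # the system declined to answer
    rephrased_next: bool     # the user's next message restated the same question


def online_signals(interactions):
    """Aggregate rates over a window of logged interactions."""
    n = len(interactions)
    if n == 0:
        raise ValueError("no interactions in the window")
    rated = [i for i in interactions if i.feedback in ("up", "down")]
    return {
        # Only rated interactions count towards the thumbs-down rate.
        "thumbs_down_rate": sum(i.feedback == "down" for i in rated) / len(rated) if rated else None,
        "refusal_rate": sum(i.refused for i in interactions) / n,
        "rephrase_rate": sum(i.rephrased_next for i in interactions) / n,
    }


window = [
    Interaction("up", False, False),
    Interaction("down", False, True),
    Interaction(None, True, False),
    Interaction("up", False, False),
]
signals = online_signals(window)
print(signals["refusal_rate"], signals["rephrase_rate"])  # 0.25 0.25
```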

The practical discipline is to set explicit targets rather than chase a perfect score. Decide what faithfulness and recall levels are good enough for your risk profile, alert when you fall below them, and spend your improvement effort on the questions that actually fail — not on squeezing another point out of cases that already work.

Be honest about the limits of automated metrics

Automated evaluation is essential, but it is not the whole truth, and a good practice says so. The metrics measure proxies — faithfulness to retrieved text, not real-world correctness; relevance to the query, not usefulness to the human. A system can score well and still frustrate users because the answers, while accurate, are the wrong length, tone, or level of detail. And LLM judges, as noted, have their own blind spots.

So keep a human in the loop where it counts. Automated metrics are for catching regressions fast and comparing options at scale; periodic human review is for judging whether the system is genuinely helpful. The two are complements, not substitutes. Any vendor who tells you a single dashboard number proves their RAG system is "95% accurate" is selling you a proxy dressed up as a guarantee.

Evaluation is also governance

For European organisations there is a compliance dimension that turns evaluation from good practice into due diligence. The EU AI Act places obligations around accuracy, robustness, and record-keeping on high-risk AI systems, and you cannot demonstrate accuracy you have never measured. A documented evaluation process (a versioned test set, tracked metrics, a record of what changed and what it did to quality) is exactly the kind of evidence that framework expects. The same logic underpins the Measure function of the NIST AI Risk Management Framework, which treats measurable, testable performance as a precondition for trustworthy AI. Evaluation is not just how you build a better assistant; it is how you prove it is fit to deploy.

None of this requires a research lab. It requires the discipline to measure the right things and the honesty to act on what the numbers say. That is the approach we take on every RAG build — you can see how it plays out in our case studies, or book a free consultation and we will map the evaluation your system actually needs. If your RAG assistant is giving answers you cannot fully trust, the path forward is not a cleverer prompt. It is measurement.

Frequently asked questions

What is RAG evaluation?

RAG evaluation is the systematic measurement of how well a retrieval-augmented generation system answers questions. It splits into two parts: retrieval quality (did the system fetch the right passages, measured with recall@k, precision@k, MRR and nDCG) and generation quality (did the model answer faithfully and relevantly, measured with faithfulness, answer relevance and context precision/recall). Measuring both separately tells you exactly where a bad answer came from.

How do I know if the problem is retrieval or generation?

Take a failing answer and ask two questions in order. First, was the passage that contains the answer present in the retrieved context? If not, it is a retrieval problem and no prompt change will fix it. If the passage was there but the model ignored or distorted it, it is a generation problem. Measuring recall first usually settles the question within an afternoon and prevents weeks of tuning the wrong half.

Is LLM-as-judge reliable for scoring RAG answers?

It is useful but not infallible. LLM judges scale to thousands of examples and correlate reasonably with human ratings, which makes reference-free evaluation practical. But they carry biases — a tendency to prefer longer answers, a leaning toward their own model family, and score drift when you change the judge or prompt. Calibrate the judge against human ratings on a sample, treat scores as trends, and never let a single automated number gate a release alone.

Why does a RAG system need a golden test set?

Because the retrieval metrics, and any check that compares against a reference, need questions paired with the right answers and source passages to score against. A golden test set drawn from real user questions (including hard, ambiguous and out-of-scope cases) lets you run the same benchmark after every change and see whether quality actually improved. Fifty carefully labelled questions beat five hundred lazy ones, and keeping the set versioned preserves the baseline you compare against.
