LLM Evaluation: How to Evaluate LLM Apps

LLM evaluation is how you find out whether a change to your language-model feature actually made it better — or quietly made it worse. If you ship prompt tweaks, swap models, or add a tool and judge the result by eyeballing a few outputs, you are not evaluating; you are guessing. This guide shows how to build a practical evaluation harness for LLM applications: task-specific metrics, offline evals, using LLM-as-judge without being fooled by it, wiring evals into your CI as regression tests, and guarding against the prompt and model drift that degrades systems silently over time.

A quick scoping note. If your system is retrieval-augmented, the retrieval half needs its own metrics — recall@k, faithfulness, context precision — which we cover in depth in our guide to RAG evaluation and retrieval quality. This piece is about the broader problem: evaluating any LLM-powered feature, RAG or not, as a piece of software you have to trust in production.

Why "it looks better" is not evaluation

LLMs are non-deterministic. Ask the same question twice and you can get two different answers, both plausible. That single fact breaks the way engineers usually verify software. You cannot run it once, see a good output, and conclude it works — the next user, the next phrasing, or the next model update can produce something entirely different. And humans are terrible at fair sampling: we remember the demo that dazzled and forget the three that flopped.

So "it looks better after my change" is not a finding. It is a vibe. Real evaluation replaces the vibe with a repeatable measurement: a fixed set of inputs, a defined notion of a good output, and a score that moves when quality moves. Without that, every prompt change is a coin flip you cannot see the result of, and every model upgrade is a leap of faith. This is the discipline behind every LLM optimisation engagement we run — you do not tune what you cannot measure.

Start from the task, not the metric

There is no single "LLM score." What good looks like depends entirely on what the feature does, so the first job is to name the task and derive metrics from it. Most LLM features fall into a handful of shapes, each with its own way of being measured.

Classification and routing

If the model sorts inputs into categories — intent detection, ticket routing, sentiment, moderation — you are in classic machine-learning territory and the classic metrics apply. Accuracy, precision, recall, and a confusion matrix tell you exactly where it goes wrong: which categories bleed into which. These are cheap to compute and leave little room for argument, so build them first wherever your feature has discrete correct answers.
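As a rough illustration, the metrics above need nothing beyond the standard library. The sketch below assumes you already have gold labels and the model's predicted labels as two parallel lists; the ticket-routing labels are invented for the example.

```python
from collections import Counter

def confusion_counts(gold, predicted):
    """Count how often each (gold_label, predicted_label) pair occurs."""
    return Counter(zip(gold, predicted))

def per_class_precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for every label seen in either list."""
    counts = confusion_counts(gold, predicted)
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        true_pos = counts[(label, label)]
        predicted_as = sum(n for (g, p), n in counts.items() if p == label)
        actually = sum(n for (g, p), n in counts.items() if g == label)
        precision = true_pos / predicted_as if predicted_as else 0.0
        recall = true_pos / actually if actually else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical ticket-routing labels, for illustration only.
gold = ["billing", "billing", "bug", "bug", "sales"]
predicted = ["billing", "bug", "bug", "bug", "sales"]

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
report = per_class_precision_recall(gold, predicted)
confusion = confusion_counts(gold, predicted)
```

Here the overall accuracy is 0.8, but the per-class view shows the actual problem: "billing" has a recall of only 0.5 because one billing ticket was routed to "bug", which is the kind of bleed a single accuracy figure hides.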

Extraction and structured output

When the model pulls fields out of text or emits JSON, evaluate at the field level. Does the output parse against the expected schema at all? For each field, does the extracted value match the gold value, exactly or within a tolerance? Schema-validity rate and per-field accuracy catch the failure that a single overall score hides — a response that looks right but silently drops one field in twenty.
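A minimal sketch of field-level scoring might look like the following. The invoice schema, the field names, and the numeric tolerance are all assumptions made for the example; an output that fails the schema check is counted as wrong on every field.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"invoice_id": str, "total": (int, float), "currency": str}

def parse_valid(raw):
    """Return the parsed dict if raw is JSON with every required typed field, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected_type):
            return None
    return data

def field_matches(gold_value, value, tolerance=0.01):
    """Exact match for text, match within a tolerance for numbers."""
    if isinstance(gold_value, (int, float)) and isinstance(value, (int, float)):
        return abs(gold_value - value) <= tolerance
    return gold_value == value

def score_extraction(raw_outputs, gold_records):
    """Return (schema_validity_rate, {field: accuracy}) over the whole dataset."""
    valid = 0
    correct = {name: 0 for name in REQUIRED_FIELDS}
    for raw, gold in zip(raw_outputs, gold_records):
        data = parse_valid(raw)
        if data is None:
            continue  # invalid output scores zero on every field
        valid += 1
        for name in REQUIRED_FIELDS:
            if field_matches(gold[name], data[name]):
                correct[name] += 1
    total = len(gold_records)
    return valid / total, {name: n / total for name, n in correct.items()}

raw_outputs = [
    '{"invoice_id": "A-1", "total": 120.0, "currency": "EUR"}',
    '{"invoice_id": "A-2", "total": 75.5, "currency": "USD"}',  # wrong currency
    '{"invoice_id": "A-3", "total": 10}',                        # missing field
    "not json",
]
gold_records = [
    {"invoice_id": "A-1", "total": 120.0, "currency": "EUR"},
    {"invoice_id": "A-2", "total": 75.5, "currency": "EUR"},
    {"invoice_id": "A-3", "total": 10.0, "currency": "EUR"},
    {"invoice_id": "A-4", "total": 5.0, "currency": "EUR"},
]
validity_rate, field_accuracy = score_extraction(raw_outputs, gold_records)
```

On this toy data the validity rate is 0.5 and the "currency" field scores 0.25 while the others score 0.5, so the report points at one field rather than at the feature as a whole.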

Summarisation and open-ended generation

This is the hard case, because there is no single correct answer. Here you lean on faithfulness (does the summary invent anything not in the source?), relevance (does it cover what matters?), and rubric-based scoring against explicit criteria you define — tone, completeness, format. Reference-based string metrics like BLEU or ROUGE are weak proxies for open generation; a criteria rubric graded consistently is far more informative.
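One way to make a rubric concrete is to write the criteria down as data, build the grading prompt from them, and reject any grader reply that does not fit. The sketch below is illustrative only: the three criteria, the 1 to 5 scale, and the JSON reply format are assumptions, and the call to the grading model itself is deliberately left out.

```python
import json

# Hypothetical rubric: criterion name -> what the grader should check.
RUBRIC = {
    "faithfulness": "Every claim in the summary is supported by the source.",
    "relevance": "The summary covers the points that matter most.",
    "format": "The summary is at most three sentences.",
}

def build_judge_prompt(source, summary):
    """Build a grading prompt that lists every rubric criterion explicitly."""
    criteria = "\n".join(f"- {name}: {text}" for name, text in RUBRIC.items())
    return (
        "Grade the summary against each criterion with an integer from 1 to 5.\n"
        f"Criteria:\n{criteria}\n\nSource:\n{source}\n\nSummary:\n{summary}\n\n"
        'Reply with JSON only, for example {"faithfulness": 4, "relevance": 5, "format": 3}.'
    )

def parse_grades(judge_reply):
    """Return {criterion: score}, or None if the reply does not fit the rubric."""
    try:
        grades = json.loads(judge_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(grades, dict) or set(grades) != set(RUBRIC):
        return None
    if not all(isinstance(v, int) and 1 <= v <= 5 for v in grades.values()):
        return None
    return grades

prompt = build_judge_prompt("The source text.", "A candidate summary.")
good = parse_grades('{"faithfulness": 5, "relevance": 4, "format": 2}')
unusable = parse_grades("Looks fine to me.")
```

Keeping the scores per criterion, rather than averaging them immediately, is what lets a later regression point at tone or completeness specifically.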

Agents and tool use

For anything that takes actions — calling tools, querying APIs, executing multi-step plans — the metric that matters is task success: did it accomplish the goal? Underneath that, measure step-level correctness and tool-call accuracy (did it pick the right tool with the right arguments?). An agent can produce a fluent final message while having taken a wrong action three steps back, so evaluate the trajectory, not just the last sentence.
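A trajectory check can be sketched as a step-by-step comparison of the tool calls the agent made against the calls you expected. The tool names and arguments below are hypothetical, and comparing position by position is a simplifying assumption: an agent with several valid routes to the goal would need a looser match.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: tuple  # sorted (name, value) pairs, so calls compare by value

def call(tool, **args):
    """Convenience constructor for a ToolCall."""
    return ToolCall(tool, tuple(sorted(args.items())))

def score_trajectory(expected, actual, goal_reached):
    """Score tool choice and full-call correctness, and locate the first wrong step."""
    steps = max(len(expected), len(actual))
    pairs = list(zip(expected, actual))
    right_tool = sum(e.tool == a.tool for e, a in pairs)
    exact = sum(e == a for e, a in pairs)
    first_error = next((i for i, (e, a) in enumerate(pairs) if e != a), None)
    if first_error is None and len(expected) != len(actual):
        first_error = min(len(expected), len(actual))  # missing or extra step
    return {
        "task_success": goal_reached,
        "tool_accuracy": right_tool / steps if steps else 1.0,
        "call_accuracy": exact / steps if steps else 1.0,
        "first_wrong_step": first_error,
    }

# Hypothetical support agent: right tools throughout, wrong argument at step 1.
expected = [
    call("lookup_order", order_id="42"),
    call("check_refund_policy", region="EU"),
    call("send_reply", template="refund_ok"),
]
actual = [
    call("lookup_order", order_id="42"),
    call("check_refund_policy", region="US"),
    call("send_reply", template="refund_ok"),
]
result = score_trajectory(expected, actual, goal_reached=True)
```

In this example the agent sent a reply and picked the right tool at every step, yet `first_wrong_step` is 1 because it checked the policy for the wrong region, an error the final message alone would not reveal.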

Building the evaluation harness

An eval harness has three parts, and none of them is exotic. First, a dataset: a set of inputs paired with the expected output or the criteria for a good one. Second, one or more metrics appropriate to the task. Third, a runner that feeds every input through the current version of your system, scores the outputs, and reports the aggregate. Run it, change something, run it again, compare. That loop is the entire game.

You do not have to build the runner from scratch. Open-source frameworks such as OpenAI Evals and EleutherAI's lm-evaluation-harness give you reusable task templates and consistent scoring protocols, and OpenAI Evals also supports model-graded evals, so you can stand up a task-specific eval quickly and keep results comparable across runs. The dataset is where your real effort goes: draw it from actual usage, cover the hard and ambiguous cases deliberately, and keep it versioned so today's score is comparable to last month's. Fifty well-chosen examples beat five hundred lazy ones; the same principle that governs a good RAG test set applies to every LLM task.
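To show how small the three parts can be, here is a minimal sketch of a dataset, a metric, and a runner. Everything in it is illustrative: `fake_system` stands in for your real LLM call, and `exact_match` is the simplest possible metric, suitable only where answers are discrete.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

@dataclass
class Example:
    input: str
    expected: str

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer, ignoring case and spacing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

def run_eval(system: Callable[[str], str], dataset, metric) -> dict:
    """Feed every input through the system, score each output, report the aggregate."""
    scores = [metric(system(ex.input), ex.expected) for ex in dataset]
    return {"n": len(scores), "mean_score": mean(scores), "scores": scores}

def fake_system(text: str) -> str:
    """Stand-in for the real model call (an assumption for this sketch)."""
    return "refund" if "money back" in text else "other"

dataset = [
    Example("I want my money back", "refund"),
    Example("Where is my parcel?", "shipping"),
    Example("money back please", "Refund"),
]
report = run_eval(fake_system, dataset, exact_match)
```

Keeping the per-example scores in the report, not just the mean, is what makes the "run it again, compare" step useful: you can see which inputs changed, not only that the average moved.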

LLM-as-judge, used carefully

For open-ended tasks you often cannot score outputs with a simple string match, so you ask another model to grade them — the LLM-as-judge pattern. It is genuinely useful and it scales, but it is an instrument that needs calibrating, not an oracle. Judges tend to prefer longer answers, can favour their own model family's style, and drift when you change the judge model or its prompt. We go deeper on this in the RAG evaluation guide; the short rule for any LLM app is the same — periodically check the judge against human ratings on a sample, report scores as trends rather than absolutes, and never let a single automated number gate a release on its own.
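Checking the judge against human ratings can be as simple as measuring agreement on a shared sample. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance; the pass/fail labels are invented, and kappa is one reasonable choice of statistic rather than the only one.

```python
from collections import Counter

def raw_agreement(human, judge):
    """Fraction of items where the judge gave the same label as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human, judge):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = raw_agreement(human, judge)
    human_freq, judge_freq = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of the same eight outputs.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "pass", "fail", "pass", "fail", "pass", "pass"]

agreement = raw_agreement(human, judge)
kappa = cohens_kappa(human, judge)
# Count the cases where the judge passed something a human failed.
judge_too_lenient = sum(h == "fail" and j == "pass" for h, j in zip(human, judge))
```

With these ratings the judge agrees with humans 75% of the time, but kappa is only 0.5, and both disagreements are cases where the judge passed an output a human failed. That direction of error is worth knowing before you let the judge's score influence a release.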

Turn evals into regression tests

Here is where LLM evaluation stops being a research activity and becomes engineering. An eval you run by hand once is interesting; an eval that runs automatically on every change is a safety net. Wire your eval suite into CI so that a pull request touching a prompt, a model choice, or a tool definition reruns the evals and blocks the merge if a key metric regresses past an agreed threshold. Treat a quality drop exactly like a failing unit test — something to fix before shipping, not a surprise you learn about from users.

This is eval-driven development, and mature teams run it the way they run test suites: a fast smoke set on every commit, a fuller battery nightly, and a hard gate before release. Running a battery of evals against every new model checkpoint to catch regressions before they reach users is now standard practice, not a luxury. Once evals are a gate rather than a report, prompt engineering becomes a controlled, measurable process instead of a nervous game of whack-a-mole — and the whole team can change the system with confidence because the harness will catch what they break.
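A CI gate along these lines can be a short script that compares the current run's metrics with a stored baseline and returns a non-zero exit code on regression. This is a sketch under assumptions: the metric names, the JSON layout, and the 0.02 margin are illustrative, and how the baseline is stored and updated is left to your pipeline.

```python
import json
import sys

def find_regressions(baseline, current, max_drop):
    """Return {metric: (old, new)} for every metric that fell by more than max_drop."""
    return {
        name: (old, current.get(name, 0.0))
        for name, old in baseline.items()
        if old - current.get(name, 0.0) > max_drop
    }

def gate(baseline_json, current_json, max_drop=0.02):
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    regressions = find_regressions(
        json.loads(baseline_json), json.loads(current_json), max_drop
    )
    for name, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {name}: {old:.3f} -> {new:.3f}", file=sys.stderr)
    return 1 if regressions else 0

# Hypothetical metric files, inlined as strings for the example.
baseline_json = '{"factual": 0.91, "tone": 0.88}'
current_json = '{"factual": 0.92, "tone": 0.83}'
exit_code = gate(baseline_json, current_json)
```

In a pipeline the last step would be `sys.exit(gate(...))`, so the job fails exactly as a failing unit test would. A metric missing from the current run is treated as zero here, which makes a silently skipped eval fail loudly instead of passing.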

A worked example: an email-reply assistant

Abstract advice is easy to nod along to and hard to apply, so make it concrete. Say you are building a feature that drafts replies to inbound customer emails. What does evaluating it actually look like?

Start with the dataset. Pull fifty real inbound emails from your history, spanning the range you actually receive: simple questions, angry complaints, multi-part requests, off-topic messages, and a few with no reasonable answer. For each, write down what a good reply must do — not the exact words, but the criteria: answer every question asked, get the facts right, strike the right tone, and never promise something you cannot deliver.

Now pick metrics per criterion. Factual correctness and "did it address every question" are well suited to an LLM-as-judge grading against a rubric. Tone can be a judge score too, calibrated against a handful of human ratings. A hard rule — "must not invent a refund policy" — can be a deterministic check that scans for forbidden claims. Run all fifty through the current prompt, score them, and you have a baseline. Change the prompt to fix the angry-complaint cases, rerun, and you can see instantly whether you helped those without hurting the simple ones. Wire that suite into CI with a threshold — say, no metric may drop more than two points — and every future change is guarded automatically. That is the whole method, and it scales from one feature to a hundred.
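The deterministic check is the easiest piece to sketch. The patterns below are hypothetical examples of claims a reply must not make; a real list would come from your own policies, and a pattern scan will miss paraphrases, so it complements the judge rather than replacing it.

```python
import re

# Hypothetical forbidden claims for the email-reply assistant.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bfull refund\b", re.IGNORECASE),
    re.compile(r"\bguarantee(d|s)?\b", re.IGNORECASE),
    re.compile(r"\bwithin \d+ (hours|days)\b", re.IGNORECASE),
]

def forbidden_claims(reply):
    """Return every forbidden phrase found in a drafted reply, grouped by pattern."""
    return [
        match.group(0)
        for pattern in FORBIDDEN_PATTERNS
        for match in pattern.finditer(reply)
    ]

def passes_hard_rules(reply):
    """True when the reply contains none of the forbidden claims."""
    return not forbidden_claims(reply)

risky = "We guarantee a full refund within 3 days."
safe = "Thanks for your message; I have passed this to our billing team."

found = forbidden_claims(risky)
```

Because the check is deterministic, it can run on all fifty emails on every commit at effectively no cost, and a failure names the exact phrase that triggered it.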

Common mistakes to avoid

Evaluating on the happy path only. A dataset of easy questions proves nothing. The hard, ambiguous, and out-of-scope cases are where features fail and where evaluation earns its keep.
One overall score. A single number hides which task or category regressed. Score per task type and per criterion so a failure points at its cause.
Trusting the judge blindly. An uncalibrated LLM judge can drift or reward the wrong things. Check it against human ratings periodically.
A frozen dataset. An eval set that never absorbs new production failures slowly stops reflecting reality.
Chasing a perfect score. Set targets that fit your risk profile and spend effort on the cases that actually fail, not on squeezing points from cases that already work.

The drift problem: your system changes when you don't touch it

The most unsettling failure mode in LLM applications is the one where nothing in your code changed and quality dropped anyway. LLM systems drift, and there are three sources worth watching.

Prompt drift. As prompts accumulate edits from different people, they pick up contradictions and dead instructions that quietly degrade output. Evals catch this the moment a change lowers a score.
Model version drift. Providers update hosted models — weights, safety policies, serving infrastructure — often without changing the API endpoint. Documented incidents from major providers show these silent updates can shift behaviour, reduce task accuracy, and increase refusals on requests that used to work. Your code is identical; the model underneath is not.
Data drift. The inputs your users send change over time — new topics, new phrasing, new edge cases your original dataset never saw.

The defences are practical. Pin to dated model snapshots in production rather than a floating "latest" alias, so an upgrade is a decision you make and evaluate, not one that happens to you. Keep a canary running on the newest model against your eval suite, so you can diff its behaviour before you promote it. And feed newly discovered failures from production back into your dataset so the harness keeps pace with reality. Choosing which model to depend on in the first place is its own decision — our comparison of OpenAI vs Anthropic vs open-source LLMs lays out the trade-offs, including the control that self-hosting gives you over exactly this kind of silent change.
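Diffing a canary against the pinned model is more informative per example than in aggregate. The sketch below assumes you have already scored the same eval set under both models; the model identifiers are placeholders, not real snapshot names, and the 0.1 change threshold is arbitrary.

```python
from statistics import mean

PINNED_MODEL = "example-model-2024-01-15"  # hypothetical dated snapshot
CANARY_MODEL = "example-model-latest"      # hypothetical floating alias

def canary_diff(example_ids, pinned_scores, canary_scores, min_change=0.1):
    """Compare per-example scores and list the examples whose score moved."""
    changed = []
    for ex_id in example_ids:
        delta = canary_scores[ex_id] - pinned_scores[ex_id]
        if abs(delta) >= min_change:
            changed.append((ex_id, round(delta, 3)))
    return {
        "pinned_mean": mean(pinned_scores[i] for i in example_ids),
        "canary_mean": mean(canary_scores[i] for i in example_ids),
        "regressed": [c for c in changed if c[1] < 0],
        "improved": [c for c in changed if c[1] > 0],
    }

# Hypothetical per-example scores from the same eval suite.
example_ids = ["e1", "e2", "e3", "e4"]
pinned_scores = {"e1": 1.0, "e2": 1.0, "e3": 0.5, "e4": 0.0}
canary_scores = {"e1": 1.0, "e2": 0.0, "e3": 0.5, "e4": 1.0}

diff = canary_diff(example_ids, pinned_scores, canary_scores)
```

In this toy case both models average 0.625, so an aggregate comparison would report no change, yet one example regressed completely and another improved. That is the behaviour shift you want to see before promoting the newer model.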

Set thresholds you can defend

A score on its own does not tell you whether to ship. You need a decision rule, and that means agreeing in advance what "good enough" is for each metric — before you see the numbers, so the target is not quietly moved to match whatever the model happened to produce. Tie the threshold to the risk of the task: a feature that drafts internal notes can tolerate a lower bar than one that quotes prices to customers or touches anything regulated.

Think in error budgets rather than perfection. Perfect scores are usually a sign your dataset is too easy, not that your system is flawless, so decide instead how much regression you will accept on any single metric and how much aggregate quality you require to release. A common, defensible rule is that no key metric may drop by more than a small fixed margin against the current baseline, and every headline metric must stay above its floor. Report the numbers as a short trend line the whole team can read, not a wall of figures: the point of evaluation is a decision, and a threshold turns a score into one. Spend your improvement effort on the cases that fall below the line, and leave the ones already clearing it alone.

Beyond offline: evaluate in production too

Offline evals on a fixed dataset tell you how the system performs on the cases you anticipated. Production tells you how it performs on the ones you did not. Mature LLM evaluation runs on both. In production you cannot label every response, so you lean on cheaper signals: explicit thumbs-up and thumbs-down, implicit signals like whether a user rephrases or abandons, task-completion and conversion rates for the workflows the feature supports, and periodic LLM-as-judge scoring on a sample of real traffic.

Two techniques bridge offline and live. Shadow evaluation runs a new prompt or model on real inputs without showing users the result, so you can compare it against the current system on live traffic risk-free. A/B testing then exposes the change to a slice of users and measures the outcome that actually matters to the business. When live signals dip, the newly surfaced hard cases go back into your offline dataset, and the two loops reinforce each other. An eval set that never learns from production slowly goes stale.
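Shadow evaluation can be sketched as a request handler that always returns the current system's answer and records the candidate's answer on the side. The two systems below are stand-ins; in a real service the candidate call would normally run asynchronously so it adds no latency, which this simplified version does not attempt.

```python
def handle_request(user_input, current_system, candidate_system, shadow_log):
    """Serve the current system's reply; record the candidate's reply without showing it."""
    response = current_system(user_input)
    record = {"input": user_input, "current": response}
    try:
        record["candidate"] = candidate_system(user_input)
    except Exception as exc:  # a broken candidate must never affect the user
        record["candidate_error"] = repr(exc)
    shadow_log.append(record)
    return response

def current_system(text):
    """Stand-in for the production prompt and model."""
    return f"v1 reply to: {text}"

def candidate_system(text):
    """Stand-in for the new prompt or model, with a deliberate failure case."""
    if "refund" in text:
        raise ValueError("candidate crashed")
    return f"v2 reply to: {text}"

shadow_log = []
first = handle_request("Where is my order?", current_system, candidate_system, shadow_log)
second = handle_request("I want a refund", current_system, candidate_system, shadow_log)
```

The log then holds paired outputs on real inputs, which you can score offline with the same metrics as your eval suite. Candidate failures are recorded as data rather than surfacing to users.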

Keep humans where they count

Automated evaluation is essential for speed and scale, but it measures proxies, not ground truth. A feature can score well and still frustrate users because the outputs are the wrong length, tone, or level of detail — and LLM judges have their own blind spots. So keep periodic human review in the loop for the judgement calls that automation cannot make, and use the automated harness for what it is good at: catching regressions fast and comparing options at scale. The two are complements. Any vendor who claims a single dashboard number proves their LLM feature is "accurate" is selling a proxy dressed up as a guarantee.

Evaluation is also governance

For European organisations, evaluation is not only good engineering; it is increasingly due diligence. The EU AI Act places obligations around accuracy, robustness, and record-keeping on high-risk AI systems, and you cannot demonstrate accuracy you have never measured. A documented evaluation process (a versioned dataset, tracked metrics, a record of what changed and what it did to quality) is exactly the evidence such obligations expect. The same logic underpins the Measure function of the NIST AI Risk Management Framework, which treats measurable, testable performance as a precondition for trustworthy AI. Evaluation is how you turn "the model seems fine" into something you can stand behind.

None of this requires a research lab — just the discipline to measure the right things and act on the numbers. That is how we approach every LLM build: see how it works in our model comparison, review our transparent pricing, or book a free consultation and we will map the evaluation your application actually needs. If you cannot yet say, with numbers, whether your last change helped or hurt, that is the place to start.

Frequently asked questions

What is LLM evaluation?

LLM evaluation is the systematic measurement of whether an LLM-powered feature performs well and whether a change improved or degraded it. Because language models are non-deterministic, you cannot verify quality by eyeballing a few outputs; you need a fixed dataset of inputs, task-specific metrics, and a repeatable score. It spans offline evals on a curated dataset and online evaluation on live production traffic.

How is LLM evaluation different from RAG evaluation?

RAG evaluation is a specialised subset focused on retrieval-augmented systems, where you also measure the retrieval step with metrics like recall@k, MRR, and context precision. LLM evaluation is the broader discipline of measuring any LLM feature — classification, extraction, summarisation, or agents — with or without retrieval. If your system uses retrieval, you need both: the general LLM eval plus the retrieval-specific metrics covered in our RAG evaluation guide.

Why should evals run in CI?

Because an eval you run by hand once only tells you about that moment, while an eval wired into CI catches regressions on every change automatically. When a pull request touches a prompt, model, or tool, the suite reruns and blocks the merge if a key metric drops below an agreed threshold — treating a quality regression like a failing unit test. This is eval-driven development, and it lets a team change an LLM system with confidence.

How do I stop my LLM system from silently getting worse?

Guard against the three sources of drift. Pin to dated model snapshots in production instead of a floating "latest" alias, so provider updates are a decision you evaluate rather than one that happens to you. Run a canary on the newest model against your eval suite before promoting it. And continuously feed newly discovered production failures back into your dataset so the harness keeps pace with changing inputs. Regular evals turn silent degradation into a visible, fixable signal.
