Shipping an LLM-powered feature to production without evals is like deploying software without tests: it kind of works, hopefully, until it doesn't, and you don't even know how it can fail. This whole article assumes you wouldn't deploy untested code, and it's going to teach you how to apply that same rigor you expect from tests to Generative AI applications.

Traditional code is deterministic: for the same input you get the same output, every time. LLMs aren’t. They sample from a probability distribution over an effectively infinite response space, and don't always produce the same output.

For deterministic tasks that's an implementation challenge, but not really a testing challenge. As long as you know what the expected output should be, you can just test for actual_output == expected_output and easily know whether the LLM says 2+2 equals 4 or 5. The challenge comes when you want that variation in the output, when you can't define the expected output because it's not a single thing, but rather a potentially infinite subset of an even larger, potentially infinite space of possible outputs. For example, suppose you ask an AI to write this article: what value would you assign to expected_output in that case?

Btw, no, AI didn't write this article. I did ask, many times. It's still not there yet, not even close. But that's a separate discussion.

So, how do we test that the article the LLM writes is “good”, if we don't have a single value or a finite set of values to compare to? We don't test it, we evaluate it. With evaluations (evals, once you get familiar with them).

How to Measure Quality in Generative AI Applications

If you ask me to tell you if something is “good”, the first thing I'll do is ask you what you mean by “good”. We can't define a golden answer (that's what you'd call that expected_output), but we should be able to identify a set of measurable characteristics that we can use to determine whether any given answer is “good” or not, in a repeatable way.

Some of these characteristics may be factual accuracy, harmlessness, contextual relevance, or even overall tone. And for each characteristic we need to define how we're going to measure it.

Depending on the metric you're evaluating, you'll have to pick one of these two:

  • Reference-based evaluation compares LLM output against predefined ground truth or golden answers. It's great when you can come up with several examples of a good answer, though creating this golden dataset is not as trivial as it sounds: you need to carefully choose examples that represent all the ways in which an answer can be “good” for that specific characteristic.

  • Reference-free evaluation evaluates intrinsic qualities without comparing the output to a specific reference. It's a good fit for open-ended questions where the quality is judged on criteria like coherence, relevance, or adherence to guidelines.

And you can pick how you evaluate it (note that the most nuanced, semantic criteria generally can't be measured with the first option below):

  • Automated evaluation uses algorithmic metrics like the presence of certain words or repetition of patterns. These are much faster and cheaper to calculate than more nuanced metrics that rely on semantics, but they're also much less reliable. They're usually a good complement.

  • Human evaluation consists of having a human judge the outputs based on certain guidelines. This is a tried and true method, literally what we've been doing in education, interviews, and so on for centuries. It's also slow and expensive, so we generally try to avoid it, but it's useful to know that it's there. Tools like https://lmarena.ai/ use it to generate benchmarks for models.

  • Model-based evaluation is like human evaluation, but using an LLM as a judge (hence why we often call it LLM-as-a-Judge). The Judge LLM can understand nuance at a similar level to a human, and we can encode the evaluation guidelines into the prompt. This can be somewhat expensive, to the tune of a few cents or even a few dollars for every run (which is a lot if you compare it to running unit tests), but it's much cheaper than human evaluation.

Let's dive a bit deeper into those.

Automated Evaluation and Programmatic Metrics

Programmatic metrics are scalable, repeatable, very cheap to run, and they catch regressions fast. You'll thank me when you accidentally break formatting in a way that causes everything to crash, and detect it via programmatic metrics. They tend to work reasonably well for Reference-based evaluation.
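As a trivial example, here's the kind of cheap programmatic check I mean. It's a sketch: the required fields and the length bounds are made-up assumptions you'd adapt to your own application.

```python
import json

def check_format(output: str) -> list[str]:
    """Cheap, deterministic checks that catch formatting regressions fast."""
    failures = []

    # Assumption: the model is supposed to answer with valid JSON
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    # Placeholder fields we expect every answer to contain
    for field in ("answer", "sources"):
        if field not in parsed:
            failures.append(f"missing field: {field}")

    # Crude guard against empty or runaway generations (bounds are arbitrary)
    if not (50 <= len(output) <= 4000):
        failures.append("output length out of expected range")

    return failures
```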

Foundational lexical metrics

These metrics check for lexical overlap, quantifying the similarity between model-generated text and the references you provide by counting shared word sequences. Basically, they look at the text without really understanding it, at all. They're older than LLMs, but they're still reasonably useful and very cheap.

The most common are BLEU (Bilingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and METEOR (Metric for Evaluation of Translation with Explicit Ordering). There are a few more, but honestly they're boring to explain.
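If you want to play with these, libraries like nltk and rouge-score implement them. A minimal sketch, with made-up candidate and reference strings:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
candidate = "A cat was sitting on the mat."

# BLEU works on token lists; smoothing avoids zero scores on short texts
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.2f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.2f}")
```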

Embedding-based metrics

Lexical metrics fail because different words can mean the same thing. A pretty cool idea to beat that is embedding-based methods, which try to score meaning similarity instead of string similarity.

BERTScore generates contextual embeddings for each token in candidate and reference texts. It computes cosine similarity between each candidate token and the most similar reference token, and then aggregates these scores to calculate embedding-based precision (how well candidate tokens are supported by reference), recall (how well reference tokens are captured by candidate), and F1 score (the harmonic mean of precision and recall).

The advantage is that it recognizes paraphrases and synonyms as high-quality matches instead of requiring literal word overlap. The main limitation is that a one-word difference will still get a high score, even if that word is “no” and it flips the meaning of the sentence as a whole.
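The bert-score package wraps all of that behind a single call. A minimal sketch (the first run downloads a model, and the English language flag is an assumption):

```python
from bert_score import score

candidates = ["The weather is not good today."]
references = ["The weather is good today."]

# Returns precision, recall and F1 tensors, one entry per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en")

# The score stays high despite the negation, illustrating the limitation above
print(f"BERTScore F1: {F1[0].item():.3f}")
```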

LLM-as-a-Judge Evaluation

Ideally you'd have reference outputs for everything. However, in some use cases such as very open-ended tasks like creative writing, or when the input space is very large like in multi-turn conversations, a single golden answer can't possibly cover everything that a “good” answer should have. In those situations human evaluation would be the best, if it wasn't so slow and expensive. So we tend to prefer LLM-as-a-Judge evaluations.

The idea is to leverage the contextual understanding capabilities of modern LLMs to perform nuanced qualitative assessments. This LLM will evaluate outputs where “good” is subjective and depends on many factors such as helpfulness, coherence, creativity, or adherence to a certain tone.

However, you can't just send a prompt saying “Tell me if this answer is good”. You need your Judge prompt to include:

  • The original prompt that produced the output you're evaluating

  • The output you're evaluating

  • The criteria that output should be evaluated on, including the scale (I recommend binary YES-NO, 1 to 3, or 1 to 5)

  • Examples of what each number on the scale means. That is, an output that would be scored 1, one that would be scored 2, and one that would be scored 3. You shouldn't use more than 3 to 5 examples in total, even if you're evaluating multiple criteria

The output of the Judge should include for each criterion:

  • The name of the criterion, e.g. clarity, completeness, etc

  • The score

  • The reasoning

That output should be in JSON, so you can parse it programmatically and calculate metrics like number of evals failed, and use that to fail CI pipelines and create reports just like you do for tests.

Here's an example of a Judge prompt:

Evaluate whether a customer service response provides actionable guidance.

CRITERION

Actionability: Does the response provide clear, specific steps the customer can immediately take?

- YES: Includes concrete actions with sufficient detail to execute
- NO: Remains vague, theoretical, or lacks practical guidance

EVALUATION PROCESS

Think step-by-step:
1. Identify what action the customer needs to take
2. Check if the response provides specific, executable steps
3. Note any vague language ("soon", "should", "might") that reduces actionability
4. Render your verdict with supporting evidence

OUTPUT FORMAT

{
  "score": "YES" | "NO",
  "reasoning": "Brief explanation with specific evidence"
}

CALIBRATION EXAMPLES

Example 1 - YES:
- Query: "How do I reset my password?"
- Response: "Click 'Forgot Password' on the login page, enter your email ([email protected]), then check your inbox for a reset link valid for 2 hours."
- Evaluation: `{"score": "YES", "reasoning": "Provides 3 specific steps with concrete details (button name, which email, time limit)"}`

Example 2 - NO:
- Query: "How do I reset my password?"  
- Response: "You can reset your password through our account recovery process. Let me know if you need help!"
- Evaluation: `{"score": "NO", "reasoning": "Says what is possible but provides no steps on how to actually do it"}`

---

NOW EVALUATE

Customer Query: {CUSTOMER_MESSAGE}

Response: {RESPONSE}

Your Evaluation:

You're going to use that Judge prompt to evaluate Actionability. Once you have it, you'll need to collect several relevant CUSTOMER_MESSAGE and RESPONSE pairs, and those are going to be your test cases of sorts. But before you're ready to run these, you need to make sure the Judge agrees with you on your definition of “good”.
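When you do run them, the harness itself can be very simple. Here's a minimal sketch; call_judge is a hypothetical helper standing in for whatever model API you use, and the prompt file name is made up:

```python
import json

# Hypothetical helper: send the filled-in Judge prompt to your judge model
# (Bedrock, OpenAI, Anthropic, ...) and return the raw text of its reply.
def call_judge(prompt: str) -> str:
    raise NotImplementedError

# Assumption: the Judge prompt above, saved to a file with the placeholders intact
JUDGE_PROMPT = open("actionability_judge.txt").read()

# Your collected test cases: (customer message, model response) pairs
test_cases = [
    ("How do I reset my password?", "You can reset it through our account recovery process."),
    # ...
]

failures = []
for customer_message, response in test_cases:
    prompt = (JUDGE_PROMPT
              .replace("{CUSTOMER_MESSAGE}", customer_message)
              .replace("{RESPONSE}", response))
    verdict = json.loads(call_judge(prompt))  # the Judge answers in JSON, so parse it
    if verdict["score"] != "YES":
        failures.append((customer_message, verdict["reasoning"]))

print(f"{len(failures)} of {len(test_cases)} evals failed")
```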

Calibrating the Judge

LLM-as-a-Judge evaluation mainly consists of three steps (which you repeat for each evaluation criterion):

  1. Figuring out what a “good” answer looks like, to the point where you can look at several answers and reliably answer whether they're good or not

  2. Collecting sufficient queries and responses to cover many use cases relevant to your application

  3. Creating a Judge prompt that gives those responses the same score you would give them (this is calibrating the Judge)

That's why you want that “reasoning” field in the response. You need to make sure the Judge's score agrees with your score. You'll find that the main limitation to this isn't really in how you write the Judge prompt, but rather in your own understanding of why you would pick a certain score, and overall what a “good” answer looks like.
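A simple way to check that agreement is to label a sample of responses yourself first and compare your labels with the Judge's. The labels below are made up for illustration:

```python
# Your own verdicts for a sample of responses, next to what the Judge returned
human_labels = ["YES", "NO", "YES", "YES", "NO", "YES"]
judge_labels = ["YES", "NO", "NO", "YES", "NO", "YES"]

agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
print(f"The Judge agrees with you on {agreement:.0%} of the sample")
```

If agreement is low, read the Judge's reasoning for the disagreements before touching the prompt.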

LLM Evaluation Frameworks and Tools

Amazon Bedrock Evaluations is AWS's option for evals. You can use Amazon Bedrock directly to generate the outputs and evaluate them, or you can Bring Your Own Inference, importing a JSONL file with your prompts and outputs, and letting Amazon Bedrock run automated metrics and LLM-as-a-Judge evaluations.

Another good option is LangChain's OpenEvals, especially if you're using LangChain already. Though if you're using Python I prefer DeepEval, which lets you treat evals like unit tests (they're more akin to functional tests in my opinion, but that's beside the point, they're definitely tests). And if you like TypeScript, I've used Promptfoo with a couple of customers and it works very well. There are more tools around though. Treat them like you'd treat a unit testing framework: try a few if you want, pick one, and mainly focus on the tests themselves (or the evals themselves in this case) instead of on the framework.
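To give you a feel for the pytest-style workflow, here's roughly what a DeepEval test looks like. This is a sketch based on its documented usage (it needs an LLM configured as the judge behind the scenes), so check the current docs for the exact API:

```python
# pip install deepeval -- run with `pytest` or `deepeval test run`
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_password_reset_answer():
    test_case = LLMTestCase(
        input="How do I reset my password?",
        actual_output="Click 'Forgot Password' on the login page and follow the emailed link.",
    )
    # Fails the test (and your CI) if the relevancy score drops below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```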

Evals in the Development Lifecycle

Evals are tests for LLMs. That's the role they play in Generative AI applications. And like tests, the real value isn't in using them once to test one single version, but in running them every single time you make a change, giving you confidence that you're not inadvertently breaking something.

The difference between evals and tests (besides evals being a lot harder to write) is that you shouldn't aim for a perfect score with evals. If you get a perfect score, it's likely that your evals are not testing some edge cases.

Instead of aiming for a perfect score, define a threshold of acceptable quality. For each metric you can set a minimum value that every eval must meet, e.g. a minimum score of 4 on a 1-5 scale, plus a threshold for the average score across all evals in that metric, such as 4.5/5. This gives you a reliable quality gate that lets you catch regressions.
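In code, that quality gate can be as simple as this sketch (the scores would come from parsing your Judge's JSON outputs):

```python
import sys

# Per-eval scores for one metric, parsed from the Judge's JSON outputs (1-5 scale)
scores = [5, 4, 5, 4, 4, 5, 4]

MIN_SCORE = 4        # every single eval must reach this
MIN_AVERAGE = 4.5    # and the suite as a whole must reach this

average = sum(scores) / len(scores)
if min(scores) < MIN_SCORE or average < MIN_AVERAGE:
    print(f"Quality gate failed: min={min(scores)}, avg={average:.2f}")
    sys.exit(1)  # non-zero exit code fails the CI job

print(f"Quality gate passed: min={min(scores)}, avg={average:.2f}")
```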

Moreover, it gives you room to improve upon your prompts. You can calculate your evals scores for the current version of a prompt, make some changes, and run the evals suite again. Comparing the score will tell you if the change produces better, worse, or equivalent results. You can use this to improve your evals scores, to deal with specific edge cases without regressing on other use cases, and even to try to reduce your prompt without losing quality.

You should also use evals to evaluate models. Everyone knows Claude Opus 4.5 is “better” than Claude Sonnet 4.5, though it's also more expensive. With a good evals suite you can get a score for both models and determine precisely how much better Opus is for your specific application, and make an informed decision based on price and actual performance for you, not on a generic and probably gamed benchmark. Again, use this to improve the scores, or to find the cheapest model that gives you acceptable scores.
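Here's a sketch of that comparison, assuming a hypothetical run_eval_suite helper that runs your whole suite against a given model and returns the average score (the model identifiers are placeholders):

```python
# Hypothetical helper: runs the full evals suite against one model and
# returns the average score across all evals.
def run_eval_suite(model_id: str) -> float:
    raise NotImplementedError

# Placeholder model identifiers; use whatever your provider expects
candidates = ["claude-opus-4-5", "claude-sonnet-4-5"]

results = {model: run_eval_suite(model) for model in candidates}
for model, avg in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {avg:.2f}")
# Then pick the cheapest model whose score still clears your quality gate.
```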

Of course, don't (or don't just) do this on your computer as an experiment. To be effective at catching regressions, evals should be a quality gate in your CI pipeline, just like normal tests are. Pick a threshold and commit to never going below that level of quality. If a change pulls you below it, it's a regression and it should be treated as equivalent to introducing a bug.

Conclusion

The LLM is part of your Generative AI application, and evals are the only tests you can write for it. Moreover, as you write your evals you'll find that defining what “good” means helps you improve your application significantly, at least in my experience.

Treat evals like tests, use them to catch regressions, and commit to a certain level of quality, failing your CI if you go below it. Vibes are fine to get started, but if you want to build serious software, make quality measurable in a repeatable way. And evaluate it. With evals.
