Evals

How AI gets measured

For anyone trying to follow conversations about benchmarks, contamination, and whether a model is actually any good.

Every new AI model arrives with a wall of numbers: 92% on this, 88% on that, state-of-the-art on the other. Evals — short for evaluations — are how those numbers are produced, and the field has built a dense vocabulary for a surprisingly hard problem: working out whether a model is actually good, and proving it to someone who is sceptical. The terms below cover how models are scored, the benchmarks you will see quoted, the many ways a score can mislead, and how teams use evals to build and ship reliably.

What an eval is
Eval (Evaluation)
A structured test of what an AI model can do, run to produce a number or a judgement you can compare. Where a demo shows one impressive output, an eval runs the model across many examples and measures how often it succeeds. Evals are how the field answers "is this model actually better?" rather than "does this one answer look good?" The word covers both the test itself ("we wrote an eval for summarisation") and the discipline of measuring models at all ("evals are the bottleneck"). Good evals are the difference between knowing your model improved and hoping it did.
Benchmark
A standardised, shared eval that many models are run against so their scores can be compared on equal terms. A benchmark fixes the questions, the scoring, and the rules, so a number from one model means the same as a number from another. MMLU, SWE-bench, and GPQA are benchmarks. Their value is comparability; their weakness is that once everyone is optimising for the same fixed test, the score and the underlying ability start to drift apart. A benchmark is a shared ruler, and shared rulers get gamed.
Eval Set (Test Set)
The collection of examples a model is scored on, held strictly apart from anything used to build or train it, so the score reflects genuine ability rather than memorisation. Splitting data into a training set (to learn from) and a test set (to be judged on) is the oldest discipline in machine learning. If the test set leaks into training, the score becomes meaningless: the model is being examined on questions it has already seen the answers to.
Ground Truth
The correct answer an eval scores against: the label a human, or a trusted source, has decided is right. A model's output is compared to ground truth to decide whether it passed. The quality of an eval is capped by the quality of its ground truth: if the "correct" answers are wrong, sloppy, or contested, the eval measures agreement with bad labels rather than correctness. Also called the gold standard, or gold labels.
Task
The specific job the eval measures the model doing: summarise this document, fix this bug, answer this exam question, extract these fields. An eval is always an eval of a task. "The model is good" is meaningless; "the model is good at multi-step web research" is a claim an eval can test. Defining the task precisely is most of the work: a vague task produces a vague eval that nobody trusts.
Capability
A thing a model can do, considered independently of any single test: reasoning, coding, translation, tool use, long-context recall. Evals measure capabilities, but a capability is broader than any one benchmark for it. A model can score well on a coding benchmark and still be poor at coding in the wild — a sign the benchmark captured only a slice of the capability. The gap between "scores on the benchmark" and "has the capability" is where most arguments about AI progress actually live.
How a score is produced
Metric
The specific number an eval produces: accuracy, pass rate, F1, win rate, average score. The metric is the lens; choose the wrong one and a good model looks bad or a bad model looks good. No single metric ever captures everything, which is why serious eval suites report several. "What's the metric?" is the first question to ask of any claim that a model is better.
Accuracy
The simplest metric: the fraction of examples the model got right. Out of 1,000 questions, 920 correct is 92% accuracy. Intuitive and endlessly quoted, but it misleads when the classes are imbalanced — a model that always predicts "not spam" scores 99% accuracy on a stream that is 99% legitimate mail, while being completely useless. Accuracy is the right metric when the classes are balanced and every error costs the same, and the wrong one the moment they are not.
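The spam example can be checked in a few lines. This is a minimal sketch: the `accuracy` helper, the label names, and the 990/10 split are illustrative assumptions rather than part of any particular eval library.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal their label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Illustrative data: 990 legitimate messages ("ham") and 10 spam.
labels = ["ham"] * 990 + ["spam"] * 10

# A "model" that never flags anything as spam.
always_ham = ["ham"] * 1000

baseline_accuracy = accuracy(always_ham, labels)

# How many of the 10 spam messages did it actually catch?
spam_caught = sum(p == "spam" and y == "spam" for p, y in zip(always_ham, labels))

print(f"accuracy: {baseline_accuracy:.2%}, spam caught: {spam_caught}")
```

The headline number is 99%, yet the model catches none of the spam, which is the whole point of the filter.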
Precision and Recall
Two metrics that pull apart what accuracy hides. Precision: of the things the model flagged, how many were right — how much you can trust a positive. Recall: of the things it should have flagged, how many it caught — how little it missed. A cancer screen wants high recall (miss nothing) even at the cost of precision (some false alarms). A spam filter wants high precision (never bin real mail) even if some spam slips through. You usually trade one off against the other.
F1 Score
A single number that folds precision and recall together, by taking their harmonic mean. Used when you want one metric but care about both catching things and not crying wolf. F1 punishes a model that is brilliant on one and terrible on the other, which a plain average would quietly hide. It runs from 0 to 1, higher is better. When a paper leads with "F1," it is usually because raw accuracy would have flattered the model.
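Precision, recall, and F1 all come from the same three counts, so they are easy to compute together. The sketch below assumes a binary task with a hypothetical positive label of "spam"; the function name and the toy data are illustrative.

```python
def precision_recall_f1(predictions, labels, positive="spam"):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)  # correctly flagged
    fp = sum(p == positive and y != positive for p, y in pairs)  # false alarms
    fn = sum(p != positive and y == positive for p, y in pairs)  # misses

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy data: 10 spam messages, 90 legitimate ones.
labels = ["spam"] * 10 + ["ham"] * 90
# The model flags 5 messages; 4 are really spam, 1 is a false alarm.
predictions = ["spam"] * 4 + ["ham"] * 6 + ["spam"] * 1 + ["ham"] * 89

precision, recall, f1 = precision_recall_f1(predictions, labels)
plain_average = (precision + recall) / 2
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} plain average={plain_average:.2f}")
```

Here precision is 0.8 and recall is 0.4. The plain average is 0.6, but F1 is about 0.53: the harmonic mean is pulled toward the weaker of the two numbers.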
Pass@k
A metric for tasks where the model gets several attempts: the probability that at least one of k samples is correct. pass@1 is single-shot success; pass@10 asks whether any of ten tries worked. Common in code evals, where you can generate ten candidate solutions and keep whichever one passes the tests. A high pass@10 with a low pass@1 means the model can find the answer but not reliably — fine if you can automatically check the output, useless if you cannot.
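In practice pass@k is usually estimated rather than measured by literally drawing k samples. A common approach, introduced alongside HumanEval, is to generate n samples per problem, count the c that pass, and compute the chance that a random subset of k contains at least one pass. The sketch below assumes that estimator; the function name and the example counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c were correct.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n are all incorrect.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 20 samples generated, 2 of them pass the tests.
single_shot = pass_at_k(20, 2, 1)   # 0.10
ten_tries = pass_at_k(20, 2, 10)    # about 0.76
print(f"pass@1={single_shot:.2f} pass@10={ten_tries:.2f}")
```

A model that is right one time in ten looks very different at k=1 and k=10, which is why a pass@k figure means little unless k is stated.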
Exact Match
A scoring rule that counts an answer correct only if it matches the reference string exactly. Simple and unambiguous, but brutal: "paris", "Paris.", and "Paris, France" all fail against a reference of "Paris" unless the eval normalises the text first. Exact match works for answers with one canonical form, such as a number or a multiple-choice letter, and falls apart for anything open-ended, which is precisely why model-graded evals had to be invented.
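What "normalising the text first" means varies between evals. The sketch below assumes one simple recipe (lower-case, strip punctuation, collapse whitespace); the `normalise` and `exact_match` helpers are illustrative, not a standard.

```python
import string


def normalise(text):
    """Lower-case, drop ASCII punctuation, and collapse whitespace."""
    text = text.strip().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction, reference, normalised=False):
    """Strict string equality, optionally after normalising both sides."""
    if normalised:
        return normalise(prediction) == normalise(reference)
    return prediction == reference


reference = "Paris"
for answer in ["paris", "Paris.", "Paris, France"]:
    strict = exact_match(answer, reference)
    lenient = exact_match(answer, reference, normalised=True)
    print(f"{answer!r}: strict={strict} normalised={lenient}")
```

Normalisation rescues "paris" and "Paris." but not "Paris, France", which is correct and still scored as wrong. That residue is what pushes open-ended tasks toward rubric or model grading.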
Rubric
A written set of criteria a grader — human or model — applies to score an open-ended answer: did it cite a source, was it factually correct, did it stay on topic, was the tone right. Rubrics turn a vague "is this good?" into a checklist that different graders can apply consistently. Much of the craft of eval design is the craft of writing good rubrics: specific enough that two graders agree, general enough to cover answers you did not anticipate.
Kinds of eval
Offline / Online Eval
Offline: run the model against a fixed dataset in the lab, before release. Online: measure the model on real traffic, after release, watching what actual users do. Offline evals are fast, repeatable, and safe, but only as good as their dataset's resemblance to reality. Online evals are the real thing, but slow, noisy, and only available once you have shipped. Mature teams use offline evals to decide what to ship and online evals to learn whether they were right.
Human Eval
Evaluation where people judge the model's outputs directly: rating answers, picking the better of two, marking responses right or wrong. The gold standard for anything subjective — helpfulness, tone, writing quality — because there is no formula for "good answer." Expensive, slow, and variable (people disagree, and the same person disagrees with themselves on a different day), which is why so much eval engineering is about getting machines to approximate human judgement at scale. Not to be confused with HumanEval, the coding benchmark below.
LLM-as-Judge
Using one language model to grade another's outputs, against a rubric or by comparing two answers head to head. The workhorse of modern evals: it scales human-style judgement to thousands of examples for pennies, on open-ended tasks where exact match cannot work. The catch is that the judge has biases — it tends to prefer longer answers, its own writing style, and whichever answer it happens to see first — so a judge eval has to be validated against humans before it can be trusted. A model marking homework, which works only if you check the marker.
Programmatic Eval (code-graded)
Scoring by code rather than judgement: run the model's output through assertions, tests, or parsers and check it mechanically. Did the generated SQL return the right rows? Does the code pass the unit tests? Is the JSON valid and does it contain the required fields? Programmatic evals are fast, free, and perfectly consistent, and they are the right tool whenever "correct" can be checked by a machine. When it cannot, you fall back to human or model graders.
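The JSON check mentioned above is about the smallest useful code-graded eval. This sketch uses only Python's standard `json` module; the `grade_json_output` name, the field names, and the sample outputs are hypothetical.

```python
import json


def grade_json_output(raw_output, required_fields):
    """Return (passed, reason) for a model output that should be a JSON object."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "invalid JSON"
    if not isinstance(parsed, dict):
        return False, "top-level value is not an object"
    missing = [field for field in required_fields if field not in parsed]
    if missing:
        return False, "missing fields: " + ", ".join(missing)
    return True, "ok"


# Hypothetical model outputs for a field-extraction task.
outputs = [
    '{"name": "Ada", "email": "ada@example.com"}',
    '{"name": "Ada"}',
    "Sure! Here is the JSON you asked for.",
]
results = [grade_json_output(o, ["name", "email"]) for o in outputs]
pass_rate = sum(passed for passed, _ in results) / len(results)
print(results, f"pass rate: {pass_rate:.0%}")
```

Returning a reason alongside the verdict costs nothing and makes failures much easier to triage than a bare pass rate.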
Reference-free Eval
An eval that judges an output without a pre-written correct answer to compare against. Instead of "does this match the reference summary?" it asks "is this summary faithful to the source?" — a question a rubric or a model can answer without a gold label. Reference-free evals are essential for open-ended generation, where writing one canonical correct answer is impossible, but they are harder to trust because there is no fixed target anchoring the score.
A/B Test
Releasing two versions to real users at once — A to half, B to the other half — and measuring which produces better outcomes. The standard way to evaluate a model change in production: it controls for everything except the change itself. Used to answer questions offline evals cannot, like "do users actually prefer the new model?" The discipline is in choosing a metric that reflects real value, rather than one that is merely easy to move.
Red Teaming
Deliberately attacking a model to find where it fails: trying to make it produce harmful content, leak data, ignore its instructions, or behave unsafely. An adversarial eval, where the goal is not an average score but the discovery of worst cases. Red teaming is how safety problems are found before users find them. Its output is not a tidy number but a list of the ways the model can be broken, and how easily.
The benchmarks you'll hear named
MMLU (Massive Multitask Language Understanding)
57 subjects from elementary maths to law and medicine, asked as multiple-choice questions. For years the default measure of a model's general knowledge, quoted in every release. It is now largely saturated — top models score in the high 80s and 90s, where the remaining gap is as much about ambiguous questions and wrong answer keys as about ability. Its harder successor, MMLU-Pro, exists for exactly that reason.
GPQA (Graduate-level Google-Proof Q&A)
A few hundred multiple-choice questions in biology, physics, and chemistry, written by domain experts to be hard even for a skilled non-expert with unrestricted web access and plenty of time. Designed to resist the obvious failure of older benchmarks: that the answers can simply be looked up. The "Diamond" subset is the hardest and most quoted. GPQA is one of the benchmarks that still meaningfully separates frontier models from the rest.
SWE-bench
A coding benchmark built from real GitHub issues in popular open-source Python projects. The model is given the bug report and the codebase and must produce a patch that makes the project's own tests pass. Far more realistic than toy programming puzzles, because it tests working inside a large existing codebase. SWE-bench Verified is a 500-problem subset human-checked to be fair and genuinely solvable; it has become the headline number for agentic coding ability.
HumanEval
164 hand-written Python problems, each a function signature and docstring the model must implement, scored by running the function against hidden tests with the pass@k metric. Released by OpenAI in 2021, it defined how code models were measured for years. Now saturated — frontier models solve nearly all of it — which is exactly why the field moved on to harder, more realistic benchmarks like SWE-bench.
GSM8K and MATH
Two maths benchmarks. GSM8K is roughly 8,500 grade-school word problems testing multi-step arithmetic reasoning; MATH is 12,500 harder problems drawn from high-school mathematics competitions. Both were genuinely hard for models around 2022 and are now largely solved by frontier systems, pushing the field to harder successors like AIME and FrontierMath. The arc from GSM8K to FrontierMath is a compressed history of how fast machine maths reasoning advanced.
ARC-AGI
The Abstraction and Reasoning Corpus, designed by François Chollet as a deliberate counter to saturable benchmarks. It poses small coloured-grid puzzles where you infer a transformation rule from a handful of examples and apply it — easy for humans, historically very hard for AI, and built to resist being solved by memorisation. Run as a prize competition, it is treated by its supporters as a test of fluid intelligence rather than accumulated knowledge. ARC-AGI-2 raised the difficulty again once systems began to make progress.
Chatbot Arena (Elo)
A live leaderboard (run by LMArena, formerly LMSYS) where people are shown two anonymous models' answers to the same prompt and vote for the better one. The votes are aggregated into an Elo rating — the same system used to rank chess players. Its strength is that it measures real human preference on real prompts rather than a fixed test, so it cannot be saturated in the usual way. Its weakness is that it rewards what people like — confident, well-formatted, agreeable answers — which is not always what is correct.
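How pairwise votes turn into a rating can be sketched with the textbook Elo update from chess. The K-factor of 32 and the starting rating of 1000 are assumptions for illustration, and the way any particular leaderboard actually aggregates its votes may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    # Each side moves by k times (actual result - expected result).
    new_a = rating_a + k * (actual_a - expected_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - expected_a))
    return new_a, new_b


# Two hypothetical models start level; one vote goes to model A.
model_a, model_b = update_elo(1000, 1000, a_won=True)
print(model_a, model_b)  # 1016.0 984.0

# An upset moves ratings further than an expected result does.
upset_gain = update_elo(1000, 1200, a_won=True)[0] - 1000
routine_gain = update_elo(1200, 1000, a_won=True)[0] - 1200
print(f"upset gain={upset_gain:.1f} routine gain={routine_gain:.1f}")
```

Points are conserved between the two models, and beating a higher-rated opponent is worth more than beating a lower-rated one.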
Humanity's Last Exam (HLE)
A benchmark of a few thousand extremely hard, expert-written questions across dozens of subjects, launched in 2025 and named for its ambition: to be the last closed-ended academic exam that stays difficult as models improve. Built in response to the saturation of MMLU and its peers, where the best models had simply run out of room to distinguish themselves. The name is a bet — that broad academic knowledge is nearly solved, and the interesting frontier has moved elsewhere.
How evals go wrong
Benchmark Saturation
What happens when the best models score so high on a benchmark that it can no longer tell them apart. Once everyone is at 95%, the remaining 5% is mostly noise and bad questions, and a new "97%" means little. Saturation is the natural death of a benchmark: MMLU, HumanEval, and GSM8K all reached it. The field's response is a constant churn of harder benchmarks, which is why the headline tests change every year or two.
Contamination (Data Leakage)
When the answers to a benchmark end up in a model's training data, so it has effectively seen the exam paper in advance. Because models train on enormous web scrapes, and popular benchmarks are published on the web, contamination is pervasive and hard to rule out. A contaminated score measures memorisation, not ability. Detecting it is its own discipline — checking whether a model does suspiciously better on published questions than on freshly written ones in the same style.
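When the training corpus is available, a different and complementary heuristic is to look for long word sequences shared between a benchmark item and a training document. The sketch below assumes whitespace tokenisation and a short n-gram length chosen for the toy data; the function names and example strings are hypothetical, and real checks use longer n-grams over far larger corpora.

```python
def ngrams(text, n):
    """Set of n-word sequences in the text, lower-cased."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_document, n=8):
    """Share of the item's n-grams that also appear in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    shared = item_grams & ngrams(corpus_document, n)
    return len(shared) / len(item_grams)


question = "what is the capital city of france and when was it founded"
leaked_page = "forum post: " + question + " answer paris"
unrelated_page = "a recipe for bread needs flour water salt yeast and some patience"

leaked = overlap_fraction(question, leaked_page, n=5)
clean = overlap_fraction(question, unrelated_page, n=5)
print(f"leaked={leaked:.2f} clean={clean:.2f}")
```

A high overlap flags an item for inspection; a low one does not prove the item is clean, since paraphrased or translated copies slip past an exact n-gram match.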
Goodhart's Law
"When a measure becomes a target, it ceases to be a good measure." That is the usual phrasing of the law named after the economist Charles Goodhart, and it is the single most important idea in evals. The moment a benchmark becomes the thing everyone optimises for, models start improving the score in ways that do not improve the underlying ability, and the number stops meaning what it used to. Every benchmark is living on borrowed time the instant it becomes important. The only defence is to keep your evals private, plural, and changing.
Teaching to the Test
Goodhart's Law in practice: training a model specifically to do well on the benchmarks it will be judged by, rather than to be broadly capable. It can be deliberate (training on benchmark-style data) or accidental (the benchmark leaked into the training set). Either way the published score overstates real-world ability, and the gap shows up the first time the model meets a problem that is not shaped like the test.
Construct Validity
The question of whether an eval actually measures the thing it claims to. A benchmark labelled "reasoning" might in fact reward pattern-matching to its question format; a "helpfulness" score might mostly track answer length. Construct validity is the gap between the label on the eval and what is really being tested. Most arguments about whether a benchmark "means anything" are arguments about its construct validity, even when nobody uses the phrase.
Variance (Noise)
The amount a score moves between runs for reasons that are not real changes in ability: random sampling, prompt phrasing, the order of the options, the temperature setting. A model that scores 71% one day and 68% the next may not have changed at all. Treating small score differences as meaningful is the most common error in reading evals. Serious results report variance, run multiple times, and refuse to celebrate a one-point gain.
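One source of noise can be estimated without re-running anything: the finite size of the eval set. The sketch below uses the normal approximation to a binomial proportion and assumes a hypothetical eval of 500 questions. It captures sampling error only, not the effect of prompt phrasing, option order, or temperature.

```python
from math import sqrt


def accuracy_interval(accuracy, n_examples, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    standard_error = sqrt(accuracy * (1 - accuracy) / n_examples)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Hypothetical: 71% on a 500-question eval.
low, high = accuracy_interval(0.71, 500)
print(f"71% on 500 questions: roughly {low:.1%} to {high:.1%}")

# The same score on 10x the questions gives a much tighter interval.
low_big, high_big = accuracy_interval(0.71, 5000)
print(f"71% on 5000 questions: roughly {low_big:.1%} to {high_big:.1%}")
```

On 500 questions the interval runs from about 67% to about 75%, so a second run at 68% is well within what chance alone would produce.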
Self-preference and Position Bias
Two systematic flaws in LLM-as-judge evals. Self-preference: a model tends to rate its own outputs, and outputs in its own style, more highly than it should. Position bias: a judge shown two answers tends to favour whichever it sees first (or last), regardless of quality. Both mean a naive judge eval measures the judge's quirks as much as the answers' merits — which is why judges are calibrated against humans, prompts are randomised, and every comparison is run both ways round.
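Running each comparison both ways round can be sketched as a small wrapper. The `judge` argument is assumed to be any callable that returns "first" or "second"; the two toy judges below are stand-ins for a real model call and exist only to show the behaviour.

```python
def compare_both_orders(judge, prompt, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' after judging in both presentation orders."""
    first_pass = judge(prompt, answer_a, answer_b)   # A shown first
    second_pass = judge(prompt, answer_b, answer_a)  # B shown first

    a_wins_first_pass = first_pass == "first"
    a_wins_second_pass = second_pass == "second"

    if a_wins_first_pass and a_wins_second_pass:
        return "A"
    if not a_wins_first_pass and not a_wins_second_pass:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def always_first_judge(prompt, first, second):
    """Toy judge with pure position bias."""
    return "first"


def longer_answer_judge(prompt, first, second):
    """Toy judge that prefers the longer answer, wherever it appears."""
    return "first" if len(first) >= len(second) else "second"


prompt = "What is the capital of France?"
short, long_ = "Paris.", "The capital of France is Paris."

print(compare_both_orders(always_first_judge, prompt, short, long_))   # tie
print(compare_both_orders(longer_answer_judge, prompt, short, long_))  # B
```

Swapping the order neutralises the position-biased judge, but the length-preferring judge passes the check with a consistent verdict. Order swapping removes one bias, not all of them, which is why calibration against human labels is still needed.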
Sandbagging
A model underperforming on an eval on purpose — or being trained to — so it appears less capable than it really is. A live concern in safety evaluation: a system tested for dangerous capabilities might score low on the test and behave differently once deployed. Sandbagging turns the usual worry inside out: normally you fear a model is worse than its score suggests; here you fear it is better, and hiding it. Detecting it is an open problem.
Evals in practice
Eval-driven Development
Building AI features by writing the eval first, then changing the system until the score goes up — the AI analogue of test-driven development. Instead of tweaking prompts and judging by feel, you fix a measurable target and iterate against it. The discipline forces you to define "better" before you start, which is most of the battle. The teams that ship reliably are usually the ones that took this literally.
"Evals are the new unit tests." — a common refrain in applied AI.
Golden Dataset
A carefully curated, human-verified set of examples and correct answers, kept stable and trusted, that you evaluate against every time you change the system. The fixed point everything else is measured from. Building a good golden dataset — representative, correctly labelled, hard enough to matter — is slow, unglamorous, and the highest-leverage work in applied AI. Most teams wish they had started theirs sooner.
Annotation
The work of labelling data: writing the correct answers, rating the outputs, applying the rubric. Annotation produces the ground truth every eval depends on, and its quality sets the ceiling on everything downstream. It is painstaking human work, increasingly assisted by models, and a surprising share of an AI team's effort disappears into it. Bad annotation produces confident, precise, wrong evals.
Inter-annotator Agreement
How often two human annotators give the same label to the same example. A measure of whether a task is well defined: if your own experts only agree 60% of the time, no model can be scored reliably against their labels, and the first thing to fix is the rubric, not the model. High agreement means the task is crisp; low agreement means "correct" is contested, and the eval is built on sand.
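Raw percentage agreement is the simplest version of this measure. A common refinement, not described above, is Cohen's kappa, which discounts the agreement two annotators would reach by chance given how often each uses each label. The sketch below computes both for two hypothetical annotators; the labels and data are illustrative.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators agree."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Two hypothetical annotators rating the same 10 outputs.
annotator_a = ["good"] * 6 + ["bad"] * 4
annotator_b = ["good"] * 5 + ["bad"] * 4 + ["good"]

agreement = percent_agreement(annotator_a, annotator_b)
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"agreement={agreement:.0%} kappa={kappa:.2f}")
```

The annotators agree on 80% of items, but kappa is about 0.58, because two people using "good" 60% of the time would agree on roughly half the items by chance alone.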
Trace
The full record of what a model did to produce an output: every prompt, tool call, retrieved document, and intermediate step. Traces are what you eval against for agents and multi-step systems, where the final answer alone does not tell you whether the model got there for good reasons or by luck. Reading traces is how teams find out why a score is what it is — the difference between knowing the model failed and knowing where it failed.
Vibe Check
Informal, impressionistic evaluation: trying the model on a handful of prompts and forming a gut sense of whether it is good. Universally practised, frequently disparaged, genuinely useful as a first signal and useless as a basis for a decision. "It passes the vibe check" is half a joke and half an admission that the rigorous eval has not been built yet. Vibes find problems; numbers settle arguments.
Regression Eval
Re-running your evals after a change to make sure nothing that used to work has broken. A model or prompt update that improves one thing often quietly breaks another; the regression eval is the safety net that catches it before users do. Borrowed straight from software testing, where a "regression" is a bug in something that previously worked. Without one, every improvement is a gamble.
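A regression check works on per-example results, not just the headline score. The sketch below assumes results are stored as a mapping from a hypothetical example id to pass or fail; the function name and data are illustrative.

```python
def find_regressions(before, after):
    """Example ids that passed before the change and do not pass after it."""
    return sorted(
        example_id
        for example_id, passed in before.items()
        if passed and not after.get(example_id, False)
    )


def pass_rate(results):
    return sum(results.values()) / len(results)


# Hypothetical results from the same eval, before and after a prompt change.
before = {"q1": True, "q2": True, "q3": False}
after = {"q1": True, "q2": False, "q3": True}

regressions = find_regressions(before, after)
print(f"before={pass_rate(before):.0%} after={pass_rate(after):.0%} "
      f"regressions={regressions}")
```

The overall pass rate is identical before and after, yet `q2` has quietly broken. Comparing per-example results is what surfaces it; an example missing from the new run is treated as a failure here, which is a deliberately cautious choice.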
Guardrails
Checks that sit around a deployed model and block or correct bad outputs in real time: filters for unsafe content, validators that reject malformed responses, classifiers that catch off-topic answers. Related to evals — they often reuse the same graders — but they run live on every request rather than offline on a test set. An eval tells you how often the model fails; a guardrail catches the failure as it happens.
Safety and capability evals
Dangerous Capability Evals
Tests designed to measure whether a model can do something genuinely harmful: help synthesise a weapon, write effective malware, run a sophisticated cyberattack, or deceive and manipulate at scale. The point is not to score high but to know, before release, whether a capability that warrants restriction is present. These evals anchor the safety commitments the major labs have made — the threshold a model must stay under, or the safeguards it triggers if it does not.
Elicitation
The effort to draw out the best a model can do on a task, rather than the first thing it offers: better prompting, tool access, multiple attempts, fine-tuning. Elicitation matters most in safety evals, where you want the true ceiling of a dangerous capability, not a number that looks reassuringly low only because nobody tried hard enough. "We didn't find the capability" means little without "and we tried properly" — under-elicitation is how a real danger gets missed.
Agentic Evals
Evaluating a model acting as an agent — using tools, taking many steps, working in a real environment — rather than answering a single question. Far harder to build than question-and-answer benchmarks, because the model can succeed or fail in many more ways, and the environment has to be realistic enough to matter. SWE-bench is an agentic eval; so are the benchmarks that drop a model into a simulated computer and give it a goal. This is where eval design is currently hardest and moving fastest.
Sycophancy
A model's tendency to tell you what it thinks you want to hear: agreeing with a stated opinion, caving the moment it is challenged, flattering rather than correcting. Evals measure it directly because naive evals reward it: human raters and preference leaderboards both lean toward agreeable answers, which trains models toward sycophancy unless it is explicitly tested for. A model that always agrees with you is comfortable, and useless.
Jailbreak Eval
Testing how easily a model's safety training can be bypassed by adversarial prompts: role-play framings, encoded instructions, or elaborate hypotheticals that coax it past its guardrails. It measures robustness, not raw capability — a model can refuse a harmful request in the obvious form and comply when it is dressed up. Jailbreak evals are a permanently moving target, because every defence that gets published becomes the next attack's starting point.
Responsible Scaling Policy (RSP)
A published commitment by an AI lab that ties a model's release to the results of its safety evals: defined capability thresholds, the safeguards required at each, and the testing that must pass before training or deployment continues. Anthropic's RSP and the equivalent frameworks at other labs make evals load-bearing in the most literal sense — the eval result is the thing that gates the release. It is the clearest example of evals being used not to rank models, but to decide whether they ship at all.