Evals Are Not Optional

Shipping a model is not finishing. It is starting.

The moment a model is deployed, the question that matters is not "how did it score on MMLU?" — it is "does it work, reliably, on the actual inputs your users send?" Those are different questions. Conflating them is how teams end up confident in benchmarks that measure nothing they care about.

Evals are not optional. They are the only honest feedback loop available to an AI engineer.

---

What an Eval Actually Is

An eval is a reproducible measurement of model behaviour on a defined input distribution.

Three words matter: reproducible, defined, behaviour.

Reproducible: you can run it again tomorrow and get a comparable result. If you cannot, you have anecdote, not data.
Defined: the input distribution reflects the task you actually care about, not the task that is convenient to score.
Behaviour: what the model does, not what it knows in the abstract.

A benchmark score is an eval on someone else's defined distribution. It may or may not overlap with yours. Importing a number from a leaderboard and calling it your eval is not evaluation. It is wishful thinking.

---

The Contamination Problem

Every widely used benchmark is partially contaminated.

This is not speculation. It is measured.

MMLU, the most-cited benchmark in model releases, has reported overlap scores of 0.52–0.69 with common training corpora (The Pile, C4). After leaked instances are filtered out of the STEM subjects, accuracy drops by up to 8 percentage points. Independent of contamination, the benchmark also contains outright errors: 57% of the analysed questions in the Virology subset have problems such as incorrect ground-truth labels or scraping artifacts.

HellaSwag shows contamination signals across every detection method applied. GSM8K has approximately 5% serious errors. XSUM reference summaries are rated worse than the model outputs they are supposed to grade.
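
One common family of detection methods checks for n-gram overlap between benchmark items and training text. The sketch below is a simplified illustration of that idea, not the procedure used in any of the studies cited here. The function names, the whitespace tokenizer, and the choice of n = 4 are assumptions made for the example.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Return the set of word-level n-grams in text (lowercased, whitespace split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_corpus_index(documents: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-gram seen in the (hypothetical) training documents."""
    index: Set[NGram] = set()
    for doc in documents:
        index |= ngrams(doc, n)
    return index


def overlap_ratio(item: str, corpus_index: Set[NGram], n: int) -> float:
    """Fraction of the item's n-grams that also occur in the corpus index."""
    item_ngrams = ngrams(item, n)
    if not item_ngrams:
        return 0.0  # item has fewer than n tokens, so there is nothing to compare
    return len(item_ngrams & corpus_index) / len(item_ngrams)


# Toy data, invented for illustration only.
corpus = ["the mitochondria is the powerhouse of the cell and produces ATP"]
index = build_corpus_index(corpus, n=4)
print(overlap_ratio("The mitochondria is the powerhouse of the cell", index, n=4))  # 1.0
print(overlap_ratio("Which organelle is responsible for protein synthesis", index, n=4))  # 0.0
```

A high ratio is a signal to investigate, not proof of leakage. Paraphrased or translated copies of a benchmark item will not show up in an exact n-gram check at all.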

The benchmarks were not built to be gamed. They were built as proxies. But once a proxy is used for optimization, it ceases to measure what it was a proxy for. This is Goodhart's Law, applied to intelligence measurement.

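Cleaner alternatives are emerging: MMLU-CF is built to be contamination-free, MMLU-Redux re-annotates a subset of MMLU to correct label errors, and MMLU-Pro raises the difficulty and widens the answer choices from four to ten. The structural problem persists, though: any static benchmark becomes stale the moment a model is trained near its distribution.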

The conclusion is not that benchmarks are useless. The conclusion is that benchmarks are not evals. They are inputs to a judgement, not the judgement itself.

---

The Three Purposes of Evaluation

Not all evals serve the same goal. Conflating them creates confusion about what to measure and how.

1. Capability Assessment

What can this model do? Academic benchmarks (MMLU, GPQA Diamond, HumanEval, MATH) are appropriate here — with the caveat that contamination limits absolute claims. Their value is relative: is this checkpoint better than the previous one?
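
A relative comparison is only meaningful if the gap is larger than the noise in the benchmark sample. One way to check is a paired bootstrap over per-item results, sketched below. The function name `paired_bootstrap_diff`, the 2,000 resamples, and the 95% percentile interval are illustrative choices, not something prescribed by any particular benchmark.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(
    old_correct: List[int],
    new_correct: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed_diff, lower, upper) for accuracy(new) - accuracy(old).

    Both lists hold 1 (correct) or 0 (incorrect) for the same items in the
    same order. The interval is a 95% percentile bootstrap over items.
    """
    if len(old_correct) != len(new_correct) or not old_correct:
        raise ValueError("need two non-empty lists of equal length")
    n = len(old_correct)
    # Per-item difference: +1 newly correct, -1 newly wrong, 0 unchanged.
    per_item = [new - old for old, new in zip(old_correct, new_correct)]
    observed = sum(per_item) / n

    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = []
    for _ in range(n_resamples):
        sample = [per_item[rng.randrange(n)] for _ in range(n)]
        diffs.append(sum(sample) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


# Invented results for 100 items: the new checkpoint fixes 10 and breaks none.
old = [1] * 60 + [0] * 40
new = [1] * 70 + [0] * 30
print(paired_bootstrap_diff(old, new))
```

If the interval comfortably excludes zero, the new checkpoint is probably better on this benchmark. If it straddles zero, the two checkpoints are not distinguishable at this sample size.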

2. Task Fitness

Does this model perform on my specific task? This requires task-specific evals you build. A model that scores 80% on MMLU may fail catastrophically on your structured extraction pipeline. You do not know until you measure.

3. Regression Detection

Did this change make things worse? CI-integrated evals catch regressions before they reach production. This is the most actionable evaluation type and the most neglected.
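
An aggregate score can stay flat while individual examples flip from pass to fail, so regression detection works best at the level of single examples. The sketch below assumes each run is stored as a mapping from example ID to pass/fail. The function name `find_regressions` and that storage format are hypothetical.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, bool], candidate: Dict[str, bool]
) -> Dict[str, List[str]]:
    """Compare per-example pass/fail results keyed by example ID."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        # Passed before the change, fails after it.
        "regressed": [k for k in shared if baseline[k] and not candidate[k]],
        # Failed before the change, passes after it.
        "fixed": [k for k in shared if not baseline[k] and candidate[k]],
        # Present in the baseline run but absent from the candidate run.
        "missing": sorted(baseline.keys() - candidate.keys()),
    }


# Both runs score 2 out of 3, yet one example regressed and one was fixed.
baseline = {"ex-1": True, "ex-2": True, "ex-3": False}
candidate = {"ex-1": True, "ex-2": False, "ex-3": True}
print(find_regressions(baseline, candidate))
```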

---

Building a Useful Eval Pipeline

Step 1: Define the task precisely

What is the model supposed to do? What counts as correct? If you cannot write a grader, your task is underspecified. Underspecified tasks produce unmeasurable evals.

Step 2: Collect representative inputs

Not cherry-picked inputs. Not the happy path. The actual distribution of requests — including the edge cases, the ambiguous cases, the adversarial inputs your users will eventually send.

At minimum: 50–100 examples for a coarse signal. 500+ for anything you are making a production decision on.
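
The reason for these sizes is statistical: the uncertainty around a pass rate shrinks only with the square root of the number of examples. The sketch below uses the Wilson score interval, a standard choice for binomial proportions, to show the effect. The function name and the 95% confidence level (z = 1.96) are choices made for this example.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of passed/total (95% by default)."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need total > 0 and 0 <= passed <= total")
    p = passed / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return max(0.0, centre - half), min(1.0, centre + half)


# The same observed 80% pass rate at two sample sizes.
for passed, total in [(40, 50), (400, 500)]:
    low, high = wilson_interval(passed, total)
    print(f"{passed}/{total}: {low:.3f} to {high:.3f}")
```

At 40 out of 50 the interval runs from roughly 67% to 89%, which is too wide to detect a five-point regression. At 400 out of 500 it narrows to roughly 76% to 83%.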

Step 3: Choose a grader

Three grader types exist:

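String graders: exact or regex match. Fast, deterministic. Appropriate when the correct answer has one canonical form: labels, identifiers, structured output, factual extraction with known answers. For code or SQL, where many different strings are equally correct, exact match is brittle and a programmatic grader that executes the output is usually the better fit.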
Model graders: use an LLM to judge the output against a rubric. Appropriate for open-ended generation, reasoning steps, summarization quality. Introduce their own bias — a model judging itself or a sibling model inflates scores.
Programmatic graders: custom Python logic. Appropriate for anything that can be unit-tested — JSON schema validity, function call correctness, constraint satisfaction.

Use the simplest grader that captures the failure mode you care about. Model graders are not a substitute for precision.
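
A minimal sketch of the three grader types follows. All function names are hypothetical, and the `judge` argument stands in for whatever LLM call you use: it is assumed to take a prompt string and return the judge's text reply. No real judge API is implied.

```python
import json
import re
from typing import Callable, Set


def exact_match_grader(output: str, expected: str) -> bool:
    """String grader: exact match after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


def regex_grader(output: str, pattern: str) -> bool:
    """String grader: pass if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def json_keys_grader(output: str, required_keys: Set[str]) -> bool:
    """Programmatic grader: output must be a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


def model_grader(output: str, rubric: str, judge: Callable[[str], str]) -> bool:
    """Model grader: ask a judge (assumed callable) for a PASS or FAIL verdict."""
    prompt = f"Rubric:\n{rubric}\n\nOutput:\n{output}\n\nAnswer PASS or FAIL."
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("PASS")


print(exact_match_grader(" 42\n", "42"))
print(json_keys_grader('{"name": "Ada"}', {"name", "age"}))
```

The first three graders return the same answer every time for the same input. The model grader does not, which is one more reason to reach for it last.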

Step 4: Version everything

The eval dataset. The grader. The model. The prompt template. If any of these change without documentation, comparative results are worthless. lm-evaluation-harness encodes this discipline — task versioning is first-class.
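
If you are not using a harness that versions tasks for you, a lightweight alternative is to store a fingerprint of each component next to every score. The sketch below is one possible approach. The manifest fields and function names are assumptions, not a format defined by lm-evaluation-harness or any other framework.

```python
import hashlib
from typing import Dict, List


def fingerprint(text: str) -> str:
    """Short SHA-256 digest of a component's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def build_manifest(
    dataset_text: str, grader_source: str, prompt_template: str, model_id: str
) -> Dict[str, str]:
    """Record what a score was computed against."""
    return {
        "dataset_sha": fingerprint(dataset_text),
        "grader_sha": fingerprint(grader_source),
        "prompt_sha": fingerprint(prompt_template),
        "model_id": model_id,
    }


def changed_fields(a: Dict[str, str], b: Dict[str, str]) -> List[str]:
    """List the manifest fields that differ between two runs."""
    return [key for key in sorted(a) if a[key] != b.get(key)]


# Placeholder contents, invented for illustration.
run_a = build_manifest("dataset v1", "def grade(): ...", "Answer: {question}", "model-a")
run_b = build_manifest("dataset v1", "def grade(): ...", "Q: {question}\nA:", "model-a")
print(changed_fields(run_a, run_b))  # ['prompt_sha']
```

Two scores are directly comparable only when the list of changed fields contains exactly the thing you meant to change.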

Step 5: Run in CI

Evals that only run before a release catch release-time regressions. Evals that run on every significant change catch the actual regressions — the ones introduced three commits ago by a prompt tweak you thought was harmless.
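
In practice a CI eval step often reduces to a script that exits non-zero when the score drops below the recorded baseline. The sketch below shows the shape of such a gate. The names `gate` and `main`, the argument order, and the 0.02 tolerance are all hypothetical; a real pipeline would read both scores from stored eval reports.

```python
from typing import List


def gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass if the current score is no more than `tolerance` below the baseline."""
    return current >= baseline - tolerance


def main(args: List[str]) -> int:
    """Return a process exit code: 0 pass, 1 regression, 2 usage error.

    Expects two arguments: the baseline score, then the current score.
    """
    if len(args) != 2:
        print("usage: eval_gate.py <baseline_score> <current_score>")
        return 2
    baseline, current = float(args[0]), float(args[1])
    if gate(current, baseline):
        print(f"OK: {current:.3f} vs baseline {baseline:.3f}")
        return 0
    print(f"REGRESSION: {current:.3f} vs baseline {baseline:.3f}")
    return 1


# In a CI script, the last line would be:
#     sys.exit(main(sys.argv[1:]))
print(main(["0.82", "0.75"]))
```

The tolerance exists because small evals are noisy. Set it from the observed run-to-run variance rather than picking a number that makes the build green.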

---

The Frameworks

Three frameworks dominate open-source evaluation:

EleutherAI lm-evaluation-harness

The de facto standard for academic benchmarking. Unified interface across 200+ tasks. Task versioning built in. Integrates with vLLM and HuggingFace TGI as backends. If you are comparing models or reporting numbers publicly, use this. Reproducibility is enforced by design.

HELM (Holistic Evaluation of Language Models)

Developed at Stanford. Broader than lm-eval: it benchmarks not just accuracy but also calibration, robustness, fairness, efficiency, and bias. Useful when the question is "how does this model behave across a wide surface?" rather than "does it beat the previous checkpoint on one task?"

OpenAI Evals

Framework designed for CI/CD integration and production monitoring. Supports string, model, and Python graders. Best suited for task-specific pipelines where the goal is regression detection, not academic ranking.

These frameworks are complementary, not competing. A rigorous shop uses all three at different stages.

---

What Not To Do

Do not report a single number. Models are not scalar. A single accuracy figure hides failure modes on subsets that matter. Disaggregate by domain, difficulty, and input type.
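
Disaggregation needs nothing more than a tag on each example. The sketch below assumes results arrive as (slice name, passed) pairs; the function name and the slice labels are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def accuracy_by_slice(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[float, int]]:
    """Return {slice_name: (accuracy, example_count)}, sorted by slice name."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for slice_name, passed in results:
        counts[slice_name][0] += int(passed)  # number of passes
        counts[slice_name][1] += 1            # number of examples
    return {name: (ok / n, n) for name, (ok, n) in sorted(counts.items())}


# Invented results: 50 of 60 pass overall (83%), which hides a weak slice.
results = (
    [("invoices", True)] * 45 + [("invoices", False)] * 5
    + [("contracts", True)] * 5 + [("contracts", False)] * 5
)
for name, (acc, n) in accuracy_by_slice(results).items():
    print(f"{name}: {acc:.0%} of {n}")
```

Reporting the example count next to each accuracy matters: a slice with ten examples deserves far less trust than one with five hundred.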

Do not evaluate only on your training distribution. If your eval set was assembled by the same process that generated your training data, you are measuring memorization.

Do not use model-graded evals as the only signal for open-ended tasks. Models tuned with RLHF are optimized toward text that judges rate highly, and a model judge shares many of the same preferences, so the two can agree while both are wrong. Human spot-checks are not optional for critical tasks.
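
A spot-check is most useful when it is quantified. One option is to have a human label a sample of the outputs the judge has already graded and compute Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes binary pass/fail labels; the function name is hypothetical.

```python
from typing import List


def cohens_kappa(human: List[bool], judge: List[bool]) -> float:
    """Cohen's kappa for two raters giving binary pass/fail labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n  # fraction the human passed
    p_judge = sum(judge) / n  # fraction the judge passed
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one identical label to every item. Kappa is
        # formally undefined here; this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels: the judge passes 9 of 10 outputs, the human only 6.
human = [True] * 6 + [False] * 4
judge = [True] * 9 + [False] * 1
print(round(cohens_kappa(human, judge), 3))  # 0.286
```

Here the raw agreement is 70%, but kappa is about 0.29, which indicates a lenient judge whose verdicts should not be trusted on their own.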

Do not treat absence of regression as presence of quality. Not regressing means you did not get worse. It does not mean you are good.

Do not skip evals because the task is hard to grade. If the task is hard to grade, the model's outputs are also hard to trust. The difficulty of the grader is a signal about the reliability of the deployment.

---

Evals as Engineering Practice

The engineering framing is more useful than the research framing.

In research, evals answer: "Is this model good?"

In engineering, evals answer: "Is this system doing what it is supposed to do, and did this change break anything?"

That is a question every production system must answer continuously. A model in production is a component with a contract. The contract is: given these inputs, produce these outputs, within these quality bounds. Evals are how you verify the contract is being met.

Teams that skip evals are not moving faster. They are accumulating debt that will be paid in production incidents, user complaints, and silent degradations they have no instrumentation to detect.

---

The Minimal Viable Eval

If you have nothing today, start here:

1. Pick 50 real inputs from your production logs or expected use case.

2. Label the expected output or outcome for each.

3. Write a string or programmatic grader.

4. Run it. Record the baseline score.

5. Run it again after every significant model or prompt change.

That is a minimal viable eval. It is not sophisticated. It is enough to catch the regressions that actually matter.
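
The five steps above fit in a few dozen lines. The sketch below is one possible shape, not a reference implementation. The example fields (`id`, `input`, `expected`), the function names, and the `model` callable are all assumptions; `model` stands in for your own inference call.

```python
import json
from typing import Callable, Dict, List


def exact_match(output: str, expected: str) -> bool:
    """Step 3: a string grader."""
    return output.strip() == expected.strip()


def run_eval(
    examples: List[Dict[str, str]],
    model: Callable[[str], str],
    grader: Callable[[str, str], bool],
) -> Dict:
    """Step 4: run every example through the model and grade the output."""
    if not examples:
        raise ValueError("eval set is empty")
    results = []
    for ex in examples:
        output = model(ex["input"])
        results.append({"id": ex["id"], "passed": grader(output, ex["expected"])})
    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "n": len(results), "results": results}


def save_baseline(report: Dict, path: str) -> None:
    """Record the report so later runs (step 5) have something to compare to."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)


# Steps 1 and 2: real inputs with labelled expected outputs (invented here).
examples = [
    {"id": "1", "input": "2+2", "expected": "4"},
    {"id": "2", "input": "capital of France", "expected": "Paris"},
]
# Stand-in for a real model call, for demonstration only.
canned = {"2+2": "4", "capital of France": "Lyon"}
report = run_eval(examples, lambda prompt: canned[prompt], exact_match)
print(report["score"])  # 0.5
```

Keeping the per-example results in the report, not just the score, is what makes later regression analysis possible.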

Sophistication comes later. The baseline comes first.

---

Summary

Benchmarks are not evals. They are inputs to a judgement about capability.
Contamination and label errors are widespread and material. MMLU, HellaSwag, and GSM8K are all affected.
Evals serve three purposes: capability assessment, task fitness, and regression detection. Build for all three.
A useful eval pipeline has: a precise task definition, representative inputs, the simplest grader that captures the failure mode you care about, versioning, and CI integration.
The frameworks that enforce this discipline: lm-evaluation-harness (benchmarking), HELM (holistic assessment), OpenAI Evals (CI/CD).
If you have not run an eval on your production task, you do not know if your system works. You have a belief.

Beliefs are not engineering.

---

References

1. Deng, C., et al. (2024). "Investigating Data Contamination in Modern Benchmarks for Large Language Models." Proceedings of NAACL 2024. https://aclanthology.org/anthology-files/anthology-files/pdf/naacl/2024.naacl-long.482.pdf

2. Gema, A.P., et al. (2025). "Are We Done with MMLU?" Proceedings of NAACL 2025. https://aclanthology.org/2025.naacl-long.262.pdf

3. Singh, A.K., et al. (2024). "Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?" arXiv:2411.03923. https://arxiv.org/html/2411.03923

4. Biderman, S., et al. (2024). "Lessons from the Trenches on Reproducible Evaluation of Language Models." arXiv:2405.14782. https://arxiv.org/pdf/2405.14782

5. EleutherAI. (2023). "Evaluating LLMs: lm-evaluation-harness." EleutherAI. https://www.eleuther.ai/projects/large-language-model-evaluation

6. Liang, P., et al. (2022). "HELM: Holistic Evaluation of Language Models." Stanford CRFM. arXiv:2211.09110. https://arxiv.org/abs/2211.09110

7. Myrzakhan, A., et al. (2025). "Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation." Proceedings of eval4nlp at EMNLP 2025. https://aclanthology.org/2025.eval4nlp-1.3.pdf

8. Ravaut, M., et al. (2024). "How Much Are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library." arXiv:2404.00699. https://arxiv.org/html/2404.00699v1

9. ICML 2024 Tutorial. "Challenges in LM Evaluation." https://lm-evaluation-challenges.github.io/%5BMain%5D%20ICML%20Tutorial%202024%20-%20Challenges%20in%20LM%20Evaluation.pdf

10. Li, Y., et al. (2024). "Open-Source Data Contamination Report for Large Language Models." Findings of EMNLP 2024. https://aclanthology.org/2024.findings-emnlp.30.pdf