Shipping a model is not finishing. It is starting.
The moment a model is deployed, the question that matters is not "how did it score on MMLU?" — it is "does it work, reliably, on the actual inputs your users send?" Those are different questions. Conflating them is how teams end up confident in benchmarks that measure nothing they care about.
Evals are not optional. They are the only honest feedback loop available to an AI engineer.
---
An eval is a reproducible measurement of model behaviour on a defined input distribution.
Three words matter: reproducible, defined, behaviour.
A benchmark score is an eval of someone else's defined distribution. It may or may not overlap with yours. Importing a number from a leaderboard and calling it your eval is not evaluation. It is wishful thinking.
---
Every widely-used benchmark is partially contaminated.
This is not speculation. It is measured.
MMLU, the most-cited benchmark in model releases, shows measured overlap of 0.52–0.69 with common training corpora (The Pile, C4). After filtering leaked instances in STEM subjects, accuracy drops by up to 8 percentage points. The Virology subset has a 57% error rate from incorrect ground truth labels and scraping artifacts, a problem independent of contamination.
HellaSwag shows contamination signals across every detection method applied. Approximately 5% of GSM8K items contain serious errors. XSUM reference summaries are rated worse than the model outputs they are supposed to grade.
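One common family of contamination checks looks for verbatim n-gram overlap between a benchmark item and training text. The sketch below is a minimal illustration of that idea, not the method used in any of the studies cited here. The function names, the whitespace tokenizer, and the default window size are assumptions chosen for brevity.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text` (whitespace tokenizer)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that also appear somewhere in the corpus.

    1.0 means every n-gram of the item was found verbatim; 0.0 means none was.
    Items shorter than `n` tokens have no n-grams and score 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high ratio on a sample of your own eval items against data you train on is a reason to investigate, not a verdict. Paraphrased leaks will not show up in an exact-match check like this one.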
The benchmarks were not built to be gamed. They were built as proxies. But once a proxy is used for optimization, it ceases to measure what it was a proxy for. This is Goodhart's Law, applied to intelligence measurement.
New contamination-free alternatives are emerging — MMLU-CF, MMLU-Redux, MMLU-Pro — but the structural problem persists: any static benchmark becomes stale the moment a model is trained near its distribution.
The conclusion is not that benchmarks are useless. The conclusion is that benchmarks are not evals. They are inputs to a judgement, not the judgement itself.
---
Not all evals serve the same goal. Conflating them creates confusion about what to measure and how.
What can this model do? Academic benchmarks (MMLU, GPQA Diamond, HumanEval, MATH) are appropriate here — with the caveat that contamination limits absolute claims. Their value is relative: is this checkpoint better than the previous one?
Does this model perform on my specific task? This requires task-specific evals you build. A model that scores 80% on MMLU may fail catastrophically on your structured extraction pipeline. You do not know until you measure.
Did this change make things worse? CI-integrated evals catch regressions before they reach production. This is the most actionable evaluation type and the most neglected.
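A regression eval reduces to comparing fresh scores against a recorded baseline and failing the build when something drops. The sketch below assumes scores are stored as a mapping from eval name to a value in [0, 1]. The function name and the 0.02 tolerance are hypothetical and should be tuned to the noise level of your own evals.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return the names of evals whose score fell more than `tolerance` below baseline.

    An eval that is present in the baseline but missing from the current run
    is also reported, since a silently skipped eval hides regressions.
    """
    regressions = []
    for name, base_score in baseline.items():
        new_score = current.get(name)
        if new_score is None or new_score < base_score - tolerance:
            regressions.append(name)
    return sorted(regressions)
```

In a CI job, a non-empty result would typically be printed and turned into a non-zero exit code so the change cannot merge unnoticed.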
---
What is the model supposed to do? What counts as correct? If you cannot write a grader, your task is underspecified. Underspecified tasks produce unmeasurable evals.
Evaluate on representative inputs: not cherry-picked inputs, not the happy path, but the actual distribution of requests, including the edge cases, the ambiguous cases, and the adversarial inputs your users will eventually send.
At minimum: 50–100 examples for a coarse signal. 500+ for anything you are making a production decision on.
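The sample-size guidance follows from how wide the uncertainty on an accuracy figure is at small n. The sketch below uses the Wilson score interval for a binomial proportion. Treating each example as an independent pass/fail trial is an assumption, and the function name is illustrative.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an accuracy of `correct / total`.

    `z = 1.96` corresponds to an approximate 95% confidence level.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width
```

Under these assumptions, 40 correct out of 50 gives an interval of roughly 0.67 to 0.89, while 400 out of 500 gives roughly 0.76 to 0.83. The same 80% accuracy is a coarse signal in the first case and a usable measurement in the second.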
Three grader types exist: string matching, programmatic checks (for example, a Python function), and model-based grading.
Use the simplest grader that captures the failure mode you care about. Model graders are not a substitute for precision.
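The difference between grader types is easiest to see side by side. The sketch below shows one possible shape for a string grader, a programmatic grader, and a model grader. All three function names are hypothetical, and `judge` stands in for whatever call reaches your judge model; it is not a real library API.

```python
import json
from typing import Callable


def string_grader(output: str, expected: str) -> bool:
    """Pass when the output matches the expected string, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()


def json_fields_grader(output: str, required_fields: set[str]) -> bool:
    """Programmatic check: output must be a JSON object containing every required field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_fields.issubset(parsed.keys())


def model_grader(output: str, rubric: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model for a PASS/FAIL verdict against a rubric.

    `judge` is any function that takes a prompt and returns the judge's text reply.
    """
    prompt = f"Rubric:\n{rubric}\n\nAnswer:\n{output}\n\nReply with PASS or FAIL."
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("PASS")
```

The first two are deterministic and cheap, which is why they are preferable whenever they can express the failure mode. The third inherits whatever noise and bias the judge has.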
Version everything: the eval dataset, the grader, the model, and the prompt template. If any of these changes without documentation, comparative results are worthless. lm-evaluation-harness encodes this discipline: task versioning is first-class.
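One lightweight way to make undocumented changes visible is to store a fingerprint of all four components next to every score. The sketch below is an illustration under assumptions: the function name, the choice of SHA-256, and the 16-character truncation are arbitrary, and it is not how lm-evaluation-harness implements task versioning.

```python
import hashlib
import json


def eval_fingerprint(
    dataset_rows: list[dict],
    grader_source: str,
    model_id: str,
    prompt_template: str,
) -> str:
    """Short, stable hash over everything that defines an eval run.

    Two scores are only comparable when their fingerprints match, or when
    the difference between the runs has been written down.
    """
    payload = json.dumps(
        {
            "dataset": dataset_rows,
            "grader": grader_source,
            "model": model_id,
            "prompt": prompt_template,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

If only the model is meant to change between two runs, hash the other three components separately so that a silent prompt or dataset edit still shows up.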
Evals that only run before a release catch release-time regressions. Evals that run on every significant change catch the actual regressions — the ones introduced three commits ago by a prompt tweak you thought was harmless.
---
Three frameworks dominate open-source evaluation:
lm-evaluation-harness (EleutherAI). The de facto standard for academic benchmarking. Unified interface across 200+ tasks. Task versioning built in. Integrates with vLLM and HuggingFace TGI as backends. If you are comparing models or reporting numbers publicly, use this. Reproducibility is enforced by design.
HELM (Holistic Evaluation of Language Models). Developed at Stanford. Broader than lm-eval: it benchmarks not just accuracy but also calibration, robustness, fairness, efficiency, and bias. Useful when the question is "how does this model behave across a wide surface?" rather than "does it beat the previous checkpoint on one task?"
The third framework is designed for CI/CD integration and production monitoring. It supports string, model, and Python graders, and is best suited for task-specific pipelines where the goal is regression detection, not academic ranking.
These frameworks are complementary, not competing. A rigorous shop uses all three at different stages.
---
Do not report a single number. Models are not scalar. A single accuracy figure hides failure modes on subsets that matter. Disaggregate by domain, difficulty, and input type.
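Disaggregation is a small amount of code once each result row carries its metadata. The sketch below assumes every row is a dict with a boolean `correct` field plus whatever slice keys you record. The field names and the function name are illustrative.

```python
from collections import defaultdict


def accuracy_by_slice(results: list[dict], key: str) -> dict[str, float]:
    """Group graded results by `row[key]` and return accuracy per group."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for row in results:
        slice_name = row[key]
        totals[slice_name] += 1
        correct[slice_name] += int(row["correct"])
    return {name: correct[name] / totals[name] for name in totals}
```

For example, four rows with one failure score 75% overall, but if the failure sits in a two-row `legal` slice, that slice is at 50% while the rest is at 100%. The single number hides exactly the subset that needs attention.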
Do not evaluate only on your training distribution. If your eval set was assembled by the same process that generated your training data, you are measuring memorization.
Do not rely on model-graded evals as your only signal. Models tuned with RLHF are optimized to produce text that judges score highly, and that includes model judges. Human spot-checks are not optional for critical tasks.
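A human spot-check becomes a measurement when you compare the human verdicts with the judge's verdicts on the same items. The sketch below computes Cohen's kappa for binary pass/fail labels. Choosing kappa over raw agreement is a design choice made here for illustration, and the function name is hypothetical.

```python
def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between human and judge pass/fail labels.

    1.0 is perfect agreement; values near 0.0 mean the judge agrees with the
    human no more often than chance would predict.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one identical label to every item.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because most outputs pass. Kappa discounts that, which makes a weak judge harder to miss.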
Do not treat absence of regression as presence of quality. Not regressing means you did not get worse. It does not mean you are good.
Do not skip evals because the task is hard to grade. If the task is hard to grade, the model's outputs are also hard to trust. The difficulty of the grader is a signal about the reliability of the deployment.
---
The engineering framing is more useful than the research framing.
In research, evals answer: "Is this model good?"
In engineering, evals answer: "Is this system doing what it is supposed to do, and did this change break anything?"
That is a question every production system must answer continuously. A model in production is a component with a contract. The contract is: given these inputs, produce these outputs, within these quality bounds. Evals are how you verify the contract is being met.
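The contract framing can be written down literally as a set of absolute quality bounds that measured metrics must satisfy. The sketch below is one possible encoding. The `QualityBound` type, the function name, and the metric names in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityBound:
    """One clause of the contract: `metric` must be at least `minimum`."""
    metric: str
    minimum: float


def contract_violations(bounds: list[QualityBound], measured: dict[str, float]) -> list[str]:
    """Return a human-readable message for every bound that is not met.

    A metric that was never measured counts as a violation, because an
    unverified clause is not a satisfied one.
    """
    violations = []
    for bound in bounds:
        value = measured.get(bound.metric)
        if value is None:
            violations.append(f"{bound.metric}: not measured")
        elif value < bound.minimum:
            violations.append(f"{bound.metric}: {value:.3f} < {bound.minimum:.3f}")
    return violations
```

A contract might require `exact_match` of at least 0.90 and `valid_json_rate` of at least 0.99. Unlike a regression check, these bounds are absolute, so a system that never regresses but was never good enough still fails.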
Teams that skip evals are not moving faster. They are accumulating debt that will be paid in production incidents, user complaints, and silent degradations they have no instrumentation to detect.
---
If you have nothing today, start here:
1. Pick 50 real inputs from your production logs or expected use case.
2. Label the expected output or outcome for each.
3. Write a string or programmatic grader.
4. Run it. Record the baseline score.
5. Run it again after every significant model or prompt change.
That is a minimal viable eval. It is not sophisticated. It is enough to catch the regressions that actually matter.
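The five steps fit in a few dozen lines. The sketch below assumes each example is a dict with `input` and `expected` keys and that `model_fn` wraps whatever call reaches your model. Those names, the exact-match grader, and the JSON baseline file are illustrative choices, not a prescribed format.

```python
import json
from pathlib import Path
from typing import Callable


def exact_match(output: str, expected: str) -> bool:
    """String grader: pass when output equals expected, ignoring outer whitespace."""
    return output.strip() == expected.strip()


def run_eval(
    examples: list[dict],
    model_fn: Callable[[str], str],
    grader: Callable[[str, str], bool] = exact_match,
) -> dict:
    """Run every example through the model, grade it, and summarize the result."""
    outcomes = [grader(model_fn(ex["input"]), ex["expected"]) for ex in examples]
    accuracy = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return {"n": len(outcomes), "accuracy": accuracy}


def save_baseline(result: dict, path: str) -> None:
    """Record the score so later runs have something to be compared against."""
    Path(path).write_text(json.dumps(result, indent=2), encoding="utf-8")
```

Run it once, save the result, and rerun after each significant model or prompt change. Everything else in an eval stack is a refinement of this loop.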
Sophistication comes later. The baseline comes first.
---
Beliefs are not engineering.
---
1. Deng, C., et al. (2024). "Investigating Data Contamination in Modern Benchmarks for Large Language Models." Proceedings of NAACL 2024. https://aclanthology.org/anthology-files/anthology-files/pdf/naacl/2024.naacl-long.482.pdf
2. Gema, A. P., et al. (2025). "Are We Done with MMLU?" Proceedings of NAACL 2025. https://aclanthology.org/2025.naacl-long.262.pdf
3. Singh, A. K., et al. (2024). "Evaluation Data Contamination in LLMs: How Do We Measure It and (When) Does It Matter?" arXiv:2411.03923. https://arxiv.org/html/2411.03923
4. Biderman, S., et al. (2024). "Lessons from the Trenches on Reproducible Evaluation of Language Models." arXiv:2405.14782. https://arxiv.org/pdf/2405.14782
5. EleutherAI. (2023). "Evaluating LLMs: lm-evaluation-harness." EleutherAI. https://www.eleuther.ai/projects/large-language-model-evaluation
6. Liang, P., et al. (2022). "HELM: Holistic Evaluation of Language Models." Stanford CRFM. arXiv:2211.09110. https://arxiv.org/abs/2211.09110
7. Myrzakhan, A., et al. (2025). "Simulating Training Data Leakage in Multiple-Choice Benchmarks for LLM Evaluation." Proceedings of eval4nlp at EMNLP 2025. https://aclanthology.org/2025.eval4nlp-1.3.pdf
8. Ravaut, M., et al. (2024). "How Much Are LLMs Contaminated? A Comprehensive Survey and the LLMSanitize Library." arXiv:2404.00699. https://arxiv.org/html/2404.00699v1
9. ICML 2024 Tutorial. "Challenges in LM Evaluation." https://lm-evaluation-challenges.github.io/%5BMain%5D%20ICML%20Tutorial%202024%20-%20Challenges%20in%20LM%20Evaluation.pdf
10. Li, Y., et al. (2024). "An Open-Source Data Contamination Report for Large Language Models." Findings of EMNLP 2024. https://aclanthology.org/2024.findings-emnlp.30.pdf