Benchmarks losing bite
- Researchers say standard benchmarks now cluster top models so leaderboards fail to show real differences.
- Stanford's annual report notes most major models score above 88% on MMLU, compressing leaderboard gaps.
- Yet those same models still fail roughly one-third of real-world tasks, prompting debate about benchmark usefulness. (x.com)
Artificial intelligence benchmarks are starting to bunch the best models together, making old leaderboards less useful for telling them apart. (hai.stanford.edu)

A benchmark is a standardized test for models, like a final exam with the same questions for every system. One of the best-known, Massive Multitask Language Understanding, or MMLU, uses multiple-choice questions across 57 subjects including math, history, law, and computer science. (crfm.stanford.edu)

Stanford’s 2026 Artificial Intelligence Index, published April 13, said evaluation is getting “increasingly difficult to rely on” as models improve faster than the tests built to measure them. Stanford’s companion summary said top systems are now separated by “razor-thin margins.” (hai.stanford.edu 1) (hai.stanford.edu 2)

The compression shows up on familiar leaderboards. Stanford’s Holistic Evaluation of Language Models page lists Claude 3.5 Sonnet at 87.3% on MMLU, DeepSeek v3 at 87.2%, Gemini 1.5 Pro at 86.9%, and GPT-4o at 84.3%, a narrow spread for models from different labs. (crfm.stanford.edu)

Stanford made the same point in last year’s index, saying traditional tests such as MMLU, GSM8K, and HumanEval were already hitting saturation. The 2025 report said the gap between the top two models on Chatbot Arena shrank from 4.9% in 2023 to 0.7% in 2024. (hai.stanford.edu)

Researchers are responding by building harder exams. The 2025 Stanford report highlighted Humanity’s Last Exam, where the top system scored 8.8%, FrontierMath, where systems solved 2% of problems, and BigCodeBench, where models reached 35.5% against a human standard of 97%. (hai.stanford.edu)

Those newer tests are aimed at tasks closer to work than quiz bowls. Humanity’s Last Exam was introduced in January 2025 after its authors wrote that models were already scoring above 90% on popular benchmarks like MMLU, limiting measurement at the frontier. (arxiv.org)

Real-world agent benchmarks still show wider gaps. Stanford’s April 13 summary said success rates on Terminal-Bench, which uses 89 terminal tasks drawn from software engineering, machine learning, security, and system administration workflows, rose from 20% in 2025 to 77.3% in 2026. (hai.stanford.edu) (arxiv.org)

Other real-world tests remain harder. Tau-bench, introduced in June 2024 to measure agents handling dynamic conversations, tools, and domain rules in settings like retail and airlines, was built because its authors said existing benchmarks did not test those deployment conditions. (arxiv.org)

The debate now is less about whether models can ace old exams than about what counts as a useful exam. Stanford’s 2026 report says the measurement problem is no longer just tracking capability gains, but building tests that stay hard long enough to show who is actually better. (hai.stanford.edu)