Top LLMs within 0.7% gap
- Frontier LLMs are now bunched so tightly on old headline tests that MMLU, GSM8K, and HumanEval barely separate the top commercial models anymore.
- Newer work sharpened the point this week: ProgramBench’s 200-task clean-room coding test had no model fully solve any task end to end.
- That shifts model buying away from leaderboard decimals and toward workload-specific evals on tools, latency, cost, reliability, and failure modes.
The benchmark story in AI has changed. Not because models stopped improving, but because the old tests stopped telling you much. On familiar headline benchmarks like MMLU, GSM8K, and HumanEval, frontier models are now packed near the ceiling, which makes tiny score differences look bigger than they are. And this week’s new evidence pushed the point harder: ProgramBench, a tougher software benchmark released on May 5, 2026, found that none of nine tested models fully resolved any task. (lxt.ai)

### Why are the top models suddenly “tied”?

They are not literally tied. They are benchmark-tied. Old tests like MMLU, GSM8K, and HumanEval were useful when models were far apart. But once most frontier systems score in the 90s, the remaining spread gets so small that it stops being a good proxy for real capability differences. Several 2026 benchmark trackers now treat those tests as saturated because they no longer separate the leaders cleanly. (lxt.ai)

### What does “saturated” actually mean?

It means the benchmark has lost resolution at the top. If five models all get almost everything right, a 0.5-point edge can come from prompting choices, contamination, formatting quirks, or simple statistical noise rather than a meaningful jump in usefulness. A February 2026 saturation study looked across 60 LLM benchmarks and found nearly half already showed a plateau. (arxiv.org)

### So what changed this week?

ProgramBench gave people a cleaner example of the gap between benchmark glory and real work. The benchmark asks an agent to rebuild full software projects from an executable and documentation, not just patch a bug or complete a function. It spans 200 tasks, from small command-line tools to things like SQLite, FFmpeg, and the PHP interpreter. None of the nine evaluated language models fully resolved any task, and the best model passed 95% of tests on only 3% of tasks. (arxiv.org)

### Why is that a harsher test?

Because whole-project coding is architecture, not autocomplete.
A model has to choose languages, structure files, handle errors, wire build systems, and keep decisions consistent across a codebase. HumanEval mostly checks whether a model can write a function. ProgramBench checks whether a model can act like a software engineer over a longer horizon. Those are related skills, but not the same skill. (arxiv.org)

### Does that mean the old benchmarks are useless?

No. They still measure something real. MMLU still says something about broad knowledge. GSM8K still says something about school-math reasoning. HumanEval still says something about code synthesis. The catch is that they are now better for tracking historical progress than for choosing between the very best current models. Epoch AI’s 2026 benchmark work favors composite and harder frontier evaluations rather than just the old headline trio. (epoch.ai)

### What should engineers compare instead?

Use evaluations that look like your workload. If you need coding agents, test repository-scale tasks. If you need customer support, test policy adherence, tone, latency, and recovery from bad context. If you need tool use, run tool-use benchmarks and your own traces. Price, speed, context handling, and reliability under repetition often matter more in production than a tiny benchmark edge. (openreview.net)

### Why does this matter for buyers?

Because the market still loves a leaderboard screenshot. When public benchmarks compress, small deltas get turned into big marketing claims. But for teams spending real money, the practical question is no longer “Which model won MMLU?” It is “Which model fails least expensively on my exact task?” That is a much less glamorous question, and a much more useful one. (arxiv.org)

### Bottom line

The frontier did not flatten. The measuring sticks did. Old benchmarks still show that LLMs got very good. They just no longer tell you, with much confidence, which top model is best for the job in front of you. (lxt.ai)
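
### How much of a small gap can be noise?

To put a rough number on the statistical-noise point from the saturation section, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any benchmark tracker cited here computes things: it treats each benchmark item as an independent pass/fail trial (a binomial model), treats the two models' scores as independent, and uses hypothetical scores of 92.0% and 92.5%. The function names are made up for this example. The only outside fact it relies on is that HumanEval contains 164 problems.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def gap_margin(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% margin for the gap between two accuracy scores.

    Assumption: the two scores are treated as independent samples.
    A paired test on per-item results would often give a tighter margin.
    """
    se_a = accuracy_standard_error(acc_a, n_items)
    se_b = accuracy_standard_error(acc_b, n_items)
    return z * math.sqrt(se_a ** 2 + se_b ** 2)


def gap_is_within_noise(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> bool:
    """True when the observed gap is smaller than the approximate margin."""
    return abs(acc_a - acc_b) < gap_margin(acc_a, acc_b, n_items, z)


if __name__ == "__main__":
    n = 164  # number of problems in HumanEval
    a, b = 0.920, 0.925  # hypothetical scores for two frontier models
    print(f"gap:    {abs(a - b):.3f}")
    print(f"margin: {gap_margin(a, b, n):.3f}")
    print(f"within noise: {gap_is_within_noise(a, b, n)}")
```

Under these assumptions, the margin on a 164-item benchmark is several percentage points wide, so a half-point edge sits well inside it. Larger test sets shrink the margin, but prompting choices and contamination are separate sources of error that this sketch does not model.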