Agent benchmarks show uneven gains — some suites post only low double‑digit (~12%) improvements
- Official and public agent benchmarks now show a wide spread rather than one clean leap: SWE-bench Verified leaders score 76.8%, GAIA leaders top 92.36%, and WebArena reports remain far lower.
- The sharpest contrast is between coding and web-task suites: SWE-bench’s official leaderboard shows top systems around the mid-70s, while WebArena reports cited by labs and trackers cluster from the teens to low-70s.
- Enterprise rollout still trails benchmark gains: McKinsey said no more than 10% of firms are scaling agents in any single function, while PwC found broad use often stops at routine productivity boosts. (mckinsey.com) (pwc.com)

An AI agent benchmark is a test for software that can plan, click, search, code, or use tools on its own. The latest public scoreboards show those agents improving unevenly across different kinds of tasks. (webarena.dev) (swebench.com)

The cleanest gains are in coding. The official SWE-bench Verified leaderboard shows Claude 4.5 Opus at 76.8% resolved, with Gemini 3 Flash and MiniMax M2.5 at 75.8% and GPT-5-2 Codex at 72.8%. (swebench.com)

A different benchmark, GAIA, tests broader assistant work such as reasoning, search, and tool use across more than 450 questions. Its public leaderboard shows top systems above 90%, with OPS-Agentic-Search at 92.36% on March 11, 2026. (huggingface.co)

WebArena is tougher to compare at a glance because it measures agents navigating live-style websites to finish multi-step tasks. The project describes it as a realistic web environment, and public result trackers show a much wider spread there than on coding leaderboards. (webarena.dev) (leaderboard.steel.dev)

That spread is the point. A model that patches software bugs at 70%-plus on SWE-bench can still struggle much more on browser tasks, where the agent has to read pages, choose actions, recover from mistakes, and handle changing site state. (swebench.com) (arxiv.org)

The benchmark names also hide different scoring rules. SWE-bench Verified is a human-filtered subset of 500 software issues, while GAIA reports the best run on a test set and WebArena uses task completion in a self-hosted web environment. (swebench.com) (huggingface.co) (github.com)

That makes headline percentage jumps hard to compare across suites. A 12-point gain on one benchmark may reflect a very different task mix, evaluator, and failure mode than a 12-point gain on another. (github.com) (huggingface.co) (swebench.com)

Company surveys show the same gap between lab scores and daily work.
McKinsey’s 2025 State of AI survey found 23% of organizations scaling an agentic AI system in at least one function, but in any given function no more than 10% reported scaling agents. (mckinsey.com)

PwC’s May 2025 survey found 79% of companies said AI agents were being adopted, yet 68% said half or fewer employees interacted with them in everyday work. PwC said many deployments are still embedded helpers that speed up routine tasks rather than rewiring whole workflows. (pwc.com)

The result is a market where scoreboards can look strong and deployment can still look shallow. The newest agent numbers show real progress, but they also show that “agent performance” depends heavily on which job the agent is actually asked to do. (mckinsey.com) (pwc.com)
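
For readers who want to see why equal point gains are hard to compare across suites, here is a minimal Python sketch of the arithmetic. It is an illustration, not anything the leaderboards publish: the function name `describe_gain` is made up for this example, the "after" scores are the leader figures quoted above, and the "before" scores are hypothetical baselines set exactly 12 points lower.

```python
def describe_gain(before_pct: float, after_pct: float) -> dict[str, float]:
    """Express one benchmark score change in three different ways."""
    if not 0 < before_pct < 100 or not 0 <= after_pct <= 100:
        raise ValueError("scores must be percentages, with 0 < before_pct < 100")
    point_gain = after_pct - before_pct
    return {
        # Raw difference in percentage points (the usual headline number).
        "point_gain": point_gain,
        # Gain relative to the starting score.
        "relative_gain_pct": 100 * point_gain / before_pct,
        # Share of previously failed tasks that are now passed.
        "failure_cut_pct": 100 * point_gain / (100 - before_pct),
    }


# "After" values are leader scores quoted in the article.
# "Before" values are HYPOTHETICAL baselines, chosen to sit 12 points lower.
examples = {
    "SWE-bench Verified": (64.8, 76.8),
    "GAIA": (80.36, 92.36),
}

for suite, (before, after) in examples.items():
    gain = describe_gain(before, after)
    print(
        f"{suite}: +{gain['point_gain']:.1f} points, "
        f"{gain['relative_gain_pct']:.1f}% relative, "
        f"{gain['failure_cut_pct']:.1f}% of prior failures removed"
    )
```

Under these assumed baselines, the same 12-point gain removes roughly a third of prior failures in the first case and about three fifths in the second. Even that comparison covers only the arithmetic; it does not account for the differences in task mix and evaluators described above.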