Agent benchmarks jump — still uneven

Stanford’s AI Index finds agents’ OSWorld task success rose from roughly 12% to about 66%, a major increase that still implies about one failure in every three attempts. (digit.in) (spectrum.ieee.org)

Artificial intelligence agents that use a computer like a person now finish about two-thirds of OSWorld tasks, up from about one in eight a year earlier. (hai.stanford.edu)

Stanford’s 2026 Artificial Intelligence Index, released April 13, said agent success on OSWorld rose from 12.24% in the benchmark’s 2024 paper to roughly 66% in the latest results it tracks. The report also said that still leaves agents failing about one out of every three attempts on a structured test. (hai.stanford.edu)

OSWorld is a test bed for “computer-use” agents: systems that look at a screen, click buttons, type, move files, and switch between apps to complete full tasks. The benchmark’s creators published it in April 2024 with 369 tasks across Ubuntu, Windows, and macOS, drawn from ordinary desktop and web workflows. (arxiv.org) (os-world.github.io)

The original OSWorld paper found humans completed 72.36% of the tasks while the best model at the time managed 12.24%. That gap made the benchmark a useful way to test whether an agent could carry out a long sequence of actions rather than just answer a question. (arxiv.org)

The jump in scores lands as companies are pushing agents that can book travel, update spreadsheets, handle customer-service screens, and operate business software without custom integrations. Stanford’s report says evaluation is becoming more important as AI systems move from chat windows into workplaces, classrooms, and public services. (hai.stanford.edu 1) (hai.stanford.edu 2)

The benchmark still measures a cleaned, repeatable environment, not a messy live office desktop with surprise pop-ups, policy checks, and changing interfaces. OSWorld’s own site says it was updated in July 2025 as “OSWorld-Verified,” with fixed examples, faster evaluation, and revised benchmark results. (os-world.github.io)

Researchers are also finding that reliability depends on how agents are run, not just which model they use. A February 2026 paper from Simular Research reported 72.6% on OSWorld by generating multiple attempts in parallel and selecting the best behavior, slightly above the 72.36% human result reported in the original benchmark. (arxiv.org 1) (arxiv.org 2)

That means the headline number mixes two realities at once: agents are much better at operating software than they were in 2024, and they are still uneven enough that repeated tries, judging systems, or human oversight can change the outcome. (hai.stanford.edu) (arxiv.org)
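
To see why repeated tries can move the number, here is a rough back-of-the-envelope sketch in Python. It is not the method used in any of the papers cited above. It assumes, purely for illustration, that each attempt succeeds independently with the same probability and that a selector always picks a successful attempt when one exists. Real attempts tend to fail on the same tasks and real judges make mistakes, so the result is an optimistic ceiling. The function name `best_of_n_success` and the 0.66 input are illustrative choices, with 0.66 standing in for the roughly 66% single-attempt figure.

```python
def best_of_n_success(p_single: float, n_attempts: int) -> float:
    """Chance that at least one of n attempts succeeds.

    Assumption (illustrative only): attempts are independent and a
    perfect selector always picks a successful attempt if one exists.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_attempts < 1:
        raise ValueError("n_attempts must be at least 1")
    # All n attempts fail with probability (1 - p)^n; subtract from 1.
    return 1.0 - (1.0 - p_single) ** n_attempts


if __name__ == "__main__":
    p = 0.66  # approximate single-attempt success rate, used as an example
    for n in (1, 2, 3, 5):
        print(f"{n} attempt(s): {best_of_n_success(p, n):.1%}")
```

Under those idealized assumptions, even a second attempt would cut the failure rate sharply. How far real systems fall short of that ceiling is one way to read how much the failures of separate attempts overlap and how often the selection step picks wrong.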
