Agent Benchmarks Lag

CocoaBench v1.0, a new benchmark for unified digital agents, reports that top agents reach only 45.1% success on open-world tasks that combine coding, vision, and search. (x.com) The benchmark also reports a sizable performance gap between frontier proprietary models and open-source alternatives. (x.com)

A digital agent is software that can browse websites, read screens, search for facts, run code, and submit an answer. A new benchmark called CocoaBench says the best tested system still solved only 45.1% of those mixed tasks. (arxiv.org) The benchmark was released as CocoaBench v1.0 on April 7, 2026, and the paper was posted to arXiv on April 13, 2026. The dataset has 153 human-authored, long-horizon tasks that combine vision, search, and coding in one workflow. (cocoabench.github.io) (arxiv.org)

In plain terms, CocoaBench tests whether one agent can do the kind of chained work a person does on a laptop: inspect a screen, gather information from the web, use a terminal, and return a final result. The project says each task is defined by an instruction and an automatic evaluation script, with no large language model acting as the judge. (cocoabench.github.io)

That setup differs from benchmarks such as software-only coding tests or question-answer sets with fixed prompts. The CocoaBench authors wrote that many existing evaluations test these abilities in isolation, not in the combined form used by newer agent products. (arxiv.org) (cocoabench.github.io)

The researchers also released CocoaAgent, a shared scaffold meant to hold the tool setup constant while swapping model backbones underneath. The GitHub repository says the framework equips agents with a browser, terminal, file operations, and a code interpreter through Sandbox-AIO. (cocoabench.github.io) (github.com)

The project site says the weak spots were reasoning and planning, tool use and execution, and visual grounding, which is the step where a model has to correctly tie text instructions to what it sees on screen. The example tasks published on the site include a hidden-goal eight-puzzle, a nutrition shopping task, and other jobs that require switching between interfaces and calculations. (cocoabench.github.io)

CocoaBench’s public leaderboard page still shows an older v0.1 table with 25 tasks rather than the new v1.0 release. That v0.1 board lists ChatGPT Agent at 44%, Gemini-3 Pro Thinking at 36%, GPT-5.1 extended thinking at 32%, Claude-Opus-4.5 extended at 28%, OpenAI Deep Research at 20%, and Gemini-3 Pro Thinking plus DeepResearch at 16%. (cocoabench.github.io)

The v1.0 paper and site say there is a sizable gap between frontier proprietary systems and open-source alternatives, but the snippets now publicly visible on the project pages do not list the full v1.0 model-by-model table. What they do show is the headline result: even the top system failed more than half the time on tasks designed to look like real computer work. (arxiv.org) (cocoabench.github.io)

That leaves CocoaBench less as a victory lap than a stress test. On a benchmark built to mimic ordinary screen-and-browser work, the ceiling is still below half. (arxiv.org)
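
For a sense of scale, the reported percentages can be turned back into approximate task counts. The short Python sketch below is illustrative only. It assumes each score is simply the number of tasks solved divided by the number of tasks in the set, which the project pages quoted here do not spell out, and the helper name `implied_solved` is made up for this example.

```python
# Assumption (not stated in the source): score = tasks solved / tasks in the set.
V01_TASKS = 25    # size of the older v0.1 leaderboard set
V10_TASKS = 153   # size of the v1.0 dataset

# Percentages as listed on the public v0.1 leaderboard.
v01_scores = {
    "ChatGPT Agent": 44,
    "Gemini-3 Pro Thinking": 36,
    "GPT-5.1 extended thinking": 32,
    "Claude-Opus-4.5 extended": 28,
    "OpenAI Deep Research": 20,
    "Gemini-3 Pro Thinking + DeepResearch": 16,
}


def implied_solved(percent: float, total: int) -> int:
    """Hypothetical helper: nearest whole task count for a reported percentage."""
    return round(percent * total / 100)


v01_solved = {
    name: implied_solved(pct, V01_TASKS) for name, pct in v01_scores.items()
}
v10_top_solved = implied_solved(45.1, V10_TASKS)

for name, solved in v01_solved.items():
    print(f"{name}: about {solved} of {V01_TASKS} tasks")
print(f"v1.0 top system: about {v10_top_solved} of {V10_TASKS} tasks")
```

Under that assumption, the v0.1 leader's 44% works out to 11 of 25 tasks, and the 45.1% headline figure for v1.0 works out to about 69 of 153.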
