Agentic evaluation gaps emerge
Agentic systems that plan, use tools and recover from errors are exposing evaluation gaps that single-turn benchmarks miss, and a new leaderboard aims to measure them. (x.com) APEX-Agents-AA evaluates long-horizon tasks with real tool dependencies and error recovery, highlighting failures around persistence and workflow completion rather than just conversational quality. (x.com) Researchers note agents succeed on verifiable rewards but struggle on open-ended tasks, which means labs will need trace-level labels, failure taxonomies and harnesses for memory and disagreement quotas.
Most artificial intelligence tests still work like an oral exam: one prompt goes in, one answer comes out, and a grader checks the final sentence. Real agents work more like interns, because they open files, call tools, update state, and keep going for many turns before they are done. (anthropic.com)

That mismatch is why new agent benchmarks are suddenly exposing failures that chat benchmarks barely see. Anthropic’s engineering team wrote in January 2026 that agent evaluations have to measure many-turn behavior because mistakes can “propagate and compound” once a system starts acting in an environment. (anthropic.com)

The new benchmark getting attention is called the Artificial Intelligence Productivity Index for Agents, or APEX-Agents. It was posted to arXiv on January 20, 2026, and it tests whether agents can finish long, cross-application tasks designed by investment banking analysts, management consultants, and corporate lawyers. (arxiv.org)

Those tasks are not trivia questions. The APEX-Agents paper says agents have to navigate realistic work environments with files and tools, which turns the test from “can it talk about work” into “can it actually do the work.” (arxiv.org)

The original APEX-Agents release includes 480 tasks and scores systems with pass at one, which means a run counts only if it fully satisfies the rubric on the first try. In the paper’s own leaderboard, the top score was 24.0%, which means even the best tested system failed about three out of four tasks. (arxiv.org)

Artificial Analysis then built an independent version called APEX-Agents-AA using its Stirrup Agent Harness. Its public leaderboard says it evaluates 452 tasks from the APEX dataset and drops two worlds because those worlds depend on external application programming interfaces. (artificialanalysis.ai)

That independent leaderboard also changes the conversation from “which model sounds smartest” to “which agent actually finishes the workflow.” Artificial Analysis says its headline metric is still pass at one success rate, defined as the share of tasks where a model fully satisfies the grading rubric rather than earning partial credit across criteria. (artificialanalysis.ai)

As of April 2026, the top of APEX-Agents-AA is tightly packed rather than dominated by one runaway winner. Artificial Analysis lists GPT-5.4 at 33.3%, Claude Opus 4.6 at 33.0%, and Gemini 3.1 Pro Preview at 32.0%, which is better than the original paper’s 24.0% ceiling but still far from reliable completion. (artificialanalysis.ai)

The gap between chat quality and task completion shows up most clearly on long runs. OpenAI wrote in February 2026 that long-running agent work is less about one giant prompt and more about a loop of planning, editing, running tools, observing results, and repairing failures. (developers.openai.com)

Once you look at that loop, the weak spots are different from the ones older benchmarks measured. What breaks is often persistence, staying on task, keeping track of intermediate state, and recovering after a bad tool call, not basic fluency or polished wording. (developers.openai.com) (anthropic.com)

That is why researchers are pushing for more than a single final score. Anthropic says an evaluation harness has to record all the steps, run the tools, grade outputs, and aggregate results, because with agents you are evaluating the model and the scaffold around it together. (anthropic.com)

The next wave of testing will probably look less like a school exam and more like a flight recorder. If labs want agents that can work for hours instead of minutes, they will need traces of every step, labels for where runs went wrong, and benchmarks that punish dropped threads and unfinished workflows instead of rewarding a convincing final paragraph. (anthropic.com) (artificialanalysis.ai)
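To make the scoring idea concrete, here is a minimal Python sketch of all-or-nothing pass at one scoring over recorded runs, with a partial-credit average shown only for contrast. The `Step` and `Run` schemas, the tool names, and the sample data are all hypothetical illustrations; none of them come from the APEX-Agents paper, the Artificial Analysis harness, or Anthropic's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One recorded action in an agent run (hypothetical trace schema)."""
    tool: str
    ok: bool  # whether the tool call succeeded


@dataclass
class Run:
    """A single first-attempt run on one task (hypothetical schema)."""
    task_id: str
    steps: list[Step]
    criteria: dict[str, bool]  # rubric criterion -> satisfied?


def passed(run: Run) -> bool:
    # All-or-nothing: every rubric criterion must be satisfied.
    return bool(run.criteria) and all(run.criteria.values())


def pass_at_one(runs: list[Run]) -> float:
    # Share of tasks whose single run fully satisfies the rubric.
    return sum(passed(r) for r in runs) / len(runs)


def mean_partial_credit(runs: list[Run]) -> float:
    # Average fraction of criteria met per run (assumes non-empty rubrics).
    return sum(sum(r.criteria.values()) / len(r.criteria) for r in runs) / len(runs)


def first_failed_step(run: Run) -> Optional[int]:
    # Index of the first failed tool call, or None if every call succeeded.
    for i, step in enumerate(run.steps):
        if not step.ok:
            return i
    return None


# Made-up sample data, for illustration only.
runs = [
    Run("t1", [Step("open_file", True), Step("edit", True)],
        {"a": True, "b": True, "c": True}),
    Run("t2", [Step("open_file", True), Step("search", False), Step("edit", True)],
        {"a": True, "b": True, "c": True, "d": False}),
    Run("t3", [Step("open_file", False)],
        {"a": True, "b": False}),
    Run("t4", [Step("edit", True)],
        {"a": True, "b": True}),
]

print(pass_at_one(runs))           # 0.5
print(mean_partial_credit(runs))   # 0.8125
print(first_failed_step(runs[1]))  # 1
```

On this made-up data the partial-credit average is 0.8125 while the pass at one rate is 0.5, which illustrates how a score that rewards partly finished workflows can flatter an agent. The `first_failed_step` helper is one simple way a trace could be labeled with where a run first went wrong.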