APEX-Agents: frontier models fail most long-horizon workplace tasks on the first try
- Mercor researchers published APEX-Agents in January, a new benchmark testing eight frontier AI agents on 480 workplace tasks built by bankers, consultants, and lawyers inside realistic software environments.
- The best reported Pass@1 score was 24.0% for Gemini 3 Flash with high reasoning, leaving roughly three out of four long-horizon tasks failing on a single run.
- The paper adds evidence that agent performance drops when work spans many tools, files, and turns, not just one prompt. (arxiv.org)
APEX-Agents is a new benchmark showing that frontier AI agents still fail most long, real-world office tasks on the first try. (arxiv.org)

Mercor’s researchers said the benchmark measures whether agents can execute cross-application work created by investment banking analysts, management consultants, and corporate lawyers. The dataset includes 480 tasks and tests eight agents with a Pass@1 metric, which means one shot, no retries. (arxiv.org) (huggingface.co)

In plain language, these are not trivia questions. The agents have to move through files and tools such as documents, spreadsheets, PDFs, email, chat, and calendars, then produce work products that satisfy a rubric. (arxiv.org) (huggingface.co)

The top leaderboard result in the paper was 24.0% Pass@1 for Gemini 3 Flash with “Thinking=High.” The paper said GPT-5.2 with high reasoning, Claude Opus 4.5 with high reasoning, and Gemini 3 Pro with high reasoning followed behind it. (arxiv.org)

That means even the best system in the paper failed about 76% of tasks on a single run. Put differently, long-horizon professional work still breaks current agents far more often than headline demos suggest. (arxiv.org) (huggingface.co)

The benchmark was built from simulated work projects that industry professionals carried out over five to 10 days before turning those materials into tasks. Those tasks were then graded with 1 to 10 criteria each, with a judge model checking whether the agent’s output and file changes met every requirement. (arxiv.org) (huggingface.co)

This is the hard part of agent evaluation: errors can compound across many turns. Anthropic wrote in January that agent evals are tougher than single-turn tests because agents call tools, modify state, and adapt as they go, so one mistake can poison the rest of the run. (anthropic.com)

OpenAI has described the same problem from the other direction. In a February post about long-horizon coding with Codex, the company said the practical shift is whether agents can stay coherent, validate work, and repair failures over time, not just answer one prompt well. (developers.openai.com)

APEX-Agents does not say agents are useless. It says reliable deployment still depends on the surrounding system: retries, checks, grading, and monitoring matter because the model alone does not complete most end-to-end tasks in one pass. (arxiv.org) (anthropic.com)

The paper’s closing point is narrower than the hype cycle. Frontier agents can do some professional work, but on this benchmark, first-attempt reliability is still the exception rather than the rule. (arxiv.org)
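
For readers who want to see how the scoring described in this article fits together, here is a minimal Python sketch. It assumes a task passes only when the judge marks every rubric criterion as met, and that Pass@1 is the share of tasks passed when each is attempted once. The names `TaskResult`, `task_passed`, and `pass_at_1` are hypothetical, the verdicts are made up, and the paper's actual grading harness may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent run on one task (hypothetical structure, not from the paper)."""
    task_id: str
    criteria_met: list[bool]  # one judge verdict per rubric criterion


def task_passed(result: TaskResult) -> bool:
    # A task counts as passed only if every rubric criterion is satisfied.
    # Assumption: a task with no criteria is treated as a failure.
    return bool(result.criteria_met) and all(result.criteria_met)


def pass_at_1(results: list[TaskResult]) -> float:
    # Fraction of tasks passed when each task is attempted exactly once.
    if not results:
        raise ValueError("no results to score")
    return sum(task_passed(r) for r in results) / len(results)


# Made-up verdicts for four tasks, purely for illustration.
runs = [
    TaskResult("t1", [True, True, True]),
    TaskResult("t2", [True, False]),          # one missed criterion fails the task
    TaskResult("t3", [True]),
    TaskResult("t4", [True, True, False, True]),
]

print(pass_at_1(runs))  # 0.5
```

Under this all-or-nothing rule, a run that meets three of four criteria scores the same as one that meets none, which is one reason single-run pass rates on long tasks can sit well below what partial-credit scoring would show.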