ServiceNow benchmark: top AI agent completes only 37.4% of tasks

- ServiceNow researchers released EnterpriseOps-Gym, a public benchmark for enterprise AI agents, and the best model, Claude Opus 4.5, completed only 37.4% of tasks.
- The benchmark covers 1,150 tasks across eight enterprise domains; giving models human-written plans boosted success by 14 to 35 percentage points.
- That matters because enterprise agent failures look less like bad clicking and more like bad workflow planning.

Enterprise AI agents are supposed to do the boring office work: update records, route tickets, answer HR requests, coordinate across apps. But the hard part was always going to be the messy middle, where a task unfolds over many steps, state changes underneath you, and one bad action can break a workflow. That is exactly what ServiceNow’s new EnterpriseOps-Gym benchmark tries to measure. The headline result is blunt: the best model in the test, Claude Opus 4.5, got only 37.4% of tasks fully right. (arxiv.org)

### What is this benchmark actually testing?

EnterpriseOps-Gym is a simulated enterprise environment, not a toy quiz. It gives agents 1,150 expert-curated tasks across eight domains (HR, IT service management, customer service, email, calendar, drive, teams, and hybrid workflows) inside a containerized setup with persistent state, 164 database tables, and 512 tools. (arxiv.org) In the benchmark, an agent has to search, decide, act, and keep track of consequences. (arxiv.org)

### Why is 37.4% such a bad number?

Because this is not “got part of it right” scoring. The benchmark measures successful task completion in workflows that often span 7 to 30 steps. If the best system finishes barely more than one in three tasks correctly, current agents are still unreliable as autonomous workers in realistic enterprise settings. They can look fluent while still failing the actual job. (arxiv.org)

### Which model topped the leaderboard?

The paper says Claude Opus 4.5 was the top performer at 37.4%, after evaluations on 14 frontier models. That matters less as a horse-race result than as a ceiling. If the leader is still under 40%, the story is not “one lab won.” The story is that the whole category is early when you move from chat demos to stateful business operations. (arxiv.org)

### What is failing: tools or planning?

The main bottleneck looks like planning. The researchers ran “oracle” experiments where agents got human-authored plans before executing the task. Performance jumped by 14 to 35 percentage points across models. When the plan came from another model instead of a human, the gains were smaller, roughly 6 to 13 points. (arxiv.org) That is a big clue. The agents struggle to choose the right sequence of actions and adapt when the workflow branches. (emergentmind.com)

### Why does enterprise work make this harder?

Because enterprise tasks are full of hidden constraints. Permissions differ by role. Records change while you work. One app depends on another. And many tasks are policy-heavy, not just mechanically complex. The benchmark notes especially weak performance in domains like ITSM and hybrid workflows compared with simpler collaboration tasks. (emergentmind.com) The more a task depends on real operational logic, the faster the agent falls apart. (metaailabs.com)

### Does this mean enterprise agents are overhyped?

Not exactly, but it does puncture the idea that today’s agents are ready to run unattended across serious workflows. The useful reading is narrower: agent execution is improving, but reliable workflow design still needs strong scaffolding. Human-written plans, guardrails, and review loops are not temporary training wheels. Right now they look like the product. (arxiv.org)

### What should companies take from it?

Treat autonomy as a spectrum, not a switch. Use agents where the workflow is narrow, the blast radius is low, and a human can verify the result. Invest in planning layers and oversight before promising full automation. The benchmark’s real message is simple: enterprise AI is not being held back by clicking buttons. It is being held back by thinking through the job. (enterpriseops-gym.github.io)
