ServiceNow benchmark exposes agent limits

ServiceNow’s EnterpriseOps‑Gym benchmark stresses agents with realistic enterprise tasks spanning 164 database tables and 512 tools, and top models peaked at about 37.4% success, underlining that planning and orchestration, not raw model size, are the failure points. That is a stark reminder that agent reliability needs structured evaluation and tool-aware planning before rollouts. (x.com)

ServiceNow Research led the paper with collaborators at Mila and Université de Montréal; the arXiv submission lists Shiva Krishna Reddy Malay and Shravan Nayak as first authors. (arxiv.org) The benchmark comprises 1,150 expert‑curated tasks across eight enterprise domains (Calendar, CSM, Drive, Email, HR, ITSM, Teams, and Hybrid), with execution trajectories averaging 9 steps and reaching 34 steps for some tasks. (arxiv.org) The authors report a mean foreign‑key degree of 1.7, a level of schema complexity that forces agents to preserve referential integrity during multi‑step workflows. (marktechpost.com) The published results cover fourteen frontier models, including Claude Opus 4.5, Gemini‑3‑Flash, GPT‑5.2 (High), GPT‑5, DeepSeek‑V3.2, and GPT‑OSS‑120B. (marktechpost.com) Measured inference costs per completed task varied widely: the paper and accompanying coverage report $0.36 for Claude Opus 4.5, $0.03 for Gemini‑3‑Flash, $0.014 for DeepSeek‑V3.2, and $0.015 for GPT‑OSS‑120B. (marktechpost.com) The EnterpriseOps‑Gym codebase, Dockerized sandbox, and synthetic datasets for reproducible experiments were released on GitHub and published to Hugging Face under the EnterpriseOps‑Gym project. (github.com)
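
To put the cost spread in perspective, here is a rough back-of-the-envelope sketch in Python. It takes only the per-task figures and the 1,150-task count quoted above; the function names are hypothetical, and the full-suite totals assume every task costs the quoted figure, which the source does not confirm.

```python
# Per-task inference costs in USD, as quoted in the coverage above.
COST_PER_TASK_USD = {
    "Claude Opus 4.5": 0.36,
    "Gemini-3-Flash": 0.03,
    "DeepSeek-V3.2": 0.014,
    "GPT-OSS-120B": 0.015,
}

TASK_COUNT = 1150  # number of tasks reported for the benchmark


def full_suite_cost(cost_per_task: float, tasks: int = TASK_COUNT) -> float:
    """Estimate the cost of one pass over the suite.

    Assumption: every task costs the quoted per-task figure.
    Real runs will vary with task length and retries.
    """
    return round(cost_per_task * tasks, 2)


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more expensive `model` is per task than `baseline`."""
    return round(COST_PER_TASK_USD[model] / COST_PER_TASK_USD[baseline], 1)


if __name__ == "__main__":
    for model, cost in COST_PER_TASK_USD.items():
        print(f"{model}: about ${full_suite_cost(cost):.2f} per full pass")
    ratio = cost_ratio("Claude Opus 4.5", "DeepSeek-V3.2")
    print(f"Claude Opus 4.5 vs DeepSeek-V3.2: about {ratio}x per task")
```

On these assumptions, a single pass costs roughly $414 at the highest quoted per-task price and about $16 at the lowest, a gap of around 26x per task.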

