Benchmark: agents struggle with enterprise planning

The EnterpriseOps‑Gym benchmark tested stateful agentic planning over 164 database tables and policy constraints and found that top models reached only 37.4% success, highlighting gaps in strategic reasoning and constraint handling in real enterprise workloads. It is a quantifiable reminder that agent reasoning still fails systematically on complex, constrained tasks.

The benchmark runs inside a containerized sandbox instrumented with 512 functional tools and a suite of 1,150 expert‑curated tasks across eight enterprise domains. (arxiv.org) Each task executes against pre‑seeded SQL snapshots and is scored by outcome‑based SQL verifiers that validate the final database state rather than matching action traces. (github.com) Task trajectories average 9.15 steps and reach up to 34 steps, and the dataset has a mean foreign‑key degree of 1.7, creating dense referential coupling that stresses multi‑step planning. (github.com)

The authors evaluated 14 frontier models and report that providing oracle human plans improves performance by 14–35 percentage points, indicating that strategic reasoning (plan synthesis and decomposition) is the dominant failure mode. (arxiv.org) Refusal and safety behavior were measured explicitly: the best model correctly refused infeasible tasks only 53.9% of the time, exposing a persistent policy‑adherence gap. (arxiv.org) Performance is also domain‑dependent: models do comparatively better on collaboration domains (Email, Teams) and notably worse on policy‑heavy domains such as ITSM and hybrid, where reported scores fall into the high‑20s range. (marktechpost.com)

ServiceNow published the benchmark code, bundled seed databases (gym_dbs.zip), and a corresponding Hugging Face dataset to enable local reproducibility and verification workflows for platform testing. (github.com)
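To make the idea of outcome‑based verification concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It is not the benchmark's actual verifier code: the schema, the task, and every function name (`seed_db`, `verify`, `trajectory_a`, and so on) are hypothetical and chosen only for illustration. It shows how a check on the final database state accepts two different action sequences that reach the same outcome and rejects one that leaves a dangling reference.

```python
import sqlite3


def seed_db():
    """Build an in-memory database standing in for a pre-seeded snapshot.

    The schema is hypothetical, not taken from the benchmark.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE tickets (
            id INTEGER PRIMARY KEY,
            status TEXT NOT NULL,
            assignee_id INTEGER REFERENCES users(id)
        );
        INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO tickets VALUES (10, 'open', NULL);
        """
    )
    return conn


def verify(conn):
    """Outcome check: ticket 10 is resolved and assigned to an existing user.

    Only the final state is inspected; the steps that produced it are ignored.
    """
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM tickets t
        JOIN users u ON u.id = t.assignee_id
        WHERE t.id = 10 AND t.status = 'resolved'
        """
    ).fetchone()
    return row[0] == 1


def trajectory_a(conn):
    # Two separate steps: assign first, then resolve.
    conn.execute("UPDATE tickets SET assignee_id = 1 WHERE id = 10")
    conn.execute("UPDATE tickets SET status = 'resolved' WHERE id = 10")


def trajectory_b(conn):
    # One combined step with a different (but valid) assignee.
    conn.execute(
        "UPDATE tickets SET status = 'resolved', assignee_id = 2 WHERE id = 10"
    )


def trajectory_bad(conn):
    # Resolves the ticket but points at a user that does not exist.
    # SQLite does not enforce foreign keys by default, so the write succeeds
    # and only the verifier's JOIN catches the broken reference.
    conn.execute(
        "UPDATE tickets SET status = 'resolved', assignee_id = 99 WHERE id = 10"
    )


def run(trajectory):
    """Apply a trajectory to a fresh snapshot and score the final state."""
    conn = seed_db()
    try:
        trajectory(conn)
        return verify(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for name, traj in [("a", trajectory_a), ("b", trajectory_b), ("bad", trajectory_bad)]:
        print(name, run(traj))
```

Scoring the end state rather than the trace means an agent is not penalized for taking a different valid path, while referential mistakes of the kind that dense foreign‑key coupling invites still fail the check.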
