Early enterprise agents succeed on ~37% of long workflows, study finds

- ServiceNow researchers’ WorkArena++ benchmark showed top web agents still fail most realistic enterprise workflows, completing only a small share of 682 multi-step tasks.
- The gap gets brutal as tasks lengthen: humans solved 93.9% overall, while frontier models managed low single digits on the hardest workflows.
- That matters because enterprises are deploying agents fast, but production value now depends less on demos and more on orchestration.

Enterprise agents are supposed to do the boring office work: reset accounts, route approvals, update records, close tickets. That is the pitch. But once you ask them to handle the long version of the job, not just one click or one answer, performance drops hard. The clearest evidence comes from ServiceNow Research’s WorkArena and WorkArena++ benchmarks, which try to mimic the kind of browser-based work people do in HR and IT service systems. The basic story is simple: agents can often handle pieces of enterprise work, but they still fall apart on full workflows. (arxiv.org)

### What is being measured here?

WorkArena is a benchmark built on the ServiceNow platform, the kind of software companies use for ITSM, HR service delivery, and internal operations. The original benchmark focused on atomic tasks like filling forms, retrieving information, or making specific updates in a browser. WorkArena++ extends that into 682 more realistic, compositional tasks that bundle planning, retrieval, and reasoning. That is much closer to what “enterprise agent” actually means in practice. (arxiv.org)

### Why do long workflows break agents?

Because the hard part is not one action. It is carrying state across many actions without drifting. A real workflow might require finding the right record, interpreting policy, doing a quick calculation, choosing the right next step, and then updating multiple fields in the right order. Miss one dependency and the whole run is wrong. The benchmark was designed specifically to test planning, problem-solving, logical or arithmetic reasoning, retrieval, and contextual understanding, rather than isolated clicks. (arxiv.org)

### So what were the actual results?

On WorkArena++, human workers solved 93.9% of the tasks. The best frontier models in the paper were nowhere close, and performance on the hardest multi-step tasks dropped to low single digits.
That is the punchline people keep compressing into a single scary stat: enterprise agents are not broadly reliable on long workflows yet. They may look competent in demos or on short task f[...] bottleneck. (arxiv.org)

### Does that mean enterprise agents are overhyped?

Not exactly. It means the market is ahead of the capability curve. Enterprises are adopting agents fast: OutSystems said 96% of surveyed organizations are already using AI agents in some capacity, and Gartner’s forecast points to agents spreading through enterprise apps by the end of 2026. But adoption is not the same thing as dependable autonomy. A lot of what is [...] because the scope is narrow, the workflow is structured, and humans remain in the loop. (outsystems.com)

### Why are some companies still reporting good outcomes?

Because constrained systems can work really well. If a company narrows the task, adds guardrails, structures the data, and defines escalation paths, the agent has fewer ways to go off the rails. That is why vendor case studies can show strong delivery or resolution numbers while research benchmarks still look rough. They a[...] competence. Deloitte makes basically this point: the winners are redesigning operations around agents, not just dropping agents into human-built processes. (deloitte.com)

### What is the missing layer?

Orchestration. Enterprises need systems that break goals into steps, verify progress, recover from mistakes, and hand work to humans when confidence drops. Think less “one super-agent does everything” and more “planner, executor, checker, and escalation logic working together.” The benchmark results make that pretty clear: the failure is usually not language alone, but coordination over time. (arxiv.org)

### Why does this matter right now?

Because companies are moving from chatbot experiments to workflow automation budgets.
If leaders mistake partial competence for full autonomy, they will automate the wrong layer and get brittle systems. If they treat agents as components inside supervised workflows, they can still get real value now. (outsystems.com)

[...] is real, but the version that works today is narrower than the hype. Agents are getting useful at pieces of work. Full, messy, multi-step office workflows are still mostly a planning problem, and that is where the gap remains.
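
For readers who want to see the "planner, executor, checker, and escalation" idea from "What is the missing layer?" in concrete form, here is a minimal Python sketch. It is an illustration only, not anything taken from the WorkArena++ paper or from a vendor product. Every name in it (`plan`, `run_workflow`, `StepResult`, `demo_execute`), the fixed three-step plan, the self-reported confidence score, and the 0.8 threshold are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    step: str
    ok: bool           # did the executor believe the step succeeded?
    confidence: float  # hypothetical 0..1 self-reported score


@dataclass
class RunLog:
    completed: List[str] = field(default_factory=list)
    escalated_at: Optional[str] = None  # step handed to a human, if any


def plan(goal: str) -> List[str]:
    # Toy planner: a fixed decomposition standing in for a real planning model.
    return ["find_record", "check_policy", "update_record"]


def run_workflow(
    goal: str,
    execute: Callable[[str], StepResult],
    check: Callable[[StepResult], bool],
    min_confidence: float = 0.8,  # assumed threshold, not a recommended value
    max_retries: int = 1,
) -> RunLog:
    log = RunLog()
    for step in plan(goal):
        for _ in range(max_retries + 1):
            result = execute(step)
            # Verify progress with a separate checker, not just the executor's claim.
            if result.confidence >= min_confidence and check(result):
                log.completed.append(step)
                break
        else:
            # Retries exhausted: stop and hand off instead of drifting onward.
            log.escalated_at = step
            return log
    return log


def demo_execute(step: str) -> StepResult:
    # Hypothetical executor that is unsure about the final write.
    confidence = 0.4 if step == "update_record" else 0.95
    return StepResult(step=step, ok=True, confidence=confidence)


log = run_workflow("reset an account", demo_execute, check=lambda r: r.ok)
print(log.completed, log.escalated_at)
```

In this toy run the first two steps complete and the low-confidence `update_record` step is escalated rather than executed blindly. The design choice worth noticing is that the loop stops at the first step it cannot verify, which is one simple way to keep an early mistake from propagating through the rest of a long workflow.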
