Benchmark finds agents fail at least 72% of healthcare workflows
- A new benchmark reported that agents built on Claude, GPT and Gemini failed at least 72% of tested U.S. healthcare workflows, with the best performer passing only 28%, highlighting brittleness in agentic automation. (markets.financialcontent.com)
- The study framed the failures as common in messy, exception-heavy healthcare processes where agents lack robust business-rule gating and integration. (markets.financialcontent.com)
- That result supports constrained, deterministic automation inside EHR workflows, with frequent human-in-the-loop checkpoints. (markets.financialcontent.com)
1/ A new healthcare benchmark is getting attention because it claims today’s leading AI agents still break on most real-world workflows. The headline result: the best-performing agent passed only 28% of tested cases across 75 healthcare workflows, meaning it failed about 72%. (markets.financialcontent.com)

2/ The benchmark is called CHI-Bench, released by AI company actAVA.ai on May 20, 2026. actAVA says it is an open-source benchmark for “long-horizon, policy-rich U.S. healthcare workflows,” with code, data and a live leaderboard published on its benchmarks site. (markets.financialcontent.com)

3/ What’s being tested here is not medical trivia or document extraction. actAVA says CHI-Bench is designed to simulate operational work that unfolds over many steps, across apps, documents, teams and business rules. That matters because many public AI benchmarks still measure what a model knows, not whether an agent can complete a workflow. (markets.financialcontent.com)

4/ The setup is large by benchmark standards. actAVA says CHI-Bench covers 75 workflows, spans 3 domains (prior authorization, utilization management and care management) and exposes agents to 21 healthcare apps through 200+ MCP tools, plus a 1,279-document operations handbook. (markets.financialcontent.com)

5/ The benchmark’s core claim is that healthcare work fails in ways ordinary agent demos often hide. In actAVA’s framing, one missed policy check can trigger a denied authorization, delayed treatment or audit problem, so reliability across a long workflow matters more than one-shot answer quality. (markets.financialcontent.com)

6/ On the leaderboard actAVA described in its release, Anthropic’s Claude Code with Opus 4.6 posted the top overall result at 28% pass@1, followed by OpenAI’s Codex with GPT-5.5 at 21%. Domain-level results in the release were 41% for utilization review, 32% for care management and 29% for prior-authorization paperwork. (markets.financialcontent.com)

7/ The more damaging numbers are on repeatability and endurance. actAVA said no agent cleared 20% when the same case was run three times. In an endurance test of 25 cases in one session, the best system completed under 4%. In a fully end-to-end setup with one AI submitting prior auth and another acting as reviewer, no task passed. (markets.financialcontent.com)

8/ That is the key point of the story: the benchmark is arguing that “agentic” performance degrades sharply once you move from isolated tasks to messy, exception-heavy operations. actAVA’s own docs describe CHI-Bench as focused on long-horizon workflows with branching decisions, tool use and clear outcomes, which is consistent with that interpretation. (actava.ai)

9/ There is an important caveat. This is not, from the material I could verify, a peer-reviewed journal paper. The most visible write-up is a syndicated press release, and the benchmark is published by a company that also sells healthcare AI evaluation and compliance products. That does not make the result false, but it does mean readers should treat it as a vendor-published benchmark pending independent replication. (markets.financialcontent.com)

10/ Even so, the result fits a broader pattern in healthcare AI deployment: narrow, bounded automation tends to hold up better than open-ended autonomy. Google Cloud’s own healthcare materials describe agents as acting under user control and supervision, not as unsupervised replacements for complex clinical operations. (cloud.google.com)

11/ So what should operators take from this? Not “agents are useless.” More like: if you want AI in healthcare workflows, constrain the action space. Put deterministic business-rule checks around the model. Keep write-backs auditable. Use humans for exception handling, approvals and edge cases. That conclusion is an inference from the benchmark design and results, not a direct quote from an outside regulator. (markets.financialcontent.com)

12/ The practical near-term use case is likely not “let the agent run the whole workflow.” It is using AI at the interface layer: collecting missing info, drafting artifacts, summarizing blockers, routing work, and handing off before a consequential state change. The benchmark’s failure pattern is strongest precisely where long chains, cross-system dependencies and policy checks stack up. (markets.financialcontent.com)

13/ One more detail worth watching: actAVA says CHI-Bench was built with a coalition of 20+ institutions, including Johns Hopkins, Wellstar, Yale, Stanford, Carnegie Mellon, Oxford, USC and UC San Diego, and names researchers including Sanmi Koyejo, Eric P. Xing and Philip S. Yu. Independent scrutiny of the benchmark, leaderboard submissions and reproducibility will matter more than the launch headline. (markets.financialcontent.com)

14/ Net: the benchmark does not prove healthcare agents cannot work. It does suggest that, as of May 2026, frontier agents remain unreliable on full-stack healthcare workflows, especially when the job requires consistency across many steps, systems and policy constraints. (markets.financialcontent.com)
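
Addendum: for readers who want the "constrain the action space" advice from 11/ and 12/ in concrete form, here is a minimal Python sketch of a deterministic gate placed between an agent and any write-back. Everything in it is an illustrative assumption, not something taken from CHI-Bench or actAVA: the action kinds, the `case_id` field, the rule functions and the `Gate` class are all hypothetical names. The idea is only that plain, testable rules decide whether a proposed action auto-applies or goes to a human, and every decision is logged.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPLY = "apply"        # low-consequence action, may be written back
    ESCALATE = "escalate"  # must be handed to a human reviewer


@dataclass(frozen=True)
class ProposedAction:
    """Hypothetical shape of an action an agent asks to perform."""
    kind: str      # e.g. "draft_summary", "submit_prior_auth"
    payload: dict


# Assumed allow-list: only interface-layer actions may auto-apply.
AUTO_APPLY_KINDS = {"draft_summary", "request_missing_info", "route_case"}

# A rule returns None when satisfied, or a human-readable reason when not.
Rule = Callable[[ProposedAction], Optional[str]]


def rule_allowed_kind(action: ProposedAction) -> Optional[str]:
    if action.kind not in AUTO_APPLY_KINDS:
        return f"action kind '{action.kind}' requires human approval"
    return None


def rule_has_case_id(action: ProposedAction) -> Optional[str]:
    if not action.payload.get("case_id"):
        return "missing case_id"
    return None


@dataclass
class Gate:
    """Runs every rule deterministically and records an audit entry."""
    rules: list[Rule]
    audit_log: list[dict] = field(default_factory=list)

    def review(self, action: ProposedAction) -> Decision:
        reasons = [r for rule in self.rules if (r := rule(action)) is not None]
        decision = Decision.ESCALATE if reasons else Decision.APPLY
        self.audit_log.append(
            {"kind": action.kind, "decision": decision.value, "reasons": reasons}
        )
        return decision


if __name__ == "__main__":
    gate = Gate(rules=[rule_allowed_kind, rule_has_case_id])
    for proposed in (
        ProposedAction("draft_summary", {"case_id": "C-1"}),
        ProposedAction("submit_prior_auth", {"case_id": "C-1"}),
    ):
        print(proposed.kind, "->", gate.review(proposed).value)
```

The design choice worth noting is that the model never decides whether its own action is safe: the allow-list and field checks are ordinary code that can be unit-tested and audited, and anything outside them defaults to escalation.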