Turing finds tool-order failures dominate

- Turing published a new agent-evaluation case study showing its verifier-based HR benchmark can separate models by real execution, not polished final answers.
- The sharpest result: tool sequencing drove over 80% of failures in top models, while one leading model scored about 70% pass rate.
- It matters because agent benchmarks are shifting from chatbot-style grading to execution traces, state changes, and reproducible failure taxonomies.

AI agents are starting to hit a very specific wall. Not language. Not even raw intelligence, exactly. The wall is execution: picking the right tool, in the right order, with the right parameters, and carrying state across a long workflow without drifting. That is the point of Turing’s new evaluation write-up, which turns a vague complaint about “agent unreliability” into something much more concrete: most failures came from workflow mechanics, not from bad-looking prose.

### What did Turing actually publish?

Turing put out a case study on an execution-grounded benchmark for agent workflows in HR operations. The setup used 100+ workflow tasks, 1,000+ runs per model, and 3,000+ automated checks to verify whether the agent really completed the job inside the system. That means the benchmark cared about database changes, triggered workflows, and ordered tool calls, not whether the final answer merely sounded competent.

### Why is that different from normal benchmark talk?

Most model benchmarks still grade the visible surface: the final text. But agents fail in the invisible layer. They call the wrong function. They pass a made-up argument. They skip a prerequisite step and then keep going as if nothing broke. Turing’s benchmark is built around that hidden layer, so it can catch execution failures that a human reading the final response might miss.

### So what broke most often?

Tool sequencing dominated. Turing says errors in execution order affected over 80% of failures in the top models. That is a useful result because it narrows the problem. The biggest issue was not always “the model doesn’t know the answer.” It was often “the model knows roughly what to do but performs the steps in the wrong order,” which is more like a workflow-control problem than a pure knowledge problem.

### What about parameter hallucinations?

Those showed up too, but unevenly. Turing describes parameter-use failures as a model-specific weakness, largely absent in top-tier models and much more common in weaker ones. In plain English, some agents were not just choosing the wrong step; they were inventing or misfilling the inputs to otherwise valid tools. That is the software equivalent of dialing the right department with the wrong account number.

### How big were the performance gaps?

Pretty big. Turing says the benchmark produced clear separation between models, with the top model around a 70% pass rate, another around 50%, and a much weaker one near 5%. That spread matters because it suggests execution-grounded tests can expose real differences that flatter benchmarks blur away. A model that writes smooth explanations can still fall apart once it has to manage state across multiple actions.

### Why use verifiers at all?

Because they grade outcomes, not vibes. Turing’s verifiers checked whether required actions actually happened after tool execution by comparing expected and actual system state. Basically, this is closer to a unit test than an essay rubric. If an onboarding workflow was supposed to update a record and trigger a background check, the benchmark checked whether those things happened.

### Is this just about HR workflows?

No. HR is the test bed, not the whole point. The broader claim is that agent evaluation needs repeatable environments, state tracking, and failure taxonomies that map onto production systems. Turing is making the case that if you want reliable agents, you should audit execution paths first: tool choice, ordering, parameters, recovery, and state transitions.

### What’s the bottom line?

The headline is not that agents are bad at language. It is that they are still brittle at procedure. Turing’s result gives that brittleness a shape: the dominant failure mode was step ordering. That is useful news because it points builders toward a fixable layer (orchestration, verification, and state management) instead of treating every bad run as a mysterious reasoning collapse.
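
For readers who want the verifier idea in concrete form, here is a minimal sketch of what an execution-grounded check might look like. It is not Turing's code: the tool names, state keys, and failure labels are all hypothetical, and it only assumes what the write-up describes, namely that a run is graded on ordered tool calls and on expected versus actual system state.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation from an agent trace (hypothetical shape)."""
    name: str
    params: dict = field(default_factory=dict)


def check_order(calls, required_order):
    """True if required_order appears, in order, among the calls made."""
    names = iter(call.name for call in calls)
    # `step in names` consumes the iterator, so each match must come
    # after the previous one: a standard subsequence test.
    return all(step in names for step in required_order)


def check_state(expected, actual):
    """Return the sorted keys whose actual value differs from the expected one."""
    return sorted(k for k, v in expected.items() if actual.get(k) != v)


def verify_run(calls, actual_state, required_order, expected_state):
    """Grade a run on execution, attaching failure labels (labels are made up)."""
    failures = []
    if not check_order(calls, required_order):
        failures.append("tool_sequencing")
    mismatched = check_state(expected_state, actual_state)
    if mismatched:
        failures.append("state_mismatch:" + ",".join(mismatched))
    return {"passed": not failures, "failures": failures}


# Hypothetical onboarding task: update a record, then trigger a background check.
required_order = ["update_employee_record", "trigger_background_check"]
expected_state = {"record_status": "active", "background_check": "requested"}

good_calls = [ToolCall("update_employee_record"), ToolCall("trigger_background_check")]
bad_calls = [ToolCall("trigger_background_check"), ToolCall("update_employee_record")]

# Both runs end in the same state; only the call order differs.
final_state = {"record_status": "active", "background_check": "requested"}

good_result = verify_run(good_calls, final_state, required_order, expected_state)
bad_result = verify_run(bad_calls, final_state, required_order, expected_state)
```

In this toy setup the second run reaches the expected end state but still fails, because the steps ran in the wrong order. That is the kind of failure a reader of the final answer would likely miss and an order-aware verifier would flag.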


Shared from Scout - Be the smartest in the room.