Claw-Eval-Live shows 66.7% pass rate

- Researchers behind Claw-Eval-Live published a new agent benchmark on April 30, 2026, built from live workflow demand and trace-based grading.
- The paper says the public release has 105 tasks across 17 families, and the best of 13 tested models passed 66.7%.
- That matters because static answer-only evals can miss whether agents actually did the work they claimed to finish.

AI agent benchmarks are starting to split into two camps. One camp asks whether a model can produce a convincing answer. The other asks whether it can actually do the job: click the right things, change the right state, leave behind the right files, and finish without making stuff up. Claw-Eval-Live lands squarely in the second camp, and that is why people are paying attention. The paper went up on arXiv on April 30, 2026, with a public release built to test real workflow execution rather than polished final text. (arxiv.org)

### What is Claw-Eval-Live, exactly?

It is a benchmark for workflow agents, the kind of systems meant to operate across business apps, local files, shells, and multi-step tasks. Instead of freezing a task list once and keeping it forever, the benchmark says it pulls from live marketplace demand signals, then turns those signals into a time-stamped [...]. In this release, the task mix comes from ClawHub Top-500 skills and covers both service workflows and workspace repair. (arxiv.org)

### Why are people calling it “live”?

Because the benchmark is trying to track what people actually need agents to do now, not what looked representative months ago. That sounds small, but it changes the whole point of the test. A frozen benchmark slowly turns into an exam for yesterday’s workflows. Claw-Eval-Live is trying to stay attached to current [...] snapshot for others to run. (arxiv.org)

### What makes the grading different?

The big shift is evidence. The benchmark records execution traces, audit logs, service state, and post-run workspace artifacts. Then it uses deterministic checks when the evidence is concrete, and structured LLM judging only when the task has a semantic output that cannot be verified mechanically. Basically, [...] sounds right than whether the environment shows the job got done. (arxiv.org)

### So what were the actual results?

The paper’s public release includes 105 tasks across 17 task families and evaluates 13 frontier models under one pass rule. The headline number is blunt: the leading model passed only 66.7% of tasks, and no tested model reached 70%. That is the part people are latching onto, because it says “pretty good demo” still does not mean “production reliable.” (arxiv.org)

### Where do agents break?

Not evenly. The paper says failures cluster by task family and execution surface. HR, management, and multi-system business workflows were persistent bottlenecks, while local workspace repair was easier but still not saturated. That pattern makes intuitive sense: patching files in one environment is hard, but coordinating [...] hidden dependencies is the harder version of the trick. (arxiv.org)

### Is leaderboard rank the whole story?

No, and this is one of the more useful points in the paper. Models with similar pass rates can still differ in overall completion behavior, so a single rank can flatten important differences. The authors also say the most discriminative tasks sit in a middle band, neither trivial nor impossible, which is where reliability gaps show up most clearly. (arxiv.org)

### Why does this matter beyond one benchmark?

Because agent buyers and builders keep running into the same problem. Offline evals often reward plausible-looking outputs, but production failures come from missed steps, broken tool use, and hallucinated completion. Claw-Eval-Live’s argument is that useful evaluation has to be grounded twice: first [...] action logs. (arxiv.org)

### Bottom line?

The interesting news is not just that 66.7% is lower than people want. It is that a benchmark built to inspect real execution still finds a wide reliability gap in 2026. Agents are getting better, but the hard part is no longer sounding capable. The hard part is finishing the workflow and leaving proof behind. (arxiv.org)
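
To make the grading idea above concrete, here is a minimal Python sketch of evidence-first grading: deterministic checks run against recorded traces and service state, and a judge is consulted only when a task has semantic output. This is an illustration, not the benchmark's actual harness. Every name in it (`RunEvidence`, `Task`, `grade`, `pass_rate`, the `tickets.close` tool call, the `ticket-42` key) is hypothetical and does not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunEvidence:
    """Hypothetical record of what an agent run left behind."""
    trace: list[str] = field(default_factory=list)            # ordered tool calls
    service_state: dict[str, str] = field(default_factory=dict)
    final_answer: str = ""                                     # what the agent claims


@dataclass
class Task:
    """Hypothetical task definition (not the paper's schema)."""
    task_id: str
    # Deterministic checks over concrete evidence; all must hold.
    checks: list[Callable[[RunEvidence], bool]]
    # Optional rubric for semantic output that cannot be verified mechanically.
    rubric: Optional[str] = None


def grade(task: Task, evidence: RunEvidence,
          judge: Optional[Callable[[str, str], bool]] = None) -> bool:
    """Pass only if the environment shows the work was done."""
    # Concrete evidence first: a confident final answer cannot rescue a failed check.
    if not all(check(evidence) for check in task.checks):
        return False
    # No semantic component, so the deterministic checks decide.
    if task.rubric is None:
        return True
    if judge is None:
        raise ValueError(f"task {task.task_id} needs a judge for its rubric")
    # The judge (an LLM in practice, a stub here) sees only the semantic output.
    return judge(task.rubric, evidence.final_answer)


def pass_rate(results: list[bool]) -> float:
    """Fraction of runs that passed; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


ticket_task = Task(
    task_id="close-ticket",
    checks=[
        lambda ev: ev.service_state.get("ticket-42") == "closed",
        lambda ev: "tickets.close" in ev.trace,
    ],
)

# Same claimed outcome, different evidence.
claimed_only = RunEvidence(final_answer="Ticket 42 has been closed.")
did_the_work = RunEvidence(
    trace=["tickets.get", "tickets.close"],
    service_state={"ticket-42": "closed"},
    final_answer="Ticket 42 has been closed.",
)

results = [grade(ticket_task, claimed_only), grade(ticket_task, did_the_work)]
print(results)             # [False, True]
print(pass_rate(results))  # 0.5
```

Both runs produce the same final sentence, but only the second leaves a matching trace and service state, so an answer-only eval would score them identically while this sketch does not.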

