Claw‑Eval‑Live shows 66.7% pass

- Chenxin Li and collaborators posted Claw‑Eval‑Live on April 30, a new agent benchmark that turns public workflow demand into 105 scored tasks.
- The headline number is blunt: the best model passed 66.7% of tasks, and the paper says no evaluated model cleared 70%.
- That matters because agent demos keep improving, but end‑to‑end reliability on real workflows still breaks on multi‑system business work.

Agent benchmarks usually have a fake-ceiling problem. Models learn the shape of a frozen test, scores creep up, and everyone starts sounding more confident than the systems deserve. Claw‑Eval‑Live is trying to break that loop. The new benchmark, posted on arXiv on April 30, 2026, swaps in a live demand signal and then checks what agents actually did, not just whether the final answer looked plausible.

### What is this thing, exactly?

Claw‑Eval‑Live is a benchmark for workflow agents: the kind of systems that are supposed to move through business software, local files, and multi-step tasks without a human stitching the steps together. The setup matters because this is not just “answer a question” AI. It is “open the right tool, change the right state, leave the workspace in the right condition” AI.

### What changed from older benchmarks?

The big change is the task source. Instead of freezing a handcrafted set once and calling it representative forever, the team says it builds releases from public workflow-demand signals, using ClawHub Top‑500 skills for the current release, then packages them into reproducible tasks with fixed fixtures, services, and graders. The result is basically a test environment on the back end of every task.

### Why does “live” matter so much?

Because agent usefulness shifts with the work people actually want done. A static benchmark can turn into a museum piece fast. If the benchmark keeps refreshing from public demand, the score starts to mean something closer to “can this help with current workflows?” rather than “can this imitate last year’s benchmark style?” That is the real pitch here.

### What did they test?

This first release has 105 tasks across 17 families, run against 13 frontier models. The tasks span controlled business services and local workspace repair. That mix is important: some jobs are about interacting with service state across systems, while others are more like fixing or transforming files in a workspace. Those are different failure modes, and the benchmark tries to separate them.

### So what is the headline result?

The top-line result is worse than a lot of agent hype would suggest. The paper says the leading model passes only 66.7% of tasks, and no model reaches 70%. In plain English, even the best system in this test still fails about one out of every three real workflow tasks. That is not catastrophic for experimentation. But it is nowhere near “production-ready.”

### Where do the models break?

Not evenly. The paper says failures cluster in HR, management, and multi-system business workflows, while local workspace repair is easier but still not solved. That pattern makes intuitive sense. Editing or repairing a local workspace is like working at one desk. Multi-system workflows are more like coordinating three desks and two inboxes: the switching burden is higher, and small mistakes compound.

### Why are the scores more believable?

Because the grading watches the work. Claw‑Eval‑Live records execution traces, audit logs, service state, and final workspace artifacts. Deterministic checks handle cases where the evidence is clear, and structured LLM judging is reserved for more semantic outputs. So the benchmark is less vulnerable to the classic agent trick of sounding right while doing the wrong thing.

### What’s the bottom line?

This is a reality check for the agent wave. Models are getting good enough to look competent across long workflows, but this benchmark says reliable completion is still the bottleneck. If your mental model was “agents are basically solved,” Claw‑Eval‑Live is a pretty direct correction.
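
For readers who want to picture the grading split described above (deterministic checks where the evidence is clear, LLM judging for semantic outputs), here is a minimal Python sketch. It is not the benchmark's actual code: `TaskEvidence`, `Task`, `grade`, `pass_rate`, and the stand-in judge are all hypothetical names invented for illustration. The 70-of-105 figure in the demo is also an inference. It is simply the task count that matches the reported 66.7% if the score is a plain fraction of tasks passed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class TaskEvidence:
    """Hypothetical record of what a run left behind (not the paper's schema)."""
    service_state: dict = field(default_factory=dict)
    artifacts: dict = field(default_factory=dict)  # filename -> text content


@dataclass
class Task:
    """Hypothetical task: either a deterministic check or a rubric for a judge."""
    task_id: str
    deterministic_check: Optional[Callable[[TaskEvidence], bool]] = None
    rubric: Optional[str] = None


def grade(task: Task, evidence: TaskEvidence,
          llm_judge: Callable[[str, TaskEvidence], bool]) -> bool:
    # Prefer the deterministic check whenever the task defines one.
    if task.deterministic_check is not None:
        return task.deterministic_check(evidence)
    # Otherwise fall back to a judge callable that scores against the rubric.
    return llm_judge(task.rubric or "", evidence)


def pass_rate(results: list) -> float:
    """Percentage of tasks passed, assuming one pass/fail result per task."""
    return 100.0 * sum(results) / len(results)


# Demo with made-up data: one state-based task, one semantic task.
evidence = TaskEvidence(
    service_state={"ticket_42": "closed"},
    artifacts={"summary.txt": "Refund issued and ticket closed."},
)
state_task = Task(
    "close-ticket",
    deterministic_check=lambda e: e.service_state.get("ticket_42") == "closed",
)
semantic_task = Task("write-summary", rubric="Summary mentions the refund.")


def fake_judge(rubric: str, e: TaskEvidence) -> bool:
    # Stand-in for a real LLM call; here it is just a keyword match.
    return "refund" in e.artifacts.get("summary.txt", "").lower()


print(grade(state_task, evidence, fake_judge))
print(grade(semantic_task, evidence, fake_judge))
print(f"{pass_rate([True] * 70 + [False] * 35):.1f}%")  # 70 of 105 is an assumption
```

The point of the split is that the judge is only consulted when no state-based check exists, so an agent cannot pass a state-changing task just by writing a convincing final message.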
