Open-sourced LLM-as-a-Verifier scores 86.4% on Terminal-Bench 2

- Stanford and UC Berkeley researchers open-sourced LLM-as-a-Verifier, a framework that reranks agent trajectories and reported new open results on Terminal-Bench 2 and SWE-bench Verified.
- The released code claims 86.4% on Terminal-Bench 2 and 77.8% on SWE-bench Verified, beating the included pass@1 baselines of 81.8% and 76.1%.
- It matters because agent gains are increasingly coming from test-time verification and selection, not just bigger base models. (github.com)

Coding agents are getting better, but a lot of the remaining failures are dumb in a very specific way. The model often produces several plausible trajectories, then picks the wrong one, or finishes with an answer that sounds fine but quietly breaks a requirement. That is the gap this new work is trying to close. Stanford and UC Berkeley researchers just open-sourced LLM-as-a-Verifier, a framework that scores candidate trajectories more carefully and then uses that score as a test-time reward signal. (github.com)

### What is the actual news?

The concrete update is the release itself. The GitHub repo for LLM-as-a-Verifier went public about three weeks ago, with scripts, cached results, and trajectory data for Terminal-Bench 2 and SWE-bench Verified. The team lists Jacky Kwok, Shulu Li, Pranav Atreya, Yuejiang Liu, Marco Pavone, Ion Stoica, and Azalia Mirhoseini on the project page. (github.com)

### What does a verifier do here?

[...] scratch. It is a scoring layer that looks at completed trajectories and asks how good they really were. The repo frames this as moving beyond a single discrete judge score by using finer scoring granularity, repeated verification passes, and decomposed criteria. In plain English: instead of “good or bad,” it asks several narrower questions, several times, and aggregates them. (github.com)

### Why is that better than one-shot judging?

Because one-shot judging throws away information. If two candidate runs are both imperfect, a coarse judge can flatten them into the same bucket. The verifier tries to preserve more of the probability distribution over quality, then rank trajectories with a reward-style score. That matters most when the generator already has some good attempts in the pool and the real problem is selection. The repo’s own oracle numbers make that [...] and the verifier’s 86.4%. (github.com)

### What are Terminal-Bench and SWE-bench measuring?

They are hard agent benchmarks, but for slightly different jobs.
Terminal-Bench 2 is a set of 89 realistic command-line tasks with unique environments and test-based verification. SWE-bench evaluates whether a model can fix real GitHub issues by generating patches that pass repository tests inside Dockerized environments. So these are not trivia quizzes; they are “go use tools, touch files, and make the thing work” benchmarks. (arxiv.org)

### How big are the gains?

On the released runs, the jump is real but not magical. For Terminal-Bench 2, the included Forge + GPT-5.4 trajectories have an 81.8% pass@1 baseline, while LLM-as-a-Verifier reports 86.4%. For SWE-bench Verified, the listed pass@1 baseline is 76.1%, and the verifier reaches 77.8%. That is a 4.6-point gain on Terminal-Bench against 1.7 points on SWE-bench, which suggests verification helps most when trajectories contain richer procedural failure modes. (github.com)

### Why does Terminal-Bench matter so much here?

Because terminal tasks expose a common agent weakness: the model can look competent while drifting procedurally. It may install the wrong thing, skip a hidden requirement, or leave the environment in a bad state. Terminal-Bench was built specifically to be harder and more realistic than earlier agent benchmarks, and its authors said frontier models and agents were below 65% in the benchmark paper’s initial evaluations. A verifier layer is almost tailor-made for that kind of setup. (arxiv.org)

### Is this about better models or more compute?

More compute at inference time, but targeted compute. The framework is explicitly a test-time scaling method. Instead of spending all the budget on generating one longer chain of thought, it spends budget on checking and reranking multiple finished attempts. That is becoming a broader pattern in agent work: reliability gains are coming from search, verification, and selection loops wrapped around strong models. (github.com)

### Bottom line

The interesting part is not just the 86.4%. It is the shape of the improvement.
Open agent systems are starting to win extra accuracy by treating verification as its own problem, not as a footnote after generation. If that pattern holds, the next jump in coding agents may come less from a smarter first guess and more from a smarter second look. (github.com)
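
To make the "several narrower questions, several times, then aggregate" idea concrete, here is a minimal Python sketch of verifier-based reranking. It is an illustration under stated assumptions, not the repo's actual code: the `Judge` signature, `verifier_score`, `select_best`, and the keyword-matching `toy_judge` are all hypothetical names, and plain averaging stands in for whatever aggregation the released framework really uses.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical judge interface (an assumption, not the repo's API): given a
# trajectory and one criterion, return a probability distribution over grades.
Judge = Callable[[str, str], Dict[int, float]]


def expected_grade(dist: Dict[int, float]) -> float:
    """Collapse a grade distribution into its expected value.

    Using the whole distribution keeps more information than taking only
    the single most likely grade.
    """
    total = sum(dist.values())
    return sum(grade * p for grade, p in dist.items()) / total


def verifier_score(trajectory: str, criteria: Sequence[str],
                   judge: Judge, passes: int = 3) -> float:
    """Score one finished trajectory: repeat each criterion check, then average."""
    per_criterion = []
    for criterion in criteria:
        runs = [expected_grade(judge(trajectory, criterion)) for _ in range(passes)]
        per_criterion.append(mean(runs))
    return mean(per_criterion)


def select_best(trajectories: Sequence[str], criteria: Sequence[str],
                judge: Judge, passes: int = 3) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring trajectory plus all scores."""
    scores = [verifier_score(t, criteria, judge, passes) for t in trajectories]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


def toy_judge(trajectory: str, criterion: str) -> Dict[int, float]:
    """Stand-in for an LLM call: favors high grades if the criterion text appears."""
    if criterion in trajectory:
        return {1: 0.1, 3: 0.2, 5: 0.7}
    return {1: 0.6, 3: 0.3, 5: 0.1}


# Made-up candidate trajectories and criteria, purely for illustration.
candidates = [
    "installed deps; tests pass",
    "installed deps; tests pass; cleanup done",
    "wrote patch",
]
criteria = ["tests pass", "cleanup"]

best, scores = select_best(candidates, criteria, toy_judge)
print(best, scores)
```

In a real setup the judge would be a model call and the repeated passes would return different distributions, which is why averaging over them is useful. The toy judge here is deterministic, so the repeats only demonstrate the structure of the loop.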
