Stanford/Berkeley top verifier benchmarks

- Stanford and UC Berkeley researchers open-sourced LLM-as-a-Verifier on April 9, turning answer checking into a scoring system that reranks agent trajectories.
- The headline numbers are 86.4% on Terminal-Bench 2 and 77.8% on SWE-Bench Verified, beating the underlying pass@1 baselines they rerank.
- It matters because verifier compute now looks like a separate scaling lever for coding agents, not just better base models.

Coding benchmarks have a weird bottleneck. The agent has often already produced a good trajectory somewhere among its sampled attempts, but the system still picks the wrong one. The limiting factor is therefore not always raw generation; sometimes it is selection. That is the gap Stanford and UC Berkeley are going after with LLM-as-a-Verifier, a framework they posted on April 9 that judges candidate solutions more carefully and uses that judgment to choose better outputs.

### What is the actual idea?

The basic move is simple: stop treating verification as a one-shot thumbs-up or a single score token. Instead, the verifier breaks evaluation into finer criteria, repeats the check multiple times, and aggregates the whole score distribution rather than collapsing everything into one coarse label. The repo describes this as scaling scoring granularity, repeated verification, and criteria decomposition.

### Why does that help?

Because ordinary “LLM-as-a-judge” setups tie too often on hard tasks. If two long agent traces both get a middling score, the judge has not really helped you choose. The project page says coarse scoring produced ties on 27% of Terminal-Bench comparisons. Their claim is that a more fine-grained verifier can separate “kind of right” from “actually better,” which is exactly what reranking needs.

### What did they test it on?

They tested on two benchmarks that people in coding-agent land actually care about. Terminal-Bench 2.0 is a hard set of 89 terminal tasks: realistic command-line workflows, each with its own environment and tests. SWE-Bench Verified is the GitHub-issue benchmark for software engineering agents, where the model has to patch real repositories and pass the repo’s tests.

### What were the results?

The headline result is state-of-the-art on both. On Terminal-Bench 2, the verifier-reranked setup reached 86.4%, versus an 81.8% pass@1 baseline from Forge + GPT-5.4 trajectories.
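To make the tie-breaking argument concrete, here is a minimal sketch of reranking by fine-grained verifier scores. It is not the released code: the function names, the criteria names, the score tokens, and the hard-coded log-probabilities are all hypothetical, and taking the expected value of the score distribution is only one assumed way of "aggregating the whole distribution."

```python
import math
from statistics import mean

def expected_score(logprobs):
    """Expected score under the verifier's distribution over score tokens.

    `logprobs` maps a discrete score (e.g. 2, 3, 4) to its log-probability.
    We renormalise because only the top-k tokens may be available.
    """
    top = max(logprobs.values())
    weights = {score: math.exp(lp - top) for score, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(score * w for score, w in weights.items()) / total

def trajectory_score(checks):
    """Average over repeated checks per criterion, then over criteria.

    `checks` maps a criterion name to a list of logprob dicts,
    one dict per repeated verification.
    """
    per_criterion = [
        mean(expected_score(lp) for lp in repeats) for repeats in checks.values()
    ]
    return mean(per_criterion)

def coarse_score(checks):
    """Baseline: keep only the single most likely score token per check."""
    return mean(
        max(lp, key=lp.get) for repeats in checks.values() for lp in repeats
    )

def rerank(candidates):
    """Return the name of the trajectory with the highest fine-grained score."""
    return max(candidates, key=lambda name: trajectory_score(candidates[name]))

def dist(**probs):
    """Helper: build a logprob dict from probabilities, e.g. dist(s3=0.6, s4=0.4)."""
    return {int(key[1:]): math.log(p) for key, p in probs.items()}

# Hypothetical verifier outputs for two sampled trajectories of one task.
candidates = {
    "traj_a": {
        "task_completed": [dist(s2=0.3, s3=0.6, s4=0.1), dist(s2=0.2, s3=0.6, s4=0.2)],
        "no_regressions": [dist(s2=0.1, s3=0.8, s4=0.1)],
    },
    "traj_b": {
        "task_completed": [dist(s2=0.1, s3=0.6, s4=0.3), dist(s2=0.1, s3=0.5, s4=0.4)],
        "no_regressions": [dist(s3=0.7, s4=0.3)],
    },
}

if __name__ == "__main__":
    for name, checks in candidates.items():
        print(name, "coarse:", coarse_score(checks), "fine:", trajectory_score(checks))
    print("selected:", rerank(candidates))
```

In this toy data every check's most likely token is the same for both trajectories, so the coarse judge ties, while the distribution-aware score still prefers one of them. A real setup would obtain the log-probabilities from the verifier model rather than hard-coding them.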
On SWE-Bench Verified, the reranked setup reached 77.8%, versus a 76.1% pass@1 baseline across the sampled runs in their evaluation setup. They also show the ceiling is still higher: an oracle that always picks a passing trajectory when one exists would succeed on every Terminal-Bench task and on 84.4% of SWE-Bench Verified.

### So is this “better model” progress?

Not exactly. It is more like better selection on top of existing model outputs. The project page explicitly frames the method as a trajectory reward model for test-time scaling. In plain English, you spend extra compute after generation to inspect multiple candidate paths and choose more intelligently among them. That matters because the gain comes from a better verification loop, not from a better base model.

### What is the catch?

The catch is cost and setup. Their released code uses Gemini 2.5 Flash as the verifier and depends on logprob extraction through Vertex AI. It also assumes you already have multiple trajectories per task: five per Terminal-Bench task in their example data, and three per SWE-Bench instance. So this is not free accuracy. You are paying with extra samples, extra judging, or both.

### Why are people paying attention?

Because the benchmarks involved are becoming proxies for real engineering work. Terminal-Bench
