Scale's HiL‑Bench Reveal

- Scale AI published HiL‑Bench, a human‑in‑the‑loop benchmark that tests agents' judgment about when and how to ask humans for help.
- Frontier models that solve up to 89% of tasks with full information fell as low as 4% under information gaps, while reinforcement learning produced gains that generalized across domains.
- The release includes a paper, a blog post, and a public leaderboard meant to push better evaluation of human‑agent coordination (x.com).

Scale AI published HiL-Bench on April 20, a new test for whether artificial intelligence agents know when to stop guessing and ask a human for missing information. (scale.com) The benchmark takes tasks from SWE-Bench Pro for software engineering and BIRD for text-to-SQL, then inserts missing, ambiguous, or contradictory details that only appear as the agent explores the task. It measures whether the agent asks a targeted clarifying question at the right moment through an `ask_human` tool. (scale.com; labs.scale.com)

Scale’s results show a sharp drop once that judgment is required: frontier models that solve 75% to 89% of tasks with full information recover only a fraction of that when they must decide for themselves whether to ask for help. In Scale’s public summary, performance peaks at 38% on SQL tasks and 12% on software engineering under those blocked conditions, with some runs falling as low as 4%. (scale.com; arxiv.org)

That gap comes from a basic production problem: benchmarks usually grade whether an agent finished the task, not whether it noticed that the instructions were incomplete. The HiL-Bench paper calls that skill “selective escalation,” meaning the model has to recognize an uncertainty it cannot resolve alone and surface it before it commits to a wrong answer. (arxiv.org)

HiL-Bench is built to make guessing harder to hide. The dataset has 300 tasks across software engineering and SQL, including 200 public tasks and 100 private held-out tasks, with 1,131 total blockers and an average of 3.8 blockers per task. (labs.scale.com) The paper introduces an “Ask-F1” score, a metric that balances two things at once: whether the agent asks precise questions and whether it catches the blockers that matter. The authors say that design is meant to prevent a model from gaming the benchmark by spamming broad requests for help. (arxiv.org)

Scale’s failure analysis says different model families miss in different ways. Its blog says GPT models often continue on a wrong assumption, Claude often recognizes uncertainty but still submits an answer, and Gemini asks more often but with questions too broad to get the needed detail. (scale.com)

The paper also reports that reinforcement learning improved this behavior on a 32-billion-parameter model. Ask-F1 rose by 28 percentage points on SQL and 17 points on software engineering, and the gains carried across domains instead of staying tied to one task type. (scale.com; arxiv.org)

Scale released the benchmark with a paper, dataset resources, and a public leaderboard through Scale Labs. The setup turns a familiar agent feature, the button to ask a user a question, into something that can be measured, compared, and trained. (labs.scale.com; arxiv.org)
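
To make the Ask-F1 idea concrete, here is a minimal Python sketch. It assumes the score is a standard F1: the harmonic mean of question precision (the share of questions asked that target a real blocker) and blocker recall (the share of blockers the agent surfaced). The paper's exact definition may differ, and the function name, parameters, and sample numbers are illustrative, not taken from HiL-Bench.

```python
def ask_f1(questions_asked: int, relevant_questions: int,
           blockers_total: int, blockers_caught: int) -> float:
    """Harmonic mean of question precision and blocker recall.

    Assumed definition for illustration; not the paper's official formula.
    """
    # Precision: how many of the questions actually targeted a blocker.
    precision = relevant_questions / questions_asked if questions_asked else 0.0
    # Recall: how many of the task's blockers the agent surfaced.
    recall = blockers_caught / blockers_total if blockers_total else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical agents on a task with 4 blockers.
silent = ask_f1(questions_asked=0, relevant_questions=0,
                blockers_total=4, blockers_caught=0)    # never asks
spammer = ask_f1(questions_asked=20, relevant_questions=4,
                 blockers_total=4, blockers_caught=4)   # asks about everything
targeted = ask_f1(questions_asked=4, relevant_questions=3,
                  blockers_total=4, blockers_caught=3)  # asks selectively

print(f"silent={silent:.2f} spammer={spammer:.2f} targeted={targeted:.2f}")
```

Under this assumed formula the spamming agent catches every blocker but still scores well below the targeted one, because its low precision drags the harmonic mean down. That is the anti-gaming property the authors describe.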
