Kilocode survey: 68% of agents stop within 10 steps

- Berkeley, Stanford, IBM Research, and others released “Measuring Agents in Production,” based on 20 case studies and a 306-practitioner survey across 26 domains. (arxiv.org)
- The number everyone latched onto is 68%: most production agents take 10 steps or fewer before a human steps in. Another 70% use prompting, not fine-tuning. (arxiv.org)
- That matters because the field’s center of gravity is shifting from full autonomy toward bounded workflows, human checkpoints, and runtime oversight. (arxiv.org)

AI agents in production look a lot less like “set it loose and come back later” than the hype suggests. The new paper making the rounds, “Measuring Agents in Production,” is useful because it stops guessing and asks the people actually shipping these systems. (arxiv.org)

And the picture is pretty clear. Real teams are not betting on long, free-running autonomy. They are building short-horizon systems with tight guardrails and frequent human intervention.

### What changed here?

A research group spanning UC Berkeley, Stanford, IBM Research, UIUC, and industry partners published one of the first broad empirical looks at deployed agents. (arxiv.org) They combined 20 in-depth case studies with a survey of 306 practitioners across 26 domains, then asked a simple question: what actually works once an agent leaves the demo stage?

### Why is 10 steps the big number?

Because it kills the default fantasy. In the study, 68% of production agents execute at most 10 steps before requiring human intervention. That means the median successful pattern is not “autonomous coworker for hours.” It is more like “specialized assistant that does a bounded chunk, then hands control back.” (arxiv.org)

### Why are teams keeping agents on a short leash?

Reliability is the main reason. The paper says it is the top development challenge, and that tracks with what practitioners already feel: one bad tool call can be expensive, unsafe, or just annoying. (arxiv.org) So teams are solving the problem at the systems layer instead of assuming the model will reason its way out of trouble.

### What does “systems layer” mean in practice?

Basically, constrain the action space. Use prompting, narrow tool access, structured workflows, and human review points. The same study says 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. (arxiv.org) That is a very pragmatic stack: don’t retrain first, wrap first.

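
To make “constrain the action space” concrete, here is a minimal sketch of a bounded agent loop. It is not taken from the paper: `run_bounded`, `Action`, `RunResult`, the `reversible` flag, and the `MAX_STEPS = 10` budget are all hypothetical names and values chosen for illustration, and the `propose` callback stands in for whatever model call a real system would make.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_STEPS = 10  # assumed budget, echoing the survey's "at most 10 steps" figure


@dataclass
class Action:
    tool: str
    args: dict
    reversible: bool = True  # hypothetical flag set by whoever defines the action


@dataclass
class RunResult:
    status: str  # "done" or "needs_human"
    reason: str
    steps: list = field(default_factory=list)


def run_bounded(
    propose: Callable[[list], Optional[Action]],
    tools: dict,
    max_steps: int = MAX_STEPS,
) -> RunResult:
    """Run an agent loop that hands control back instead of running freely.

    propose(history) returns the next Action, or None when the agent is done.
    tools maps each allowed tool name to a callable (the allowlist).
    """
    history = []
    for _ in range(max_steps):
        action = propose(history)
        if action is None:
            return RunResult("done", "agent finished", history)
        if action.tool not in tools:
            # Narrow tool access: anything outside the allowlist goes to a human.
            return RunResult("needs_human", f"tool not allowed: {action.tool}", history)
        if not action.reversible:
            # Human review point: irreversible actions are never auto-executed.
            return RunResult("needs_human", f"irreversible action: {action.tool}", history)
        output = tools[action.tool](**action.args)
        history.append((action, output))
    return RunResult("needs_human", "step budget exhausted", history)


# Usage with a scripted stand-in for the model.
def scripted(history):
    plan = [
        Action("search", {"query": "refund policy"}),
        Action("send_email", {"to": "customer"}, reversible=False),
    ]
    return plan[len(history)] if len(history) < len(plan) else None


result = run_bounded(
    scripted,
    {"search": lambda query: f"results for {query}", "send_email": lambda to: "sent"},
)
print(result.status, "-", result.reason)  # needs_human - irreversible action: send_email
```

The design choice here is that every exit path other than a clean finish returns `needs_human` with a reason, so the handoff is the default behavior and not an exception path.
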
### So are people overbuilding autonomy?

Often, yes. The lesson is not that long-horizon agents never work. The lesson is that most production value seems to come from phased autonomy. Let the agent gather context, draft a plan, maybe execute a few reversible actions, then stop and ask. (arxiv.org) That pattern is showing up elsewhere too. Anthropic’s February 18, 2026 autonomy analysis found that oversight in real usage still matters a lot, even as users grant more autonomy over time.

### Where does feedback fit?

This is the operational piece people miss. If an agent fails, “thumbs down” is not enough. (arxiv.org) You want feedback attached to the exact run: prompt, tools, intermediate steps, outputs, and final action. Otherwise you cannot tell whether the problem was planning, retrieval, tool choice, bad arguments, or a policy breach.

That is why teams are getting more serious about trace-level evals and run-level labeling. This part is an inference from the paper’s emphasis on human evaluation and systems-level reliability, but it fits the implementation pattern practitioners are converging on. (anthropic.com)

### Why use LLM-as-judge before retraining?

Because evaluation is cheaper than blind tuning. If you can classify failures first (wrong tool, unsafe tool call, missing clarification, bad final answer), you can fix the wrapper, policy, or prompt before touching weights. There is also a growing body of work showing tool-call safety needs explicit checking, not just text-level refusal behavior. In other words, an agent can sound compliant and still do the wrong thing with tools.

### What’s the bottom line?

The headline number is 68%, but the deeper point is simpler: production agents are being engineered more like careful workflows than autonomous employees. (arxiv.org) The teams winning right now seem to be the ones closing the loop fast: small pre-prod evals, run-level feedback, and bounded autonomy by default. (arxiv.org)
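
As a rough illustration of the run-level feedback and failure classification described above, the sketch below attaches a label to a full run record instead of a bare thumbs down. The four failure categories come from the text; everything else (`RunTrace`, `Step`, `label_run`, `failure_counts`, and the sample runs) is hypothetical and only meant to show the shape of the data, whether the label comes from a human reviewer or an LLM judge.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    WRONG_TOOL = "wrong_tool"
    UNSAFE_TOOL_CALL = "unsafe_tool_call"
    MISSING_CLARIFICATION = "missing_clarification"
    BAD_FINAL_ANSWER = "bad_final_answer"


@dataclass
class Step:
    tool: str
    args: dict
    output: str


@dataclass
class RunTrace:
    run_id: str
    prompt: str
    steps: list
    final_output: str
    label: Optional[Failure] = None    # None means no failure was recorded
    failed_step: Optional[int] = None  # index into steps, when one step is at fault
    note: str = ""


def label_run(trace, failure, failed_step=None, note=""):
    """Attach reviewer (or judge) feedback to one specific run."""
    trace.label = failure
    trace.failed_step = failed_step
    trace.note = note
    return trace


def failure_counts(traces):
    """Tally labeled failures so the most common category gets fixed first."""
    return Counter(t.label.value for t in traces if t.label is not None)


# Usage with made-up runs.
traces = [
    RunTrace("run-1", "Refund order 123",
             [Step("lookup_order", {"order_id": "123"}, "found")], "Refund issued"),
    RunTrace("run-2", "Cancel my plan",
             [Step("search_docs", {"query": "cancel"}, "3 hits")], "Here is an article"),
    RunTrace("run-3", "Update address",
             [Step("search_docs", {"query": "address"}, "1 hit")], "Done"),
]
label_run(traces[1], Failure.WRONG_TOOL, failed_step=0, note="should have used a billing tool")
label_run(traces[2], Failure.WRONG_TOOL, failed_step=0)
counts = failure_counts(traces)
print(counts.most_common(1))  # [('wrong_tool', 2)]
```

Because each label points at a run and, where possible, a step, a cluster of `wrong_tool` labels suggests a prompt or tool-description fix before anyone considers retraining.
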
