Claude Managed Agents gain outcomes-based grading to measure agent performance
- Anthropic added “outcomes” to Claude Managed Agents on May 6, letting developers define success with rubrics and have agents iterate until they pass.
- The key detail is the grading loop: a separate grader checks outputs in its own context window, and Anthropic says success rose by up to 10 points.
- This matters because agent evals are shifting from vague prompt tuning to explicit, automatable quality bars teams can track across real workflows.

Agent evals are the boring part nobody wants to talk about, until an agent starts doing real work and nobody can tell whether it did a good job. That is the gap Anthropic is trying to close with its new “outcomes” feature for Claude Managed Agents, released on May 6. The basic idea is simple: instead of hoping a prompt nudges an agent toward quality, you write down what “good” means and let a grader enforce it. That turns agent behavior from vibes into something much closer to a test.

### What actually shipped?

Anthropic bundled three Managed Agents updates together (dreaming, outcomes, and multiagent orchestration), but outcomes is the piece about measurement. Developers can now define a rubric for a task, run an agent against that rubric, and have the system keep checking whether the result meets the bar. Anthropic also tied this into webhooks, so teams can launch a run and get notified when the job is done.

### What is an “outcome” here?

An outcome is a structured definition of success. Anthropic’s examples are concrete: a presentation standard, a structural framework, a requirements list, brand voice, or visual guidelines. Instead of asking the model once and eyeballing the answer, you tell the system what the finished work must satisfy. The agent then works toward that target rather than just toward the next token. (claude.com)

### Why use a separate grader?

Because letting the same model both do the work and judge the work can get mushy fast. Anthropic says the grader runs in its own context window, separate from the agent’s reasoning, so the check is less likely to be contaminated by the agent’s own chain of thought or self-justification. If the output misses the mark, the grader identifies what failed, and the agent takes another pass. That is the important design choice here: the system is built to revise, not just produce. (claude.com)

### Why is that better than prompt engineering?
Prompting can tell an agent to “be thorough” or “follow these standards,” but that still leaves success fuzzy. Outcomes makes the quality bar explicit and machine-checkable. Anthropic has been arguing for months that agent evals need tasks, trials, and graders, not just one-off demos, because agents operate across many turns, use tools, and can fail in ways that compound over time. Outcomes is that eval philosophy turned into a product feature. (claude.com)

### Did Anthropic show any gains?

Yes, though they are Anthropic’s internal numbers, so treat them as product-benchmark evidence, not a universal law. The company says outcomes improved task success by up to 10 points over a standard prompting loop, with the biggest gains on harder problems. It also reported better file generation quality, including +8.4% task success on docx and +10.1% on pptx in internal benchmarks. (anthropic.com)

### Where does this fit in Managed Agents?

Managed Agents is Anthropic’s hosted service for long-running agent work. The pitch is that Anthropic handles the harness (sessions, tool routing, sandboxing, orchestration) while developers focus on the task. Outcomes adds a quality-control layer on top of that stack. So the platform is no longer just “run an agent for me.” It is moving toward “run an agent, remember what it learned, coordinate subagents, and prove the output cleared a bar.” (claude.com)

### What is the real significance?

The shift is from generation to accountability. If agents are going to write decks, produce documents, or complete long workflows without a human checking every step, teams need pass/fail logic they can trust. Outcomes does not solve the whole eval problem: rubrics can still be incomplete, and graders can still miss edge cases. But it gives builders a cleaner unit of control. That is a big step toward treating agents less like chatbots and more like software systems with test coverage. (anthropic.com)

### Bottom line?
Anthropic is turning agent quality into something you can specify, score, and automate. That sounds small, but it turns out to be one of the missing pieces for making agents usable in production. (claude.com) (anthropic.com)
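
For readers who want to see the shape of the grading loop described above, here is a minimal Python sketch. It is an illustrative toy, not Anthropic's API: `Criterion`, `Grade`, `grade`, `run_until_pass`, and `toy_agent` are hypothetical names, the rubric checks are plain string predicates standing in for a model-based grader, and the attempt cap is an assumption. What it tries to show is the separation of roles: the grader receives only the output and the rubric, and the agent receives only the names of the criteria it failed.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical shape, not Anthropic's schema)."""
    name: str
    check: Callable[[str], bool]  # True when the output satisfies the item


@dataclass(frozen=True)
class Grade:
    passed: bool
    failures: List[str]  # names of the criteria that were not met


def grade(output: str, rubric: List[Criterion]) -> Grade:
    """Grader: sees only the output and the rubric, never the agent's reasoning."""
    failures = [c.name for c in rubric if not c.check(output)]
    return Grade(passed=not failures, failures=failures)


# An agent here is any callable mapping (task, feedback) to an output string.
Agent = Callable[[str, List[str]], str]


def run_until_pass(agent: Agent, task: str, rubric: List[Criterion],
                   max_attempts: int = 3) -> Tuple[str, int, bool]:
    """Revise until the rubric passes or the (assumed) attempt cap is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: List[str] = []
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = agent(task, feedback)
        result = grade(output, rubric)
        if result.passed:
            return output, attempt, True
        feedback = result.failures  # only the failed criteria go back to the agent
    return output, max_attempts, False


def toy_agent(task: str, feedback: List[str]) -> str:
    """Deterministic stand-in for a model call: it fixes what the grader flagged."""
    draft = f"Draft for: {task}"
    if "has_summary" in feedback:
        draft += "\nSummary: key points."
    return draft


RUBRIC = [
    Criterion("mentions_task", lambda out: "Q3 deck" in out),
    Criterion("has_summary", lambda out: "Summary:" in out),
]

if __name__ == "__main__":
    output, attempts, passed = run_until_pass(toy_agent, "Q3 deck", RUBRIC)
    print(passed, attempts)
```

In this toy run the first draft lacks a summary, the grader reports `has_summary` as the only failure, and the second draft passes. A real rubric would be far less mechanical than substring checks, which is exactly where incomplete rubrics and missed edge cases come in.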