AgentRx: a debugging scalpel for agents
Microsoft Research’s AgentRx surfaced as a targeted debugging tool that pinpoints first-fault signals, categorizes semantic errors such as “Information Invention,” and detects hallucinations, with a claimed ~23% higher accuracy versus manual audits. It is explicitly billed as a “trauma surgeon” for agent workflows, useful for complex multi-tool orchestrations where the first bad step breaks downstream logic.
Microsoft Research published the AgentRx framework on March 12, 2026. (microsoft.com) The release includes the AgentRx Benchmark of 115 manually annotated failed agent trajectories drawn from Tau-bench, Flash, and Magentic-One. (microsoft.com) The paper and blog report quantitative gains of +23.6% in step-localization accuracy and +22.9% in root-cause attribution versus prompting baselines. (microsoft.com) AgentRx’s pipeline programmatically synthesizes guarded, executable invariants from tool schemas and domain policies, evaluates them per-step to produce auditable violation logs, and hands that evidence to an LLM “judge” for final localization and classification. (arxiv.org) Public materials show a small taxonomy discrepancy: the Microsoft Research blog describes a grounded nine-category failure taxonomy, while the open-source repository and judge code reference a 10-category taxonomy. (microsoft.com) The project is open-source on GitHub under an MIT-style workflow with a CLI to run individual stages (IR, static/dynamic invariant generation, checking, judge, reporting), and the benchmark is mirrored as a Hugging Face dataset for integration into platform pipelines. (github.com)
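To make the per-step checking idea concrete, here is a minimal Python sketch of guarded invariants evaluated over a trajectory to produce a violation log, from which the earliest violating step is read off. It is an illustration under stated assumptions, not AgentRx's actual code or API: the names Step, Invariant, Violation, check_trajectory, and first_fault, the example tools, and the refund policy limit are all hypothetical. The invariants are written by hand here rather than synthesized from schemas, and the LLM judge stage is omitted.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Step:
    """One tool call in an agent trajectory (hypothetical shape)."""
    index: int
    tool: str
    args: dict[str, Any]


@dataclass
class Invariant:
    """A guarded check: `check` is only evaluated when `guard` holds."""
    name: str
    guard: Callable[[Step], bool]  # does the invariant apply to this step?
    check: Callable[[Step], bool]  # True when the step satisfies it


@dataclass
class Violation:
    step_index: int
    invariant: str


def check_trajectory(steps: list[Step], invariants: list[Invariant]) -> list[Violation]:
    """Evaluate every applicable invariant at every step and log failures."""
    log: list[Violation] = []
    for step in steps:
        for inv in invariants:
            if inv.guard(step) and not inv.check(step):
                log.append(Violation(step.index, inv.name))
    return log


def first_fault(log: list[Violation]) -> Optional[int]:
    """Return the earliest violating step index, or None for a clean log."""
    return min((v.step_index for v in log), default=None)


# Hypothetical invariants, standing in for ones derived from a tool schema
# (required argument) and a domain policy (assumed refund cap of 500).
INVARIANTS = [
    Invariant(
        name="refund_requires_order_id",
        guard=lambda s: s.tool == "issue_refund",
        check=lambda s: bool(s.args.get("order_id")),
    ),
    Invariant(
        name="refund_amount_within_policy",
        guard=lambda s: s.tool == "issue_refund",
        check=lambda s: s.args.get("amount", 0) <= 500,
    ),
]

# Hypothetical failed trajectory: the refund call omits order_id and exceeds the cap.
TRAJECTORY = [
    Step(0, "lookup_order", {"order_id": "A1"}),
    Step(1, "issue_refund", {"amount": 900}),
    Step(2, "send_email", {"to": "customer"}),
]

violation_log = check_trajectory(TRAJECTORY, INVARIANTS)
for v in violation_log:
    print(f"step {v.step_index}: violated {v.invariant}")
print("first fault at step:", first_fault(violation_log))
```

In a pipeline like the one described, a log of this kind would be the auditable evidence passed to the judge; the guard keeps each invariant from firing on steps it does not apply to, which keeps the log focused on relevant violations.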