Princeton flags AI agent reliability

- Princeton researchers released “Towards a Science of AI Agent Reliability” in February, arguing that newer AI agents are getting more capable without becoming proportionally more dependable across repeated runs, perturbations, confidence, and safety.
- The paper evaluates 14 models on 12 metrics across two benchmarks and finds only small reliability gains, while Microsoft researchers reported frontier models corrupt about 25% of document content in long delegated workflows.
- The findings add evidence that benchmark accuracy alone misses operational failures in production agent systems, from prompt sensitivity to silent file corruption. (arxiv.org 1) (arxiv.org 2)

An AI agent is a model that does work in steps, like browsing, editing files, or calling tools instead of just answering once. Princeton researchers say the field still lacks a good way to measure whether those systems behave reliably. (arxiv.org) (citp.princeton.edu)

Their paper, “Towards a Science of AI Agent Reliability,” was posted to arXiv on February 18 and revised on February 23, 2026. The authors are Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, and Arvind Narayanan. (arxiv.org)

The group says standard agent benchmarks compress behavior into a single success score, which can hide whether a system repeats the same result, breaks under small changes, misstates its confidence, or causes outsized damage when it fails. They split reliability into four dimensions: consistency, robustness, predictability, and safety. (arxiv.org 1) (arxiv.org 2)

They tested 14 models across two benchmarks using 12 metrics. Their headline result is that capability has risen faster than reliability, with only small reliability improvements over roughly 18 months of model development. (arxiv.org) (hal.cs.princeton.edu)

The Princeton dashboard shows why that gap matters in practice. It says agents that can solve a task often cannot do so consistently, and that prompt robustness still varies sharply even when models handle infrastructure faults well. (hal.cs.princeton.edu 1) (hal.cs.princeton.edu 2)

One finding cuts against the usual assumption that bigger models are automatically steadier. Princeton says calibration, robustness, and safety often improve with scale, but consistency can lag, and smaller models sometimes match or exceed larger ones on repeatability. (hal.cs.princeton.edu)

A second paper from Microsoft Research pushes the same concern into document editing, where an agent is supposed to modify files over many turns without breaking them. That study, “LLMs Corrupt Your Documents When You Delegate,” was posted to arXiv in April 2026. (arxiv.org)

Microsoft’s team introduced a benchmark called DELEGATE-52, covering 52 professional domains and 310 work environments with documents of about 15,000 tokens and 5 to 10 editing tasks each. The authors say current models “degrade documents during delegation” instead of preserving them. (arxiv.org)

Their large-scale experiment covered 19 models, including Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4. The paper says even frontier systems corrupted an average of 25% of document content by the end of long workflows, and tool use did not improve results on DELEGATE-52. (arxiv.org)

The Microsoft paper also says the corruption gets worse as documents get larger, interactions get longer, or distractor files are added. The errors are described as sparse but severe, which means a file can look mostly intact while key parts are silently wrong or missing. (arxiv.org)

Taken together, the two papers describe the same production problem from different angles. One says benchmark wins do not guarantee stable agent behavior; the other says long delegated workflows can quietly damage the artifacts those agents are supposed to maintain. (arxiv.org 1) (arxiv.org 2)

Princeton’s answer is more measurement, not a single leaderboard score. Microsoft’s result points in the same direction: if agents are going to edit files, browse, and act over many steps, the systems around them need checks that catch failure before the damage compounds. (hal.cs.princeton.edu) (arxiv.org)
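
To make the "single score" point concrete, here is a minimal Python sketch of how one averaged success rate can hide inconsistency across repeated runs. It is an illustration only, not the metrics defined in the Princeton paper: the task names, the outcome data, and the three function names are all hypothetical.

```python
# Hypothetical outcomes: each task was attempted four times by the same agent.
# True means the run succeeded. This data is invented for illustration.
outcomes = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [True, True, False, True],
    "task_d": [False, True, True, True],
}


def success_rate(results):
    """Share of all individual runs that succeeded (the single headline score)."""
    total_runs = sum(len(runs) for runs in results.values())
    successes = sum(sum(runs) for runs in results.values())
    return successes / total_runs


def all_runs_rate(results):
    """Share of tasks the agent solved on every repeated run."""
    return sum(all(runs) for runs in results.values()) / len(results)


def any_run_rate(results):
    """Share of tasks the agent solved on at least one run."""
    return sum(any(runs) for runs in results.values()) / len(results)


print(f"success rate:      {success_rate(outcomes):.2%}")
print(f"solved every time: {all_runs_rate(outcomes):.2%}")
print(f"solved at least once: {any_run_rate(outcomes):.2%}")
```

On this made-up data the agent succeeds on about 81% of runs and can solve every task at least once, yet it solves only one task in four on every attempt. That is the kind of gap a single averaged score does not show.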

