AI Agents Fail Reliability Tests
A new battery of reliability tests from Princeton researchers found that most AI vendors don’t benchmark their agentic systems for dependability, a growing concern as agents move into mission-critical workflows. The findings put reliability and auditable behavior front and center for buyers who will integrate AI with real-world location and operations. (fortune.com)
Princeton’s team published a 66-page study titled “Towards a Science of AI Agent Reliability,” with authors Rabanser, Sayash Kapoor, Kirgis, Liu, Utpala and Arvind Narayanan listed on the paper. (swept.ai) The project’s HAL reliability dashboard evaluated 14 agentic systems and reports per-agent accuracy and multi-metric reliability scores. The top entry, Gemini 3.0 Pro, shows ~80.8% accuracy and a reliability score of 0.85 on the dashboard. (hal.cs.princeton.edu)
HAL breaks reliability into four explicit operational dimensions (consistency, predictability, robustness and safety) and implements 12 concrete metrics to measure those dimensions across two benchmarks. (hal.cs.princeton.edu) The researchers’ trend analysis shows benchmark accuracy rising while reliability measures improve far more slowly: as of early 2026, accuracy gains are not matched by proportional reliability gains. (aiproductivity.ai)
The leaderboard tested major models from Google (Gemini variants), OpenAI (GPT-5.2 and GPT-4 Turbo variants) and Anthropic (Claude variants), and only a small subset of those 14 agents reached reliability scores at or above ~0.85. (hal.cs.princeton.edu) Princeton has open-sourced the HAL evaluation infrastructure, enabling third-party, cost-aware, repeatable reliability audits of agentic systems, and notes that the HAL paper was accepted to ICLR 2026. (hal.cs.princeton.edu) To ground the metrics, the study maps agent reliability criteria to safety-critical engineering practices drawn from aviation, nuclear and automotive domains and encodes them in the 12 operational metrics reported on the HAL dashboard. (swept.ai)
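To make the gap between accuracy and reliability concrete, here is a minimal Python sketch. It is not HAL's code and does not reproduce any of its 12 metrics. The function names, the "solved on every run" consistency proxy, and the outcome data are all hypothetical, chosen only to show how two agents with identical accuracy can differ sharply in run-to-run consistency.

```python
# Illustrative only: hypothetical metrics and made-up outcomes, not HAL's implementation.

def accuracy(runs):
    """Share of successful attempts across all tasks and all repeated runs."""
    attempts = sum(len(task) for task in runs)
    successes = sum(sum(task) for task in runs)
    return successes / attempts


def all_runs_success_rate(runs):
    """Share of tasks solved on every repeated run (a strict consistency proxy)."""
    return sum(all(task) for task in runs) / len(runs)


# Each inner list holds one task's outcomes over 4 repeated runs (True = solved).
steady_agent = [[True] * 4, [True] * 4, [True] * 4, [False] * 4]
flaky_agent = [[True, True, True, False] for _ in range(4)]

for name, runs in [("steady", steady_agent), ("flaky", flaky_agent)]:
    print(name, accuracy(runs), all_runs_success_rate(runs))
```

Both hypothetical agents succeed on 12 of 16 attempts, yet the steady agent solves three of its four tasks on every run while the flaky agent solves none of them every time. An accuracy-only leaderboard would rank the two identically.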