233 AI incidents reported
Stanford HAI reported 233 AI incidents in 2024, a 56% year‑on‑year increase, and warned that many benchmarks are not keeping up with real‑world failures. (Stanford HAI) (x.com)
Reports of real-world AI failures kept climbing in 2024, even as the systems posting bigger benchmark gains spread further into daily life. (hai.stanford.edu)
Stanford University’s Institute for Human-Centered Artificial Intelligence said the AI Incidents Database logged 233 reported AI-related incidents in 2024, up 56.4% from 2023 and the highest annual count in the dataset. (hai.stanford.edu)
The same Stanford report said responsible-AI testing is still not routine, even though newer safety checks such as HELM Safety and AIR-Bench are starting to appear. It also said older truthfulness tests such as HaluEval and TruthfulQA never became widely used. (hai.stanford.edu)
Benchmarks are lab tests for models, like exams scored against fixed questions. Stanford said many of the newer tests are trying to measure things that show up outside the lab, including hallucinations, safety failures, and factual errors. (hai.stanford.edu)
That gap opened as AI moved deeper into consumer and business use. Stanford said the U.S. Food and Drug Administration approved 223 AI-enabled medical devices in 2023, and Waymo was providing more than 150,000 autonomous rides a week when the 2025 AI Index was published on April 7, 2025. (hai.stanford.edu)
Stanford’s chart summary pointed to the kinds of cases showing up in the incident count, including deepfake intimate images and chatbots allegedly implicated in a teenager’s suicide. The report said the incident database is not comprehensive, but it still showed a sharp rise in documented harm. (hai.stanford.edu)
The incident database itself describes its job as indexing the history of harms or near harms tied to deployed AI systems, not just lab mistakes or model demos. Its public incident list now ranges from deepfakes to court filings that allegedly bore the hallmarks of hallucinated legal citations. (incidentdatabase.ai)
Stanford paired the incident count with a second warning: companies say they worry about AI risks, but many are still not taking matching mitigation steps. In the report’s McKinsey survey snapshot, 64% of respondents cited inaccuracy as a concern, 63% cited regulatory compliance, and 60% cited cybersecurity. (hai.stanford.edu)
The same report showed why the pressure is rising. Stanford said performance on major benchmarks such as MMMU, GPQA, and SWE-bench jumped by 18.8, 48.9, and 67.3 percentage points, respectively, in a single year. (hai.stanford.edu)
Stanford’s bottom line was not that benchmarks are useless. It was that faster-rising scores, wider deployment, and a record 233 incident reports are now moving together, which leaves safety testing trying to catch up in public. (hai.stanford.edu)