Google’s Auto‑Diagnose for tests
Google’s Auto-Diagnose, an LLM-based tool built into Critique, the company’s internal code-review system, automatically summarises integration test failures from noisy logs. The paper reports about 90% root-cause accuracy on a 71-case sample and use on more than 52,000 distinct failing tests. The effort shows LLMs being used to speed triage in large observability and CI environments. (x.com)
Software test logs can read like black-box flight recorders: thousands of lines, scattered clues, no obvious cause. Google says an internal system now writes a short diagnosis for many failed integration tests inside code review. (arxiv.org)
Integration tests check whether multiple pieces of software work together, not just whether one function passes on its own. Google’s paper says these failures generate large, noisy, heterogeneous logs and that developers report spending substantially more time diagnosing them than unit-test failures. (arxiv.org)
The system, called Auto-Diagnose, uses a large language model to read failure logs, pull out the most relevant lines, and summarize the likely root cause. Google integrated it into Critique, the company’s internal code-review system, so the diagnosis appears where engineers are already reviewing changes. (arxiv.org)
In a manual evaluation on 71 real-world failures, the paper reports 90.14% accuracy in identifying the root cause. After deployment across Google, the authors say it was used on 52,635 distinct failing tests. (arxiv.org)
User feedback data in the paper shows Auto-Diagnose was marked “Not helpful” in 5.8% of cases. Among 370 tools that post findings in Critique, it ranked No. 14 on helpfulness, placing it in the top 3.78%. (arxiv.org)
The paper was posted in April 2026 and says it has been accepted to the IEEE/ACM International Conference on Software Engineering, or ICSE 2026. The authors are Celal Ziftci, Ray Liu, Spencer Greene, and Livio Dalloro, all listed as Google researchers. (arxiv.org)
Google has been working on test and build triage for years, but earlier systems focused more on finding the change that broke the build or locating flaky tests. A 2017 Google paper said its regression finder put the culprit change in the top five 82% of the time across 140 projects, and a 2020 paper reported 82% accuracy for locating root causes of flaky tests across 428 projects. (research.google 1) (research.google 2)
That history helps explain what changed here: instead of ranking suspect code changes, Google is using a language model to read messy text logs and explain the failure in plain language. The paper argues that this fits a task where the hard part is extracting signal from unstructured output, not compiling a formal proof. (arxiv.org)
Google has also been moving similar machine-learning tools directly into developer workflows rather than keeping them as separate dashboards. In April 2024, Google said an internal machine-learning system for broken builds was enabled for all developers and that about 34% of users who hit a build breakage in a given month applied one of its suggested fixes. (research.google)
The through line is less “AI writes code” than “AI shortens the time between a failure and a fix.” In Google’s setup, the new diagnosis lands in the same review system where engineers already decide what to change next. (arxiv.org)
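
For readers who want to check the headline percentages, the short Python sketch below reproduces them from the counts quoted in this article. The figure of 64 correct diagnoses is an inference, not a number stated here: it is the only whole count out of 71 that rounds to 90.14%. The helper name `percent` is purely illustrative.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Assumption: 64 of the 71 evaluated failures were diagnosed correctly.
# This count is inferred from the reported 90.14%, not quoted directly.
accuracy = percent(64, 71)

# Reported: ranked No. 14 among 370 tools that post findings in Critique.
top_share = percent(14, 370)

print(f"root-cause accuracy: {accuracy}%")
print(f"helpfulness rank share: top {top_share}%")
```

Both values match the figures cited above, which suggests the reported percentages are simple ratios of the stated counts.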