Don’t skip human review
One thread warned that AI test reviews can paradoxically increase escaped defects if teams don’t audit the agent’s findings; the posts claim that unchecked AI test review triples escaped defects. (x.com) The practical take is simple: treat AI test output as input from a reviewer that still needs human verification and clear escalation rules. (x.com)
The warning making the rounds in software circles is not that AI review is useless. It is that AI review is easy to use badly. A pair of posts on X argued that teams can see *more* escaped defects after adding AI to test or code review if nobody checks the model’s findings, because engineers start trusting a fast reviewer that is still wrong in important ways (x.com 1) (x.com 2).

The claim about defects “tripling” is the part that needs the most care. I could not find a published paper, vendor benchmark, or public incident writeup that establishes a general result that unchecked AI review triples escaped defects across teams. The number appears to come from social posts, not a verified study.

What *is* easy to verify is the mechanism behind the warning: modern AI review tools can miss real problems, invent fake ones, and sound confident while doing both. GitHub’s own documentation says Copilot code review provides feedback and suggested changes, but frames it as assistance rather than a final authority, and GitHub’s broader guidance is blunter still: developers own the merge decision (docs.github.com) (github.blog).

That gap between speed and reliability is now showing up in research too. A 2025 evaluation of LLMs for code review found that, even when given problem descriptions, GPT-4o and Gemini 2.0 Flash correctly classified code correctness only 68.50% and 63.89% of the time, respectively, on the authors’ test set. The paper’s conclusion was not that AI review should be abandoned. It was that teams need a “Human-in-the-loop LLM Code Review” process to reduce the risk from faulty outputs (arxiv.org).

Industry builders are converging on the same lesson, even when they are selling the tools. GitHub says Copilot code review is there to surface feedback quickly, not to replace accountability, and its product updates increasingly emphasize self-review before a pull request rather than autonomous approval (docs.github.com) (github.blog).
Snyk makes the same point from the security side, warning that AI code review can produce both false positives and false negatives and that a human should review the output (snyk.io).

The reason this matters is that bad automation does not fail loudly. It changes human behavior first. Once a team starts treating AI comments as a filter for what deserves attention, missed issues become harder to notice because reviewers assume the machine already scanned the obvious risks. Martin Fowler’s recent writing on “humans and agents” makes a related argument in more general terms: the job is not to stare at every token forever, but to design a loop where humans supervise, verify, and correct the agent’s work instead of surrendering judgment to it (martinfowler.com).

The strongest counterexample comes from teams that built safeguards into the system from the start. Meta’s LLM-based testing work does not present AI as a free-form reviewer whose output should simply be trusted. Its ACH system is built around mutation testing and explicit verification, with the company stressing “verifiable assurances” that generated tests actually catch the targeted faults (engineering.fb.com).

That is the real divide here. AI review helps when it feeds a controlled process with audit, escalation, and ownership. It hurts when it becomes a permission slip to stop looking.
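
To make the mutation-testing idea concrete, here is a toy Python sketch of what a “verifiable” check on a generated test might look like. It is an illustration of the general technique, not Meta’s ACH implementation: every name in it (`shipping_fee`, `mutant_shipping_fee`, the three stand-in tests, and `verify_generated_test`) is hypothetical, and the three-way accept/reject/escalate outcome is an assumption chosen to mirror the audit-and-escalation process described above.

```python
from typing import Callable

FeeFn = Callable[[int], int]
TestFn = Callable[[FeeFn], bool]


def shipping_fee(order_total: int) -> int:
    """Hypothetical code under test: orders of 50 or more ship free."""
    return 0 if order_total >= 50 else 5


def mutant_shipping_fee(order_total: int) -> int:
    """Deliberately injected fault: the boundary is `>` instead of `>=`."""
    return 0 if order_total > 50 else 5


def boundary_test(fee: FeeFn) -> bool:
    """Stand-in for an AI-proposed test that probes the boundary."""
    return fee(50) == 0 and fee(49) == 5


def weak_test(fee: FeeFn) -> bool:
    """Stand-in for an AI-proposed test that looks plausible but avoids the boundary."""
    return fee(80) == 0 and fee(10) == 5


def stale_test(fee: FeeFn) -> bool:
    """Stand-in for an AI-proposed test that encodes the wrong expectation."""
    return fee(50) == 5


def verify_generated_test(test: TestFn, original: FeeFn, mutant: FeeFn) -> str:
    """Classify a generated test by running it against the original and the mutant."""
    if not test(original):
        # Either the test or the code is wrong; a person has to decide which.
        return "escalate"
    if test(mutant):
        # The test passes even with the fault injected, so it proves nothing here.
        return "reject"
    # Passes on the original and fails on the mutant: it catches the targeted fault.
    return "accept"


if __name__ == "__main__":
    for candidate in (boundary_test, weak_test, stale_test):
        verdict = verify_generated_test(candidate, shipping_fee, mutant_shipping_fee)
        print(f"{candidate.__name__}: {verdict}")
```

The point of the sketch is that `weak_test` would look fine in a quick read and passes on the current code, yet it is rejected because it cannot tell the original from the faulty version. The `escalate` branch is where a human owner stays in the loop: the check can show that something disagrees, but not whether the test or the code is at fault.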