Meta Deploys LLMs to Generate Failing Test Cases

Meta's engineering teams are now using LLMs to improve internal code quality by generating tests designed to fail. By deliberately crafting test cases that break proposed code changes, engineers caught eight real bugs during 41 code reviews. Four of those bugs reportedly would have caused production outages had they gone undetected.

- The system, called TestGen-LLM, is part of a broader framework Meta calls "Assured LLM-based Software Engineering" (Assured LLMSE), which aims to generate code improvements with verifiable guarantees. It improves existing human-written tests rather than creating them from scratch.
- This technology builds on Meta's long-standing investment in automated testing, which includes Sapienz, an AI-powered tool that originated at University College London and uses search-based algorithms to find crashes in mobile apps.
- TestGen-LLM applies a multi-step filter to the tests it generates: it first checks whether the code builds, then whether the test passes without errors, and finally whether it actually increases code coverage. Any test failing one of these checks is discarded.
- In an evaluation on Instagram's Reels and Stories, 75% of generated test cases built correctly, 57% passed reliably, and 25% successfully increased code coverage. Ultimately, Meta engineers accepted and deployed 73% of the test improvements suggested by the system.
- A related internal tool, Automated Compliance Hardening (ACH), uses LLMs for "mutation testing," a technique that deliberately introduces small defects (mutants) into the code to see whether the test suite can detect them. The LLM both generates realistic mutants based on plain-text descriptions of concerns (e.g., privacy faults) and automatically generates the tests to "kill" (catch) them.
- During a trial from October to December 2024, the ACH system was deployed for privacy testing on Facebook, Instagram, WhatsApp, and wearable platforms like Quest and Ray-Ban Meta glasses. Privacy engineers accepted 73% of the unit tests generated by the system.
- While tests generated by TestGen-LLM added coverage for a median of 2.5 lines of code, a single test case "hit the jackpot" by covering 1,326 lines, demonstrating the potential to identify significant untested code paths.
- This work was detailed in a February 2024 research paper titled "Automated Unit Test Improvement using Large Language Models at Meta," with further insights presented at the FSE 2025 and EuroSTAR 2025 software engineering conferences.
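
The three-stage filter described above (builds, passes, adds coverage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Meta's implementation: the `CandidateTest` record, its fields, and `filter_candidates` are hypothetical names, and the build, run, and coverage results are assumed to have been collected already.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CandidateTest:
    """Hypothetical record for one LLM-generated test and its measured outcomes."""
    name: str
    builds: bool                      # did the code build with this test added?
    passes: bool                      # did the test pass when run?
    lines_covered: Set[int] = field(default_factory=set)  # lines the test exercises


def filter_candidates(candidates: List[CandidateTest],
                      baseline: Set[int]) -> List[CandidateTest]:
    """Keep only tests that build, pass, and cover at least one new line."""
    kept = []
    for candidate in candidates:
        if not candidate.builds:
            continue                  # stage 1: discard tests that break the build
        if not candidate.passes:
            continue                  # stage 2: discard failing tests
        if not (candidate.lines_covered - baseline):
            continue                  # stage 3: discard tests that add no coverage
        kept.append(candidate)
    return kept


# Illustrative data only: four candidates against a baseline that covers lines 1-2.
candidates = [
    CandidateTest("t_broken_build", builds=False, passes=False),
    CandidateTest("t_fails", builds=True, passes=False, lines_covered={10, 11}),
    CandidateTest("t_redundant", builds=True, passes=True, lines_covered={1, 2}),
    CandidateTest("t_useful", builds=True, passes=True, lines_covered={2, 3, 4}),
]
baseline = {1, 2}
kept = filter_candidates(candidates, baseline)
print([c.name for c in kept])  # ['t_useful']
```

Ordering the checks from cheapest to most informative means a test that fails to build is never run, and coverage is only compared for tests that already pass.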
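
The mutation-testing idea behind ACH can also be shown with a toy example. In the sketch below, the mutant and the tests are written by hand as stand-ins for what the article says the LLM generates, and every name (`can_share`, `mutant_can_share`, `is_killed`) is hypothetical. A mutant counts as "killed" when at least one test that passes on the original code fails on the mutated code.

```python
from typing import Callable, List

ShareRule = Callable[[bool, bool], bool]


def can_share(consented: bool, is_minor: bool) -> bool:
    """Original rule: share data only with consent and only for adults."""
    return consented and not is_minor


def mutant_can_share(consented: bool, is_minor: bool) -> bool:
    """Mutant with an injected privacy fault: the minor check is dropped."""
    return consented


def weak_test(rule: ShareRule) -> bool:
    """Checks consent handling only; never exercises the minor case."""
    return rule(True, False) is True and rule(False, False) is False


def strong_test(rule: ShareRule) -> bool:
    """Checks that a consenting minor is still blocked."""
    return rule(True, True) is False


def is_killed(mutant: ShareRule, tests: List[Callable[[ShareRule], bool]]) -> bool:
    """A mutant is killed if any test fails when run against it."""
    return any(not test(mutant) for test in tests)


print(is_killed(mutant_can_share, [weak_test]))               # False: mutant survives
print(is_killed(mutant_can_share, [weak_test, strong_test]))  # True: mutant is killed
```

A surviving mutant points to a gap in the test suite; here, adding `strong_test` closes the gap that `weak_test` alone leaves open.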
