U.S. agencies and vendors begin trials of agentic AI code scanners to find vulnerabilities
- Microsoft said on May 12 its new MDASH agentic scanner is now in limited customer preview after finding 16 Windows flaws, including four critical RCE bugs.
- The bigger tell is the benchmark jump: MDASH scored 88.45% on CyberGym, while OpenAI’s GPT-5.5 posted 81.8% on the same test.
- That matters because CISA and NSA just warned agentic AI needs tight permissions even as deployment in defense and infrastructure speeds up.

Code scanners are turning into AI agents. That is the real shift here. Instead of flagging a suspicious line and stopping, the new systems read code, form a theory about the bug, try to prove it, and sometimes even suggest a fix. This week, Microsoft pushed that idea into public view with a new agentic vulnerability scanner called MDASH, while U.S. agencies warn that the same autonomy that makes these tools useful also makes them risky.

### What changed this week?

Microsoft said on May 12 that its multi-model agentic scanning harness, MDASH, found 16 previously undisclosed vulnerabilities in Windows networking and authentication components, including four critical remote-code-execution flaws. The company also said the system is already being used by Microsoft security engineering teams and is being tested with a small group of customers in a limited private preview. (microsoft.com)

### What makes this “agentic”?

A normal scanner is more like a metal detector: it beeps when a pattern looks wrong. An agentic system behaves more like a junior vulnerability researcher. Microsoft says MDASH orchestrates more than 100 specialized AI agents that discover, debate, and validate possible bugs end to end. OpenAI has been pitching GPT-5.5 in similar terms, as a model that can plan, use tools, check its work, and keep going across messy multi-step tasks, including coding and cyber testing. (microsoft.com)

### Why are people taking this seriously now?

Because the benchmark numbers stopped looking toy-sized. Microsoft says MDASH found 21 of 21 planted vulnerabilities with zero false positives in one private driver test, hit 96% recall on five years of confirmed Microsoft cases in one subsystem and 100% in another, and scored 88.45% on CyberGym. CyberGym is a public evaluation set built from 1,507 historical vulnerabilities across 188 software projects.
(microsoft.com)

OpenAI’s GPT-5.5 product page lists an 81.8% CyberGym score, which gives a rough sense that frontier models are now in the same conversation as specialized security systems, even if they are not the same thing.

### Where do U.S. agencies fit in?

The federal government has been inching toward this for a while. CISA disclosed in July 2024 that it had already run an operational pilot on AI-enabled vulnerability detection for government software, systems, and networks. But that pilot landed on a much more cautious conclusion: the best use of AI was to supplement existing tools, not replace them, and in some cases the extra benefit was negligible once training time and unpredictability were factored in. (microsoft.com)

### So why the fresh warning now?

Because the tools got more autonomous. On May 1, 2026, CISA, the NSA, Australia’s ASD ACSC, and other partners published joint guidance for “agentic AI services.” The warning is pretty direct: these systems can expand the attack surface, accumulate too much privilege, misbehave in ways that are hard to predict, and leave murky audit trails. The practical advice is also blunt: start with low-risk use cases and do not give agents broad access to sensitive data or critical systems. (cisa.gov)

### Is this about defense only?

No, and that is the catch. The same capabilities that help defenders trace attack paths and reproduce bugs can help attackers automate more of the intrusion workflow too. That is why the current race is less about a single model being “smart enough” and more about who can wrap strong controls around it: sandboxing, permission limits, logging, human review, and narrow task scopes. (cisa.gov)

### Why does the DARPA angle matter?

Because this is no longer just lab work.
Microsoft says several MDASH team members came from Team Atlanta, the winner of DARPA’s AI Cyber Challenge, a competition built around autonomous systems that could find and patch real bugs in widely used open-source software. In effect, the government-funded competition phase is feeding directly into commercial product trials. (cisa.gov)

### Bottom line?

The news is not that AI can scan code. It has done that for years. The news is that vendors and agencies are now testing systems that can carry a vulnerability hunt much farther on their own, and everyone involved is trying to keep the permissions narrower than the ambition. (microsoft.com)
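
### What might “narrow permissions” look like in code?

The controls named earlier (permission limits, logging, human review, narrow task scopes) can be pictured as a small policy gate sitting between an agent and the tools it wants to call. The sketch below is only an illustration of that idea, not a description of how MDASH or any agency deployment works. Every name in it (`ToolRequest`, `ScanPolicy`, the tool names, the `/workspace/repo/` path) is a hypothetical stand-in.

```python
import posixpath
from dataclasses import dataclass, field

# All names here are illustrative assumptions, not any vendor's or agency's API.


@dataclass(frozen=True)
class ToolRequest:
    tool: str    # what the agent wants to do, e.g. "read_file"
    target: str  # the path it wants to touch


@dataclass
class ScanPolicy:
    allowed_tools: frozenset                # narrow task scope
    allowed_prefixes: tuple                 # permission limit on paths
    needs_review: frozenset = frozenset()   # tools that need human sign-off
    audit_log: list = field(default_factory=list)

    def decide(self, req: ToolRequest) -> str:
        """Return "allow", "review", or "deny", and log the decision."""
        # Collapse "../" segments so a request cannot escape the allowed tree.
        path = posixpath.normpath(req.target)
        if req.tool not in self.allowed_tools:
            verdict = "deny"
        elif not path.startswith(self.allowed_prefixes):
            verdict = "deny"
        elif req.tool in self.needs_review:
            verdict = "review"
        else:
            verdict = "allow"
        # Record every decision, including denials, for the audit trail.
        self.audit_log.append((req.tool, req.target, verdict))
        return verdict


policy = ScanPolicy(
    allowed_tools=frozenset({"read_file", "run_poc"}),
    allowed_prefixes=("/workspace/repo/",),
    needs_review=frozenset({"run_poc"}),
)

print(policy.decide(ToolRequest("read_file", "/workspace/repo/src/net.c")))
print(policy.decide(ToolRequest("read_file", "/workspace/repo/../../etc/passwd")))
print(policy.decide(ToolRequest("run_poc", "/workspace/repo/poc/case1")))
```

The design choice is deny-by-default: anything not explicitly listed is refused, riskier actions are routed to a person, and every request is logged whether or not it was allowed. A real deployment would add sandboxing at the operating-system level, which a check like this cannot provide on its own.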