Microsoft MDASH tops CyberGym benchmark

- Microsoft said on May 12 that its MDASH cyber system topped the CyberGym benchmark, posting the highest published score among evaluated systems.
- Microsoft reported an 88.45% CyberGym score, saying MDASH used more than 100 specialized agents across frontier and distilled models.
- Anthropic briefed the House Homeland Security Committee on Mythos on May 13, with a related hearing planned.

Microsoft said on May 12 that its new multi-model security system, codenamed MDASH, posted the top published result on the CyberGym cybersecurity benchmark, a public test built by researchers at the University of California, Berkeley. The company said MDASH scored 88.45% on CyberGym and used more than 100 specialized agents across multiple models rather than relying on a single model. Microsoft said the system is already being used by its internal security engineering teams and is in a limited private preview with customers. The result arrived a day before Anthropic briefed the House Homeland Security Committee on its own cyber-focused model, Mythos, in a closed-door session on May 13.

### How did Microsoft say MDASH beat the benchmark?

Microsoft vice president Taesoo Kim wrote in a company post that MDASH reached an “industry-leading 88.45% score” on the public CyberGym benchmark, which the company said was roughly five points ahead of the next entry. The post said the harness orchestrates more than 100 specialized agents across an ensemble of frontier and distilled models to “discover, debate, and prove exploitable bugs end-to-end.” (microsoft.com)

CyberGym says its benchmark contains 1,507 historical vulnerabilities from 188 large software projects. The project’s leaderboard page shows Anthropic Agent running Claude Mythos Preview at 83.1%, OpenAI Agent running GPT-5.5 at 81.8%, and OpenAI Agent running GPT-5.4 at 79.0% on Level 1, where agents receive a vulnerability description and an unpatched codebase. (microsoft.com)

### What is CyberGym actually measuring?

UC Berkeley researchers behind CyberGym say the benchmark evaluates whether AI agents can reproduce target vulnerabilities by generating working proof-of-concept exploits. The leaderboard says a run is counted as successful if any one trial succeeds. The benchmark’s framing matters because it tests agent performance on vulnerability-oriented tasks rather than general chatbot behavior.
(cybergym.io)

Microsoft’s post described MDASH as a “multi-model agentic scanning harness,” while CyberGym’s documentation describes the benchmark as a way to assess real-world vulnerability analysis tasks at scale.

### What else did Microsoft say MDASH has found?

Microsoft said MDASH helped researchers find 16 new vulnerabilities across the Windows networking and authentication stack, including four critical remote-code-execution flaws in components such as the Windows kernel TCP/IP stack and the IKEv2 service. The company also said the system found 21 of 21 planted vulnerabilities with zero false positives on a private test driver, and posted 96% recall against five years of confirmed Microsoft Security Response Center cases in clfs.sys and 100% in tcpip.sys. (microsoft.com)

Those claims come from Microsoft’s own disclosure rather than an independent benchmark operator. The company said several members of the team behind MDASH came from Team Atlanta, the group that won the $29.5 million DARPA AI Cyber Challenge. (microsoft.com)

### Where does Anthropic’s Mythos fit into the same story?

Anthropic said on April 7 that Claude Mythos Preview was “strikingly capable” at computer security tasks and launched Project Glasswing to use the model to help secure critical software. In a technical post, Anthropic said Mythos Preview was capable of identifying and exploiting zero-day vulnerabilities in major operating systems and web browsers during its testing, while also saying it was limiting disclosures because most vulnerabilities it found had not yet been patched. (microsoft.com)

CyberGym’s leaderboard still lists Mythos Preview at 83.1%, which made it the top published single entry on the board before Microsoft announced MDASH. Microsoft’s result did not displace Mythos with another single model; it came from a coordinated system of agents and models.
That distinction is based on Microsoft’s description of MDASH’s architecture and CyberGym’s published leaderboard entries. (red.anthropic.com)

### Why are lawmakers getting briefed now?

The House Homeland Security Committee received a closed-door briefing from Anthropic on May 13, according to CyberScoop. The outlet reported that the committee is planning additional oversight, including a hearing that cybersecurity subcommittee chair Andy Ogles said he intends to hold. (microsoft.com)

CyberScoop reported that the Anthropic briefing included a live demonstration of Mythos and that committee Democrats are requesting a classified briefing. The Hill, as cited by CyberScoop, reported that Anthropic’s side of the briefing was led by Logan Graham from the company’s frontier red team and Josh Tilstra from its national security programs and policy team. (cyberscoop.com)

### What happens next for MDASH and Mythos?

Microsoft said MDASH is in use by its security engineering teams and is being tested with a small set of customers in a limited private preview. Anthropic has already begun briefing lawmakers, and CyberScoop reported that the House Homeland Security Committee is preparing a hearing related to Mythos after the May 13 closed session. (microsoft.com) (cyberscoop.com)
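
For readers who want to see how the scoring rule noted earlier works, where a run counts as successful if any one trial succeeds, here is a minimal Python sketch. It is an illustration only, not CyberGym's actual scoring code: the function names, the task IDs, and the trial outcomes are all hypothetical, and it assumes each task's trials are recorded as a list of pass/fail booleans.

```python
from typing import Dict, List


def task_solved(trials: List[bool]) -> bool:
    """A task counts as solved if at least one of its trials succeeded."""
    return any(trials)


def success_rate(results: Dict[str, List[bool]]) -> float:
    """Percentage of tasks with at least one successful trial."""
    if not results:
        raise ValueError("no tasks to score")
    solved = sum(task_solved(trials) for trials in results.values())
    return 100.0 * solved / len(results)


# Hypothetical trial outcomes for four made-up tasks (not CyberGym data).
example_results = {
    "task-a": [False, True, False],   # solved on the second trial
    "task-b": [False, False, False],  # never solved
    "task-c": [True],                 # solved on the only trial
    "task-d": [False, False, True],   # solved on the third trial
}

print(f"{success_rate(example_results):.1f}%")  # prints 75.0%
```

Under a rule like this, allowing more trials per task can only raise or hold a score, which is one reason comparisons between leaderboard entries are easier to read when the number of trials is known.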
