Microsoft multi-agent tops Mythos

- Microsoft said on May 12 its MDASH multi-agent security system topped the CyberGym benchmark, beating Anthropic’s Mythos and other single-model entries. - Microsoft reported MDASH scored 88.45% on CyberGym’s 1,507 real-world vulnerabilities, about five points ahead of the next leaderboard entry. - OpenAI said GPT-5.5-Cyber remains in limited preview, while Microsoft is testing MDASH with a small customer set.

Microsoft said on May 12 that its new vulnerability-scanning system, codenamed MDASH, posted the top score on the CyberGym benchmark, a public test built around 1,507 real-world software vulnerabilities. The company said MDASH scored 88.45%, ahead of Anthropic’s Mythos and other systems on the leaderboard, including OpenAI entries. Microsoft attributed the result to orchestration rather than a single model, saying the system uses more than 100 specialized AI agents that work across multiple frontier and distilled models. The result adds a new data point to a fast-moving contest among AI companies to show their systems can do useful cybersecurity work under controlled conditions. ### What exactly did Microsoft say it built? Microsoft described MDASH as a “multi-model agentic scanning harness” built by its Autonomous Code Security team. The company said the system coordinates more than 100 specialized agents to discover, debate and validate exploitable bugs end to end, rather than relying on one model to handle the full workflow. (microsoft.com) Taesoo Kim, a Microsoft vice president for agentic security, said in the company’s May 12 post that the harness helped researchers find 16 new vulnerabilities across the Windows networking and authentication stack, including four critical remote-code-execution flaws. Microsoft also said the system found 21 of 21 planted vulnerabilities on a private test driver with zero false positives, and reached 96% recall against five years of confirmed Microsoft Security Response Center cases in one Windows component and 100% in another. (microsoft.com) ### Where did Anthropic’s Mythos fit into this result? Anthropic said in its Project Glasswing announcement that Claude Mythos 2 Preview is an unreleased frontier model being used for defensive security work by partners including Microsoft, Amazon Web Services, Apple, Cisco, CrowdStrike, Google and JPMorganChase. Anthropic said the model had already found thousands of high-severity vulnerabilities, including some in every major operating system and web browser. (microsoft.com) Microsoft’s benchmark result matters partly because Mythos had become a reference point in AI cybersecurity testing. Microsoft had previously said it evaluated an early snapshot of Claude Mythos Preview using CTI-REALM, an open-source benchmark created to assess AI agents on real-world detection tasks, and Microsoft’s April security post said the company was partnering with Anthropic and others on AI-driven vulnerability discovery. (anthropic.com) ### Why does the leaderboard focus on systems, not just models? Microsoft said the “durable advantage” sits in the agentic system around the model rather than in any single model itself. In its account of the benchmark, the company said MDASH’s edge came from dividing work among agents that scan code, test hypotheses and verify whether a suspected flaw is real. (microsoft.com) OpenAI has framed its own cyber offering in similar workflow terms. OpenAI said on May 7 that GPT-5.5 and GPT-5.5-Cyber are being provided through its Trusted Access for Cyber program for verified defenders, with access levels and safeguards adjusted to the task. The company said approved users can use the models for vulnerability identification, triage, malware analysis, reverse engineering, detection engineering and patch validation, while safeguards continue to block credential theft, stealth, persistence and exploitation of third-party systems. (microsoft.com) ### How much of this is already being used outside the lab? Microsoft said MDASH is already being used by Microsoft security engineering teams and is being tested by a small set of customers in a limited private preview. The company linked the announcement to a sign-up page for that preview in its May 12 post. (openai.com) Anthropic said Project Glasswing includes more than 40 additional organizations that build or maintain critical software infrastructure, beyond the named launch partners. Anthropic also said it is committing up to $100 million in usage credits for Mythos Preview and $4 million in direct donations to open-source security organizations. ### What comes next from the companies in this race? (microsoft.com) OpenAI said individuals using its most cyber-capable and permissive models through Trusted Access for Cyber will be required to enable Advanced Account Security beginning June 1, 2026. That requirement applies as the company expands access to GPT-5.5 and GPT-5.5-Cyber for vetted defenders. Microsoft said MDASH remains in limited private preview, while Anthropic said Project Glasswing partners and additional infrastructure organizations are continuing to use Mythos Preview in defensive security work. (anthropic.com) Those next steps put the focus on who gets access, under what controls, and how quickly benchmark results translate into production security programs. (microsoft.com) (openai.com)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.