GPT-5.5 and Mythos surpass benchmarks

- OpenAI and Anthropic models cleared harder cyber benchmarks in May 2026, with British government testing showing GPT-5.5 reached Claude Mythos-level performance.
- AISI said GPT-5.5 posted a 71.4% expert-task pass rate versus Mythos Preview’s 68.6%, and solved a 32-step attack simulation end-to-end. (aisi.gov.uk)
- OpenAI launched Daybreak on May 13, 2026, offering tiered cyber-defense access built around GPT-5.5, Trusted Access for Cyber, and GPT-5.5-Cyber. (cyberscoop.com)

Britain’s AI Security Institute said on April 30 that OpenAI’s GPT-5.5 had become the second model to complete one of its multi-step cyberattack simulations end-to-end, putting it at roughly the same level as Anthropic’s Claude Mythos Preview on the agency’s most demanding tests. The result mattered because Anthropic had framed Mythos as a tightly restricted model with unusual cyber capability, while GPT-5.5 is far more broadly available. (aisi.gov.uk)

Separate reporting published on May 13 said Palo Alto Networks reached a similar conclusion after testing frontier models on vulnerability research and exploit development. OpenAI, meanwhile, used the moment to launch Daybreak, a new cyber-defense product built around the same family of models. (cyberscoop.com)

### How far did GPT-5.5 actually move on the benchmark?

AISI said GPT-5.5 was “one of the strongest models” it had tested and the second system to solve one of its multi-step cyberattack simulations from start to finish. The institute said its earlier April evaluation of an Anthropic Mythos Preview snapshot had found the first model to complete that exercise, which it estimates would take a human about 20 hours. The April 30 evaluation said GPT-5.5 achieved a 71.4% average pass rate on expert-level tasks, compared with 68.6% for Mythos Preview, 52.4% for GPT-5.4 and 48.6% for Anthropic’s Opus 4.7. (aisi.gov.uk)

AISI said the expert suite covered realistic vulnerability research and exploitation tasks, including reverse engineering, web exploitation and cryptography.

### What did the U.K. institute say about the pace of progress?

CyberScoop reported on May 13 that AISI believed both Claude Mythos Preview and GPT-5.5 had moved beyond the doubling trend line it had been tracking since late 2024. (aisi.gov.uk)

The institute had previously estimated that the “80% reliability cyber time horizon” for frontier models was doubling about every five months, down from roughly eight months in an earlier November 2025 estimate. AISI wrote that frontier AI’s autonomous cyber and software capability was advancing on the scale of months, not years, according to CyberScoop’s account of the findings. (aisi.gov.uk)

In the institute’s cyber ranges, CyberScoop said, a newer Mythos checkpoint solved “The Last Ones,” a 32-step simulated corporate network attack, in 6 of 10 attempts and “Cooling Tower” in 3 of 10 attempts, while GPT-5.5 solved “The Last Ones” in 3 of 10 attempts.

### Why did Bruce Schneier’s reaction get attention?

Bruce Schneier wrote on May 13 that AISI had found GPT-5.5 “comparable to Claude Mythos” at finding security vulnerabilities. (cyberscoop.com)

That mattered because Schneier’s summary cut against the idea that Mythos represented a one-off leap confined to a single, tightly held model. AISI itself raised the same question in its April 30 post, asking whether Mythos reflected a model-specific breakthrough or a broader trend. Its answer, based on GPT-5.5’s results, was that a second model from a different developer had now reached a similar level on its cyber evaluations. (cyberscoop.com)

### What did outside testing find beyond the government benchmark?

Palo Alto Networks said in findings cited by CyberScoop on May 13 that the newest models were “extraordinarily capable” at finding vulnerabilities and turning them into critical exploit paths in near-real time. (schneier.com)

The company said it had tested Claude Mythos in April through Anthropic’s Project Glasswing and later tested OpenAI’s GPT-5.5-Cyber through OpenAI’s Trusted Access for Cyber program. The same report said Palo Alto released advisories covering 26 CVEs representing 75 issues found through AI model scanning across more than 130 products, compared with a typical monthly volume of fewer than five CVEs. (aisi.gov.uk)

That figure gave one of the clearest outside signs that the benchmark gains were showing up in structured industrial testing, not only in government-run ranges.

### Where does Daybreak fit into this?

OpenAI unveiled Daybreak on May 13 as a cybersecurity initiative that combines its language models with its Codex agentic framework to help companies identify, patch and validate software vulnerabilities. (cyberscoop.com)

CyberScoop reported that the platform has three tiers: standard GPT-5.5, GPT-5.5 with Trusted Access for Cyber for verified defensive work, and GPT-5.5-Cyber for more specialized uses such as authorized red-teaming and penetration testing. Anthropic has kept Mythos tightly restricted and has not made it commercially available, according to CyberScoop. (cyberscoop.com)

OpenAI said Daybreak paired broader capability with identity checks, access controls and account-level oversight for the highest-capability tier. May 13 also marked the public rollout of OpenAI’s answer to Anthropic’s Project Glasswing, with named partners and access controls likely to determine how quickly these systems move from benchmark exercises into routine security workflows. OpenAI said GPT-5.5-Cyber remains in preview under controlled conditions, while AISI’s April 30 benchmark remains the clearest public yardstick for comparing GPT-5.5 and Mythos directly. (cyberscoop.com) (aisi.gov.uk)
