GPT-5.5 breaks cyber capability benchmarks
- On May 13, CyberScoop reported that OpenAI’s GPT-5.5 and Anthropic’s Claude Mythos Preview exceeded prior autonomous cyber capability trend lines in separate studies. (cyberscoop.com)
- Britain’s AI Security Institute said GPT-5.5 solved one multi-step corporate attack simulation end-to-end and posted a 71.4% expert-task pass rate. (aisi.gov.uk)
- OpenAI on May 13 unveiled Daybreak, with GPT-5.5, Trusted Access for Cyber, and GPT-5.5-Cyber tiers for vetted defenders. (cyberscoop.com)

Two new sets of results pushed the AI-and-cybersecurity debate into a more concrete phase this week. On May 13, CyberScoop reported that OpenAI’s GPT-5.5 and Anthropic’s Claude Mythos Preview outperformed the trend lines researchers had been using to track autonomous cyber progress. Britain’s AI Security Institute, or AISI, said the models surpassed the pace it had been measuring since late 2024, while Palo Alto Networks said recent models were finding vulnerabilities and turning them into exploitable attack paths much faster than before. (aisi.gov.uk) (cyberscoop.com)

The shift matters because the benchmarks here are not abstract coding tests. AISI’s work includes structured attack simulations against small enterprise networks and advanced vulnerability-research tasks designed to mimic real security work. (cyberscoop.com) OpenAI, meanwhile, is trying to channel that capability into a controlled product push called Daybreak, announced May 13.

### What exactly did the new models do?
AISI said Claude Mythos Preview became the first model to complete both of its cyber-range exercises, including “The Last Ones,” a 32-step simulated corporate network attack, and “Cooling Tower,” which no model had previously solved. GPT-5.5 also crossed a threshold AISI had been watching for months: it became the second model to solve one of those multi-step attack simulations end-to-end. (cyberscoop.com)

CyberScoop reported that Claude Mythos solved “The Last Ones” in 6 of 10 attempts and “Cooling Tower” in 3 of 10 attempts. GPT-5.5 solved “The Last Ones” in 3 of 10 attempts, according to the same report. (cyberscoop.com)

### Why are researchers calling this a broken benchmark?
AISI said earlier in 2026 that frontier models’ “80% reliability cyber time horizon” had been doubling about every five months, down from an eight-month estimate in November 2025.
That metric uses the length of a task a human expert would need to complete as a proxy for how much cyber work a model can do autonomously. (cyberscoop.com)

The institute said GPT-5.5 and Claude Mythos both exceeded that already-fast trend. AISI wrote that autonomous cyber capability has been doubling “on the order of months, not years,” and CyberScoop said researchers do not yet know whether this was a one-off jump or the start of a steeper curve. (cyberscoop.com)

### How close were GPT-5.5 and Mythos on vulnerability-finding?
AISI’s April 30 evaluation said GPT-5.5 reached a “similar level of performance” to Claude Mythos Preview on its cyber tests. On expert-level tasks, AISI reported an average pass rate of 71.4% for GPT-5.5, versus 68.6% for Mythos Preview, with both ahead of earlier frontier models including GPT-5.4 and Opus 4.7. (cyberscoop.com)

Bruce Schneier wrote this week that AISI found GPT-5.5 “comparable” to Claude Mythos at finding security vulnerabilities. Schneier also noted that GPT-5.5 is generally available, a contrast with Anthropic’s tighter restrictions on Mythos access. (cyberscoop.com)

### What did Palo Alto Networks say it saw in separate testing?
Palo Alto Networks told CyberScoop it began testing Claude Mythos in April as a launch partner for Anthropic’s Project Glasswing and later tested OpenAI’s GPT-5.5-Cyber through OpenAI’s Trusted Access for Cyber program. The company said the latest models were “extraordinarily capable” at finding vulnerabilities and converting them into critical exploit paths in near-real time. (aisi.gov.uk)

The company said it issued advisories for 26 CVEs covering 75 issues found through AI model scanning across more than 130 products, compared with a typical monthly volume of fewer than five CVEs, according to CyberScoop’s report. (schneier.com)

### Where does OpenAI’s Daybreak fit into this?
OpenAI on May 13 unveiled Daybreak as a cybersecurity initiative combining GPT-5.5 with its Codex framework to help organizations identify, patch and validate software vulnerabilities. CyberScoop reported that the offering has three tiers: standard GPT-5.5, GPT-5.5 with Trusted Access for Cyber for verified defensive work, and GPT-5.5-Cyber for more specialized uses such as authorized red-teaming and penetration testing. (cyberscoop.com)

OpenAI said the higher-capability tiers come with stronger identity verification and account-level oversight. CyberScoop said Anthropic has kept Mythos tightly restricted and has not made it commercially available, while OpenAI argued in an earlier post that it does not think it is “practical or appropriate” to centrally decide who gets to defend themselves. (cyberscoop.com)

### What happens next?
May 13 left two parallel tracks in view. AISI’s benchmark work is now being used to judge whether the latest results were an isolated jump or a new pace of capability growth, and OpenAI is moving Daybreak through controlled access for verified defenders. (cyberscoop.com)

Palo Alto Networks, Cisco and CrowdStrike were among the named companies connected to the current wave of testing or deployment, according to CyberScoop’s reporting on the benchmark results and Daybreak launch. Those evaluations and limited-access rollouts are likely to be the next public checkpoints for how much these models change day-to-day offensive and defensive security work. (cyberscoop.com 1) (cyberscoop.com 2)
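
For readers weighing the doubling-time figures cited above, the gap between an eight-month and a five-month doubling is easier to see as yearly growth. The short Python sketch below is illustrative arithmetic only: it assumes a constant doubling time and plain exponential growth, which is a simplification and not AISI's published methodology, and the function name `growth_factor` is hypothetical.

```python
def growth_factor(doubling_months: float, elapsed_months: float) -> float:
    """Multiplicative growth over elapsed_months, assuming a fixed doubling time.

    Illustrative only: real capability trends need not follow a clean exponential.
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (elapsed_months / doubling_months)


# Compare the two doubling times mentioned in the article over one year.
for doubling in (8.0, 5.0):
    factor = growth_factor(doubling, 12.0)
    print(f"{doubling:.0f}-month doubling: about {factor:.2f}x per year")
```

Under these assumptions, an eight-month doubling works out to roughly 2.8x growth in the time horizon per year, while a five-month doubling works out to roughly 5.3x. Results that land above the five-month trend would imply an even faster rate, if the jump persists.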