Claude achieves 73% on AISI tests
- The U.K. AI Security Institute said on April 13 that Anthropic’s Claude Mythos Preview solved 73% of expert cyber tasks in testing.
- AISI said on April 30 that OpenAI’s GPT-5.5 reached a 71.4% average pass rate on expert tasks, versus 68.6% for Mythos Preview.
- AISI said both results came from controlled evaluations detailed on its April 13 and April 30 blog posts.

The U.K. AI Security Institute, or AISI, said on April 13 that Anthropic’s Claude Mythos Preview succeeded 73% of the time on expert-level cyber capture-the-flag tasks in controlled testing. AISI said the same evaluation found Claude Mythos Preview was the first model to complete its 32-step corporate-network attack simulation end-to-end. On April 30, AISI said OpenAI’s GPT-5.5 reached a similar level on the same test family and became the second model to finish that network simulation. The claims circulating on X this week trace back to those two AISI posts, not to a fresh May 22 government release.

### Where did the 73% figure come from?

AISI said the 73% number came from expert-level tasks in its cyber capture-the-flag, or CTF, suite. The institute wrote that those expert tasks were at a difficulty level “which no model could complete before April 2025,” and said Mythos Preview “succeeds 73% of the time.” (aisi.gov.uk)

AISI said its CTF tasks are meant to test skills such as vulnerability discovery and exploitation. The institute described the broader progression of its testing since 2023 as moving from chat-based probing to harder CTFs and then to multi-step cyberattack simulations. (aisi.gov.uk)

### What is the corporate-network simulation people are referring to?

AISI said it built a test called “The Last Ones,” or TLO, as a 32-step corporate network attack simulation. The institute said the scenario spans “initial reconnaissance through to full network takeover” and estimated that it would take a human about 20 hours to complete. (aisi.gov.uk)

AISI said Mythos Preview was the first model to complete that simulation end-to-end. In its April 30 post, the institute said GPT-5.5 was the second model to solve one of its multi-step cyberattack simulations end-to-end. (aisi.gov.uk)

### Did GPT-5.5 actually “follow Claude” with comparable performance?

AISI said on April 30 that GPT-5.5 reached “a similar level of performance” to Mythos Preview on its cyber evaluations. The institute reported a 71.4% average pass rate for GPT-5.5 on expert tasks, compared with 68.6% for Mythos Preview in that comparison, and said GPT-5.5 “may be the strongest model we have tested” on that measure. (aisi.gov.uk)

AISI also said the GPT-5.5 result helped answer what it called a “key question” after the Mythos evaluation: whether Claude’s result was a one-model breakthrough or part of a broader trend. AISI wrote that GPT-5.5’s results “suggest the latter.” (aisi.gov.uk)

### Did AISI say capabilities are doubling every four months?

AISI’s April 13 and April 30 blog posts, as reviewed here, do not state that cyber capabilities are doubling every four months. The institute did say cyber performance had been “rapidly improving” and described Mythos Preview as “a step up over previous frontier models.” It also wrote that GPT-5.5’s results suggested a broader trend across developers. (aisi.gov.uk)

The X post that prompted the question could not be independently accessed, so it could not be verified who made the four-month doubling claim or whether the post quoted an AISI contributor directly. The underlying AISI posts do verify the core benchmark numbers and the existence of the simulation screenshots and charts on AISI’s site. (aisi.gov.uk)

### What should readers treat as the verified version of this story?

The verified record is AISI’s two dated posts: April 13 for Anthropic’s Claude Mythos Preview and April 30 for OpenAI’s GPT-5.5. Those posts say Claude Mythos Preview hit 73% on expert CTF tasks and first completed AISI’s 32-step corporate-network simulation, while GPT-5.5 later posted a 71.4% expert-task pass rate and became the second model to complete a multi-step cyberattack simulation end-to-end. (aisi.gov.uk)

AISI’s next public updates on this topic are most likely to appear on its Work Blog, where both evaluations were published on April 13 and April 30. (aisi.gov.uk)