Claude Mythos scores 3/10 in UK tests

- Anthropic’s Claude Mythos Preview did not score “3 out of 10” on a UK safety test. It completed AISI’s simulated corporate attack in 3 of 10 runs.
- The UK AI Security Institute then tested OpenAI’s GPT-5.5, which finished the same 32-step attack in 2 of 10 runs and slightly beat Mythos on expert tasks.
- That matters because the result shifts the story from one-off hype to a broader trend: frontier models are getting materially better at offensive cyber work.

This is a cybersecurity story, not a generic “AI safety score” story. The number floating around, 3 out of 10, makes it sound like Anthropic’s Claude Mythos Preview flunked a government exam. That turns out to be the wrong frame. The UK AI Security Institute, or AISI, wasn’t grading Mythos on a 10-point safety scale. It was measuring whether the model could complete a long, realistic cyberattack simulation from start to finish, and Mythos managed it in 3 of 10 attempts. GPT-5.5 then did the same thing in 2 of 10. (aisi.gov.uk)

### What was actually tested?

AISI ran two kinds of evaluations. One was a set of capture-the-flag tasks: focused hacking challenges that test skills like reverse engineering, cryptography, and vulnerability exploitation. The other was much closer to a real attack chain: a 32-step corporate network simulation called “The Last Ones,” built to mimic the slog of moving [...] (aisi.gov.uk)

### So what does “3 out of 10” mean?

It means Mythos completed that full 32-step simulation end-to-end in 3 of 10 runs. It does not mean Mythos earned a 30% safety grade. That distinction matters a lot, because this was an autonomy and capability test (can the model keep a long offensive operation on track?), not a report card on whether Anthropic built a “safe” model. GPT-5.5 later completed the same simulation in 2 of 10 runs. (aisi.gov.uk)

### Did GPT-5.5 beat Mythos or not?

Depends on the slice. On the long attack simulation, Mythos is still ahead: 3 successful runs versus 2. But on AISI’s expert-level narrow cyber tasks, GPT-5.5 posted a 71.4% average pass rate, compared with 68.6% for Mythos. So one model looks slightly better at the marathon, the other slightly better at the hardest isolated technical problems. The bigger point is that they’re now in the same band. (aisi.gov.uk)

### Why is that a bigger deal than it sounds?

Because the April Mythos result could have been dismissed as a weird one-off: one unusually capable model from one company. The GPT-5.5 result makes that harder to say. AISI’s own framing is basically that this looks less like a single breakthrough and more like a broader capability trend across frontier models. That is the real news here. (aisi.gov.uk)

### Does this mean these models are ready to hack the real world?

Not exactly. These are controlled environments with vulnerable targets, fixed objectives, and lots of structure. That still matters a lot, because the tests are designed to be realistic enough to show what the models can chain together autonomously. But it does not mean a model can just be dropped onto any hardened e[...] more like “the lab version of that future is getting uncomfortably real.” (aisi.gov.uk)

### Where does Anthropic fit in this?

Anthropic has been unusually blunt about why Mythos matters. Its security team says the model is strong enough at finding and exploiting serious software flaws that the company launched a defensive effort around it, called Project Glasswing. Anthropic also says Mythos found zero-day vulnerabilities across major operating systems and browsers during internal testing, though many details are withheld because the bugs are still sensitive. (red.anthropic.com)

### Why are people arguing about “enterprise readiness”?

Because capability and deployability are different questions. A model can be powerful enough to worry defenders while still being too unreliable, too restricted, or too risky to hand broad autonomy inside a real company. The current AISI numbers say these systems are crossing a threshold in offensive cyber capability. They do not settle whether vendors have [...] use. (aisi.gov.uk)

### Bottom line?

The cleanest way to read this story is simple: Claude Mythos Preview did not get a 3/10 “safety score.” It succeeded in 3 of 10 runs on a hard UK cyberattack simulation. GPT-5.5 followed with 2 of 10. The scary part is not who won by one run; it’s that more than one frontier model can now do this at all.
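For readers who want to see why a one-run gap says so little, here is a rough sketch of the arithmetic. It takes the two run counts quoted above (3 of 10 and 2 of 10) and computes an approximate 95% Wilson score interval for each success rate. The choice of the Wilson interval, the 95% level, and the function name `wilson_interval` are assumptions made for illustration; nothing here is drawn from AISI's own methodology.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a success rate (illustrative helper)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return center - half, center + half


# Run counts as reported in the article; the keys are just labels.
results = {"Claude Mythos Preview": (3, 10), "GPT-5.5": (2, 10)}

for model, (wins, runs) in results.items():
    low, high = wilson_interval(wins, runs)
    print(f"{model}: {wins}/{runs} runs, plausible rate {low:.0%} to {high:.0%}")
```

With only ten runs per model, the two ranges overlap heavily, which is consistent with reading the models as being in the same band rather than treating 3 versus 2 as a meaningful win.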
