New safety numbers surfaced

Circulating safety metrics claim the withheld 'Mythos' model had an exploit success rate of about 70% on tested vulnerabilities, compared with roughly 0.8% for Opus 4.6. (x.com) Separate analyses cited roughly 14.6 harmful clinical recommendations per 100 cases in leading LLMs, and SANS warned that real-world LLM production failures nearly caused incidents. All of these signals are being used to justify tighter release controls. (x.com) (x.com)

A new Anthropic system card says Claude Mythos Preview was withheld from general release after internal testing found a sharp jump in dangerous cyber capability. (anthropic.com)

Anthropic published the Mythos Preview system card on April 7, 2026, and said the model is being used only in a “defensive cybersecurity program” with a limited set of partners. The company said Mythos is its “most capable frontier model to date” and outperforms Claude Opus 4.6 on many benchmarks. (anthropic.com)

Anthropic’s earlier Opus 4.6 release, published February 5, 2026, took a different path. Its system card said Opus 4.6 was deployed under the company’s Artificial Intelligence Safety Level 3 standard after testing found “a comparably low rate of overall misaligned behavior” relative to Opus 4.5. (anthropic.com 1) (anthropic.com 2)

Large language models are text-prediction systems trained on vast datasets, but the safety fight around them increasingly centers on what they can do with tools, code, and long chains of actions. In cybersecurity, that means reading software the way a human researcher would and spotting flaws that automated fuzzers can miss. (anthropic.com)

Anthropic said that shift was already visible in Opus 4.6. In a February 5 research post, the company said the model found high-severity vulnerabilities in heavily tested open-source codebases and that its teams had validated more than 500 such flaws and started reporting patches. (anthropic.com)

The new Mythos card turns that capability into a release decision. Anthropic said the model’s “large increase in capabilities” was enough to keep it off the open market for now and to use the findings to shape future safeguards and releases. (anthropic.com)

Outside Anthropic, recent safety work in medicine and security has pointed in the same direction: stronger models can still fail in high-stakes settings. A late-2025 preprint titled *First, do NOHARM* was produced by researchers from Stanford, Harvard, and other institutions to measure how often clinical recommendations from large language models become harmful. (arxiv.org)

Security operators have reported similar problems in production. A SANS Institute presentation updated March 17, 2026 said “hallucinating” large language models nearly caused priority-one incidents in real security workflows and argued for human checkpoints, access controls, and tighter data handling. (sans.org)

Anthropic has framed Mythos as part of a narrower “Project Glasswing” effort to help secure critical software rather than a consumer launch. The company’s public materials say the model can find working exploits quickly enough that even engineers without formal security training were able to use it to produce remote-code-execution exploits overnight in testing. (red.anthropic.com)

The immediate question is no longer whether frontier models can uncover serious software flaws. Anthropic’s latest answer is that some of them are now strong enough to change how, and whether, they get released. (anthropic.com)
