Claude Mythos hits 16-hour METR

- Anthropic’s Claude Mythos Preview just showed up on METR’s public time-horizon chart, with an early score brushing the edge of a 16-hour task window. - That number means Mythos can, about half the time, finish software tasks that take a human expert roughly 16 hours — or maybe more. - The catch is METR says anything above 16 hours is now hard to measure, so the benchmark may be hitting the ceiling.

A new AI benchmark result landed, and it matters because it gets at something people actually care about — not just whether a model can answer a hard question, but whether it can stay useful across a long, messy job. Anthropic’s Claude Mythos Preview now appears on METR’s public time-horizon chart at roughly the 16-hour mark, which puts it at the frontier of the group’s current measurement range. But the weirder part is that the benchmark may already be running out of room. ### What is METR measuring? METR’s “time horizon” is basically a durability test for software agents. It asks: for tasks that take a human expert some amount of time, how long can an AI keep making the right moves before it falls apart? The headline number people usually focus on is the 50% time horizon — the task length where the model is predicted to succeed half the time. (metr.org) ### Why is 16 hours a big deal? Because this is not a speed score. It is a task-length score. A 16-hour horizon means the model can complete, with about 50% reliability, software tasks that would take a skilled human around 16 hours. That is a lot closer to “take this project and come back tomorrow” than “solve this coding puzzle.” ### What changed with Mythos? METR updated its public chart on May 8, 2026 and added “Claude Mythos Preview (early).” The plotted point sits at the far right of the current scale, and METR added a note saying measurements above 16 hours are unreliable with its current task suite. (metr.org) That makes the result feel less like a clean finish line and more like the model running into the wall of the ruler. ### Is this Anthropic’s own claim? No — the key chart is METR’s. But Anthropic’s own materials line up with the idea that Mythos is a step up in sustained agentic work. The Mythos system card calls it Anthropic’s most capable frontier model to date, and the company’s security writeup says it made a striking leap over Claude Opus 4.6 on many benchmarks while showing unusually strong performance on long, multi-step cybersecurity tasks. (metr.org) ### Why does software matter so much here? Because software tasks are one of the cleanest places to test long-horizon agency. They have real goals, lots of intermediate steps, and clear failure modes. If a model can keep context, recover from mistakes, use tools, and finish a 16-hour-equivalent software job, that is a stronger sign of practical autonomy than acing another multiple-choice benchmark. (www-cdn.anthropic.com) ### So is Mythos definitely a 16-hour model? Probably “at least around there,” but not with perfect precision. METR explicitly warns that its current suite becomes unreliable above 16 hours. So the public takeaway is not “exactly 16.0.” It is “Mythos has reached the edge of what this version of the benchmark can confidently distinguish.” ### Why does that matter beyond leaderboard drama? (metr.org) Longer time horizons change what kinds of work become automatable. A model that can survive longer tasks can do more end-to-end engineering, research, and security work with less babysitting. That is useful, but it also raises the stakes on misuse and oversight — especially for a model Anthropic already describes as unusually capable in cyber contexts. (metr.org) ### Bottom line? The real news is not just that Claude Mythos Preview hit roughly 16 hours on METR. It is that one of the clearest public measures of agent durability may already be saturating. When the ruler stops being long enough, that usually means the underlying thing has changed. (metr.org) (red.anthropic.com)

Claude Mythos hits 16-hour METR

Get your own daily briefing