Claude Mythos exceeds METR time horizon

- METR updated its frontier AI time-horizon tracker on May 8 and placed Anthropic’s Claude Mythos Preview beyond the benchmark’s 16-hour measurement ceiling.
- The key catch is methodological, not just bragging rights: METR now labels results above 16 hours “unreliable with our current task suite.”
- That matters because METR’s own earlier trend already showed horizons doubling about every 7 months, and the newest chart now looks faster.

METR’s time-horizon chart is one of the clearest ways to track what frontier models are getting better at. Not raw benchmark points. Not vibes. The question is simpler: how long a real task, measured in human expert hours, can a model finish on its own before it falls apart. This week, that chart got a new awkward data point: Anthropic’s Claude Mythos Preview landed past the top of the scale.

### What is the “time horizon” here?

METR defines a model’s task-completion time horizon as the length of a task where the model can succeed with a given reliability, usually 50% or 80%. The tasks are software-heavy and varied, and METR fits a curve from many task results rather than saying “the model did one 12-hour thing once.” Basically, it is trying to answer a practical question: how long can an agent keep working productively without supervision? (metr.org)

### What changed this week?

On May 8, 2026, METR updated its public time-horizons page and added “Claude Mythos Preview (early)” to the frontier chart. The model sits beyond the 16-hour mark on the 50% success graph, and METR added a warning right on the chart that measurements above 16 hours are unreliable with the current task suite. That is the actual news: not that Mythos “proved” a clean 16-plus-hour capability, but that it pushed past what METR says its present benchmark can confidently resolve. (metr.org)

### Why is 16 hours such a weird boundary?

Because the benchmark is running out of runway. METR’s current suite has been expanded, but even after the January TH1.1 update it still only has a limited number of very long tasks. METR said it increased the suite from 170 to 228 tasks and raised the count of tasks estimated at 8 hours or longer from 14 to 31. That helps, but if a model starts clustering at the top end, the benchmark stops being a ruler and starts being a ceiling. (metr.org)

### So did Mythos “beat” the benchmark?

In one sense, yes: it exceeded the benchmark’s comfortable measuring range. But the catch is important. Crossing the ceiling is not the same thing as establishing a precise new number. The right read is that Claude Mythos looks stronger than the current long-task suite can cleanly quantify, not that we now know its exact autonomous limit. METR is being unusually explicit about that. (metr.org)

### Why do people care about this metric?

Because time horizon maps better to real work than most AI evals do. A model that can reliably finish a 10-minute task is useful. A model that can stay coherent across multi-hour debugging, research, or implementation work changes the shape of products and teams. METR’s original 2025 writeup made the bigger point plainly: this metric had been rising exponentially for years, with a doubling time around 7 months. (metr.org)

### Why does this feel faster now?

The public chatter is reacting to the slope, not just the level. Earlier METR work already hinted that growth might be accelerating relative to the older six-year trend, and the newest chart visually compresses the gaps between frontier releases near the top end. That does not mean every claimed “89-day doubling” is now settled fact. But it does mean the frontier is moving fast enough that the benchmark itself needs another upgrade. (metr.org)

### What does this mean for startups?

If your product roadmap assumes frontier models improve on a slow, yearly cadence, this kind of result is bad news. The window between “barely useful” and “good enough to automate a chunk of the workflow” can shrink fast. But the opposite is also true: if benchmarks are saturating, companies can overread splashy numbers and build around capabilities that are still noisy at the edge. (metr.org)

### Bottom line?

Claude Mythos did not just post a bigger score. It exposed a measurement problem. And when the benchmark starts topping out, that usually means the capability race has moved to a new phase. (metr.org)
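
For readers who want to see what a doubling time implies in practice, here is a minimal sketch of the arithmetic, assuming steady exponential growth. The function names and the 1-hour starting horizon are hypothetical, chosen only for illustration; they are not METR measurements or METR code. The 7-month figure is the trend cited above, and the 89-day figure is the unsettled claim mentioned above, included only for comparison.

```python
import math

def projected_horizon(start_hours, months_elapsed, doubling_months):
    """Time horizon after `months_elapsed`, assuming it doubles every `doubling_months`."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

def months_to_reach(start_hours, target_hours, doubling_months):
    """Months needed to grow from `start_hours` to `target_hours` at a fixed doubling time."""
    return doubling_months * math.log2(target_hours / start_hours)

# Assumption: an average month of 30.44 days, used only to convert 89 days to months.
DAYS_PER_MONTH = 30.44
scenarios = {
    "7-month doubling (trend cited in the article)": 7.0,
    "89-day doubling (claimed, not settled)": 89 / DAYS_PER_MONTH,
}

start = 1.0     # hypothetical starting horizon in hours, not a measured value
ceiling = 16.0  # the 16-hour mark discussed in the article

for label, doubling in scenarios.items():
    months = months_to_reach(start, ceiling, doubling)
    print(f"{label}: about {months:.1f} months to go from {start:g}h to {ceiling:g}h")
```

Going from 1 hour to 16 hours takes four doublings, so the gap is 28 months at a 7-month doubling time and roughly 356 days at an 89-day doubling time. The sketch says nothing about which rate is correct; it only shows why the choice of slope matters so much for planning.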
