METR long‑horizon tests show variance
- The METR evaluation ran long‑horizon agent scenarios spanning 2–30 hours and reported wide performance variance across those extended tasks. (x.com)
- In parallel live benchmarks, researchers noted that 13 models topped 66.7% on CRM/HR tasks but scored 0% on management tasks, exposing task‑specific weaknesses. (x.com)
- Those results argue for multi‑axis measurement (stepwise competence, recovery, and constraint handling) rather than a single outcome score. (x.com)
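
To make the multi‑axis idea concrete, here is a minimal sketch of how one agent run could be scored on the three axes named above alongside the usual pass/fail outcome. It is illustrative only and is not METR's or any benchmark's actual method: the `Step` record, its fields, and the `score_episode` function are hypothetical names, and the ratio-based definitions of each axis are assumptions chosen for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    """One logged agent step (hypothetical schema, assumed for this sketch)."""
    succeeded: bool                    # did the step achieve its sub-goal?
    recovered: bool = False            # after a failure, did the agent later recover?
    violated_constraint: bool = False  # did the step break a stated task constraint?


def score_episode(steps: List[Step], task_completed: bool) -> Dict[str, Optional[float]]:
    """Return one score per axis instead of a single outcome number."""
    total = len(steps)
    if total == 0:
        raise ValueError("episode has no steps")

    failures = [s for s in steps if not s.succeeded]
    violations = sum(s.violated_constraint for s in steps)

    return {
        # Traditional single outcome score: did the whole task finish?
        "outcome": 1.0 if task_completed else 0.0,
        # Share of individual steps that succeeded.
        "stepwise_competence": sum(s.succeeded for s in steps) / total,
        # Share of failed steps the agent recovered from; None if nothing failed.
        "recovery": (sum(s.recovered for s in failures) / len(failures)) if failures else None,
        # Share of steps that respected the task constraints.
        "constraint_handling": 1 - violations / total,
    }


# Example: a run that fails overall but still shows partial competence.
episode = [
    Step(succeeded=True),
    Step(succeeded=False, recovered=True),
    Step(succeeded=True, violated_constraint=True),
    Step(succeeded=False),
]
report = score_episode(episode, task_completed=False)
```

In this example the outcome score alone would read as a flat failure, while the per‑axis report separates how many steps worked, how often the agent recovered, and how well it respected constraints.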