METR long‑horizon tests show variance

- The METR evaluation ran long‑horizon agent scenarios spanning 2–30 hours and reported wide performance variance across those extended tasks. (x.com)
- In parallel live benchmarks, researchers noted that 13 models topped 66.7% on CRM/HR tasks but scored 0% on management tasks, exposing task‑specific weaknesses. (x.com)
- Those results argue for multi‑axis measurement (stepwise competence, recovery, and constraint handling) rather than a single outcome score. (x.com)
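
To make the multi-axis idea concrete, here is a minimal sketch of scoring one agent run on the three axes named above alongside the usual pass/fail outcome. It is an illustration only, not METR's methodology: the names `Step`, `AxisScores`, and `score_run`, and the specific definitions of each axis, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    ok: bool                           # hypothetical flag: the step achieved its sub-goal
    violated_constraint: bool = False  # hypothetical flag: the step broke a stated rule


@dataclass
class AxisScores:
    outcome: float             # 1.0 if the whole task succeeded, else 0.0
    stepwise: float            # share of steps that succeeded
    recovery: Optional[float]  # share of failed steps followed by a successful one
    constraints: float         # share of steps that broke no constraint


def score_run(steps: List[Step], succeeded: bool) -> AxisScores:
    """Score a single run on several axes instead of one outcome number."""
    if not steps:
        raise ValueError("a run needs at least one step")
    n = len(steps)
    stepwise = sum(s.ok for s in steps) / n
    constraints = sum(not s.violated_constraint for s in steps) / n
    # Only failures that have a following step can be recovered from.
    failures = [i for i in range(n - 1) if not steps[i].ok]
    recovered = sum(steps[i + 1].ok for i in failures)
    # None (not 0.0) when there was nothing to recover from.
    recovery = recovered / len(failures) if failures else None
    return AxisScores(float(succeeded), stepwise, recovery, constraints)


# Two runs that both end in success but look different on the other axes.
clean_run = [Step(True), Step(False), Step(True), Step(True)]
messy_run = [Step(True), Step(False), Step(False), Step(True, violated_constraint=True)]

clean_scores = score_run(clean_run, succeeded=True)
messy_scores = score_run(messy_run, succeeded=True)
print(clean_scores)
print(messy_scores)
```

Both runs would count as a pass under a single outcome score, yet the second recovers from only half of its failures and breaks a constraint along the way, which is the kind of difference a single number hides.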
