Outcome‑only metrics miss step failures
- Researchers warned that evaluating only final outcomes lets agents pass by luck while internal steps and tool calls remain broken and unobserved. (x.com)
- They advocate step‑level and trajectory evaluations that log intermediate actions, retries and tool errors so brittle behaviors surface. (x.com)
- Adopting those granular metrics makes it possible to attach concrete runtime alarms and fix pipelines before user‑visible failures occur. (x.com)
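
As a rough illustration of the step-level logging described above, here is a minimal Python sketch. Everything in it is an assumption made for the example rather than something from the cited posts: the `Step` and `Trajectory` schema, the `step_metrics` and `alarms` helpers, the step names, and the thresholds are all hypothetical. It shows a run whose final outcome passes while an intermediate tool call fails, and how per-step metrics can raise an alarm that an outcome-only check would miss.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One intermediate action in an agent run (hypothetical schema)."""
    name: str
    ok: bool
    retries: int = 0
    error: Optional[str] = None


@dataclass
class Trajectory:
    """Ordered log of steps plus the final outcome of the run."""
    steps: List[Step] = field(default_factory=list)
    final_ok: bool = False

    def log(self, name, ok, retries=0, error=None):
        self.steps.append(Step(name, ok, retries, error))


def step_metrics(traj):
    """Summarize a trajectory at step level, not just by its final outcome."""
    total = len(traj.steps)
    failed = sum(1 for s in traj.steps if not s.ok)
    return {
        "final_ok": traj.final_ok,
        "steps": total,
        "failed_steps": failed,
        "retries": sum(s.retries for s in traj.steps),
        "step_failure_rate": failed / total if total else 0.0,
    }


def alarms(metrics, max_failure_rate=0.2, max_retries=3):
    """Return the names of alarms that fire; thresholds are illustrative only."""
    fired = []
    if metrics["step_failure_rate"] > max_failure_rate:
        fired.append("step_failure_rate")
    if metrics["retries"] > max_retries:
        fired.append("retries")
    # The case outcome-only metrics hide: success despite broken steps.
    if metrics["final_ok"] and metrics["failed_steps"] > 0:
        fired.append("passed_with_broken_steps")
    return fired


# Example run: the tool call times out, a fallback happens to succeed.
traj = Trajectory()
traj.log("search_tool", ok=False, retries=2, error="timeout")
traj.log("fallback_guess", ok=True)
traj.final_ok = True

metrics = step_metrics(traj)
fired = alarms(metrics)
print(metrics)
print(fired)
```

In this sketch an outcome-only check would record the run as a success, while the step-level summary flags both the failure rate and the fact that the run passed with a broken step. In practice the thresholds and the step schema would depend on the pipeline being monitored.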