Robustness drop noted
- Social reports show state-of-the-art models lost about 22.8% in success rate on recent robustness tests. (x.com)
- Practitioners argue that live, workflow-level evaluations now prioritize grounding and latency over leaderboard performance. (x.com)
- That combination is pushing teams to choose model updates based on production-grounded metrics rather than offline benchmarks alone. (x.com) (x.com)