Scale publishes SWE‑Bench Pro findings
- Scale AI highlighted four ICML 2026 papers, including SWE‑Bench Pro, a suite of 1,865 enterprise tasks on which model performance drops as task complexity rises. (x.com)
- Accuracy on SWE‑Bench Pro declines on higher‑order problems, while OEC imitation learning produced gains of roughly +13–14% in those tests. (x.com)
- The papers suggest that large public benchmarks can mask failure modes on complex, production‑grade tasks at scale. (x.com)