OpenAI retires SWE-bench Verified
- OpenAI said on February 23 it will stop using SWE-bench Verified for frontier coding claims after finding the benchmark now mismeasures model progress.
- In an audit of 138 hard tasks, OpenAI said 59.4% had flawed tests, and every frontier model it tested showed signs of training contamination.
- SWE-bench Verified launched in August 2024 as a 500-task subset of SWE-bench, but OpenAI now recommends SWE-bench Pro. (openai.com)
OpenAI said on February 23 that SWE-bench Verified no longer measures frontier coding capability well enough to use in new model launches. (openai.com)

SWE-bench is a coding benchmark built from real GitHub bug reports and fixes. A model gets an issue description and a code repository, then its patch is judged by whether hidden tests pass. (openai.com) (github.com)

OpenAI created SWE-bench Verified in August 2024 as a cleaner 500-task subset of the original benchmark after reviewing 1,699 tasks with human annotators. Each task was checked by three experts to filter out impossible or misleading cases. (openai.com)

By February 2026, OpenAI said progress on SWE-bench Verified had slowed, with state-of-the-art scores rising from 74.9% to 80.9% over the previous six months. That pushed the company to ask whether the remaining misses were model failures or benchmark failures. (openai.com)

OpenAI then audited the 138 tasks that models frequently failed, a 27.6% slice of the 500-task dataset. In that review, it said at least 59.4% of the audited problems had flawed tests that rejected functionally correct fixes. (openai.com)

The company broke those flaws into two main buckets. It said 35.5% of the reviewed tasks used overly narrow tests tied to a specific implementation, while 18.8% checked for behavior not described in the issue itself. (openai.com)

OpenAI gave one example from pylint, where tests imported a new function named `get_annotation` even though that function name never appeared in the problem statement. A model could solve the bug another way and still fail the benchmark. (openai.com)

The audit also found a second problem: contamination. OpenAI said every frontier model it tested could reproduce the original human-written “gold patch” or verbatim details from some problem statements, suggesting the benchmark had leaked into training data. (openai.com)

That matters because SWE-bench problems come from open-source repositories widely used in model training. OpenAI said models that had seen those problems during training were more likely to pass tests, especially when the tests were underspecified. (openai.com)

OpenAI did not say SWE-bench Verified is useless for all comparisons. It said the benchmark no longer works for measuring progress at today’s frontier level, and it recommended SWE-bench Pro as the public benchmark to report instead. (openai.com)

The shift reverses course on a metric OpenAI itself introduced as more reliable less than two years ago. The company now says the test that once helped track coding progress is too brittle and too contaminated to keep serving as the headline score. (openai.com 1) (openai.com 2)
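
For readers who want to see how a test "tied to a specific implementation" can reject a working fix, here is a minimal toy sketch in Python. It is not pylint code and not a real SWE-bench task. The two fixes, the `describe` function, and both test functions are hypothetical and invented for illustration. Only the helper name `get_annotation` comes from OpenAI's example.

```python
from types import SimpleNamespace

# Toy illustration only: hypothetical code, not taken from pylint or SWE-bench.

def make_fix_a():
    """Fix A: solves the bug by adding a helper named get_annotation."""
    def get_annotation(node):
        return node.get("annotation")

    def describe(node):
        return f"{node['name']}: {get_annotation(node) or 'unknown'}"

    return SimpleNamespace(get_annotation=get_annotation, describe=describe)

def make_fix_b():
    """Fix B: same observable behavior, but with no helper function."""
    def describe(node):
        return f"{node['name']}: {node.get('annotation') or 'unknown'}"

    return SimpleNamespace(describe=describe)

def behavioral_test(fix):
    """Checks only the behavior a bug report would describe."""
    return (
        fix.describe({"name": "x", "annotation": "int"}) == "x: int"
        and fix.describe({"name": "y"}) == "y: unknown"
    )

def narrow_test(fix):
    """Requires a helper name that the bug report never mentions."""
    helper = getattr(fix, "get_annotation", None)
    if helper is None:
        return False  # a working fix is rejected for lacking the helper
    return helper({"annotation": "int"}) == "int"

# Map each fix to (passes behavioral test, passes narrow test).
results = {
    name: (behavioral_test(fix), narrow_test(fix))
    for name, fix in [("fix_a", make_fix_a()), ("fix_b", make_fix_b())]
}
print(results)
```

In this sketch both fixes pass the behavioral check, but only the one that happens to use the expected helper name passes the narrow check. That is the kind of false failure the audit describes.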