GPT‑5.5 cracks ProgramBench
- OpenAI’s GPT-5.5 became the first model to fully resolve a ProgramBench task in May 2026, according to benchmark materials and third-party reports. (arxiv.org)
- Meta FAIR’s May 6 paper said no tested model fully resolved any task; third-party reports later said GPT-5.5 solved one of 200 tasks. (arxiv.org)
- ProgramBench code and leaderboard links are published in Meta FAIR’s repository, while OpenAI lists GPT-5.5 in ChatGPT, Codex and its API. (github.com)

Meta FAIR’s ProgramBench arrived on May 6 with a stark result: none of the nine language models in the paper fully resolved a single task. The benchmark asks agents to rebuild an entire program from a compiled binary and its documentation, then pass end-to-end behavioral tests generated through fuzzing. (arxiv.org)

OpenAI released GPT-5.5 on April 23 and described the model as stronger on agentic coding while using fewer tokens on Codex tasks. Within days of ProgramBench’s release, third-party reports said GPT-5.5 had become the first model to fully resolve one ProgramBench task. (github.com)

### What does ProgramBench actually test that older coding benchmarks do not?

ProgramBench’s authors, including John Yang, Kilian Lieret and Ofir Press of Meta FAIR, along with collaborators at Stanford and Harvard, framed the benchmark as a test of holistic software engineering rather than localized bug-fixing. Their paper says agents receive only a program and its documentation and must produce source code and a build script that recreate the reference executable’s behavior. The benchmark contains 200 tasks drawn from open-source software, according to the paper and repository. (arxiv.org)

The authors wrote that earlier coding benchmarks often measured narrower work such as generating a function or fixing a single issue inside an existing codebase, while ProgramBench forces a model to choose architecture, language and implementation strategy on its own.

### Why was a single full solve notable?

Meta FAIR’s May 6 paper said the best model in its initial evaluation passed 95% of tests on only 3% of tasks and that none fully resolved any task. (arxiv.org) That made the first reported full resolution notable because it moved the benchmark from a zero-solve regime to at least one completed case.

Phemex News reported on May 13 that GPT-5.5 had become the first AI system to achieve a perfect score on the ProgramBench challenge.
That report said the benchmark had been developed by Meta FAIR, Stanford and Harvard and described the task as reconstructing programs from compiled binaries without source code. (arxiv.org)

### What is verified, and what remains less certain?

The May 6 arXiv paper verifies the benchmark design, the 200-task scope and the original zero-full-resolution result. OpenAI’s April 23 product post verifies that GPT-5.5 was released and that the company claims stronger coding and better token efficiency on Codex tasks. (arxiv.org)

The specific claim that GPT-5.5 outperformed Claude Opus 4.7 on ProgramBench with fewer reasoning steps and multi-language flexibility is not described in the sources I could verify directly from Meta or OpenAI. Third-party coverage and reposted summaries describe those details, but I could not independently confirm them from an official ProgramBench leaderboard page or a benchmark post accessible through the site during reporting. (phemex.com) (arxiv.org)

### How does this fit with OpenAI’s own positioning for GPT-5.5?

OpenAI said on April 23 that GPT-5.5 was built for “complex tasks like coding, research, and data analysis across tools.” The company also said the model “uses significantly fewer tokens” to complete the same Codex tasks and published comparison tables showing GPT-5.5 ahead of Claude Opus 4.7 on several other agentic benchmarks, including Terminal-Bench 2.0.

That does not by itself prove the ProgramBench-specific comparisons circulating online. It does show that OpenAI was already presenting GPT-5.5 as a model optimized for longer-horizon tool use and coding work before the ProgramBench reports appeared. (phemex.com)

### Where can readers check the underlying materials?

Meta FAIR’s GitHub repository links to the ProgramBench paper, website, leaderboard and usage guide. The repository says the benchmark is available as an installable package and lists recent releases, including version 1.0.2 three days before this report.
(openai.com)

OpenAI’s product page says GPT-5.5 rolled out to Plus, Pro, Business and Enterprise users in ChatGPT and Codex on April 23, with API availability updated on April 24. Those are the primary places to watch for any official replication notes, benchmark reports or customer case studies tied to ProgramBench-style coding evaluations. (openai.com) (github.com)
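
For readers who want a concrete picture of the end-to-end behavioral testing the paper describes, the Python sketch below runs a reference command and a candidate command on the same inputs and reports the fraction on which they behave identically. It is not the ProgramBench harness. The function names, the decision to compare only exit code and standard output, the all-tests-must-pass criterion, and the two stand-in commands are assumptions made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What this sketch treats as observable behavior (an assumption)."""
    exit_code: int
    stdout: bytes


def observe(command: list[str], stdin_data: bytes, timeout: float = 10.0) -> Observation:
    """Run one command on one input and record exit code and stdout."""
    result = subprocess.run(command, input=stdin_data, capture_output=True, timeout=timeout)
    return Observation(result.returncode, result.stdout)


def pass_rate(reference: list[str], candidate: list[str], inputs: list[bytes]) -> float:
    """Fraction of inputs on which the candidate matches the reference."""
    if not inputs:
        raise ValueError("need at least one test input")
    matches = sum(observe(reference, data) == observe(candidate, data) for data in inputs)
    return matches / len(inputs)


def fully_resolved(rate: float) -> bool:
    """Hypothetical criterion: every behavioral test must pass."""
    return rate == 1.0


# Stand-in programs (hypothetical): the reference upper-cases its input,
# while the candidate also strips trailing whitespace, a subtle mismatch.
REFERENCE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper().rstrip())"]
INPUTS = [b"abc", b"hello world", b"trailing \n", b"x"]

if __name__ == "__main__":
    rate = pass_rate(REFERENCE, CANDIDATE, INPUTS)
    print(f"pass rate: {rate:.2f}, fully resolved: {fully_resolved(rate)}")
```

In this toy setup the candidate matches the reference on three of the four inputs, so it scores well on pass rate but is not fully resolved, which mirrors the gap the paper reports between near-complete and complete solutions.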