GPT-5.5 solves ProgramBench task

- Quasa said on May 22 that GPT-5.5 solved a task on ProgramBench, a coding benchmark on which previously evaluated models had failed to fully resolve any task.
- ProgramBench’s May 5 paper said no tested model fully resolved any task; Quasa’s May 22 post said GPT-5.5 beat Claude Opus 4.7.
- ProgramBench’s paper, code and leaderboard remain available through the project repository and linked website for follow-up checks.

Quasa reported on May 22 that OpenAI’s GPT-5.5 solved what it described as its first real ProgramBench task and outperformed Anthropic’s Claude Opus 4.7 on the metrics shown in Quasa’s write-up. The claim matters because ProgramBench is a new software-engineering benchmark built around reconstructing full programs from binaries and documentation rather than patching an existing codebase. The benchmark’s authors said in a May 5 paper that none of the nine language models they evaluated fully resolved any task. OpenAI released GPT-5.5 on April 23, and Anthropic released Claude Opus 4.7 on April 16.

### What is ProgramBench actually testing?

ProgramBench asks AI agents to rebuild a complete codebase from a compiled executable and its documentation, according to the project’s GitHub repository and paper. The benchmark uses end-to-end behavioral tests generated through agent-driven fuzzing, and its 200 tasks range from compact command-line tools to larger software including FFmpeg, SQLite and the PHP interpreter. (arxiv.org)

The May 5 paper by John Yang, Kilian Lieret, Jeffrey Ma and co-authors said existing benchmarks tend to measure narrower tasks such as bug fixing or feature work inside an existing repository. ProgramBench was introduced to measure whether software agents can make higher-level architecture and implementation decisions when building software holistically, the authors wrote. (github.com)

### Why did Quasa’s May 22 post stand out?

Quasa’s May 22 article said GPT-5.5 had solved its first ProgramBench task and had beaten Claude Opus 4.7 on code-correctness and execution-time measures in tables published with the story. The article also said it included method notes describing how the comparison was run. (arxiv.org) The underlying ProgramBench paper had set a low baseline for success. The authors wrote that none of the nine evaluated models fully resolved any task and that the best model passed 95% of tests on only 3% of tasks. Against that backdrop, any report of a fully solved task is a notable change from the paper’s initial results. (quasa.io)

### How do GPT-5.5 and Claude Opus 4.7 fit into this?

OpenAI said on April 23 that GPT-5.5 was built for complex work across tools, including coding, research and data analysis. OpenAI made GPT-5.5 available in ChatGPT on April 23 and said on April 24 that GPT-5.5 and GPT-5.5 Pro were available in the API.

Anthropic said on April 16 that Claude Opus 4.7 was generally available and that it improved on Opus 4.6 in advanced software engineering, especially on harder tasks. (arxiv.org) Anthropic also said Opus 4.7 was designed for long-running coding work with sustained effort in larger codebases.

### What can be verified from primary sources, and what cannot?

The ProgramBench project and paper can be verified directly. (openai.com) They show the benchmark’s design, the authors, the May 5 publication date, and the paper’s statement that no tested model fully resolved any task in the original evaluation.

Quasa’s exact benchmark tables could not be independently inspected for this briefing, so the specific numerical margins in its May 22 comparison could not be confirmed directly from the article text. (anthropic.com) What can be verified is that Quasa published coverage comparing GPT-5.5 and Claude Opus 4.7, and that ProgramBench’s official paper describes the benchmark as difficult enough that all of the initially evaluated models fell short of full task resolution. (github.com)

### Where would readers look next?

The ProgramBench repository links to the project website, paper, leaderboard and usage guide, which are the clearest places to watch for updated results. OpenAI’s GPT-5.5 release page and Anthropic’s Claude Opus 4.7 announcement remain the primary sources for the two models named in Quasa’s comparison. (github.com) (quasa.io)
