OpenAI GPT‑5.5 bests Opus 4.7
- OpenAI’s GPT‑5.5 pulled ahead of Anthropic’s Opus 4.7 in fresh coding and agent benchmarks published May 1-2, 2026 by ARC Prize and developers.
- The cleanest gap came on Terminal‑Bench 2.0: GPT‑5.5 scored 82.7% versus Opus 4.7 at 69.4%, while ARC‑AGI‑3 showed 0.43% versus 0.18%.
- The bigger lesson is routing: GPT‑5.5 looks stronger for terminal-heavy shipping work, but model choice still depends on codebase, cost, and latency.

Coding models are starting to split into specialties. That matters because teams are no longer asking which model is smartest in the abstract; they’re asking which one actually ships code, survives review, and doesn’t waste time. The new wrinkle this week is that OpenAI’s GPT‑5.5 appears to have taken a real lead over Anthropic’s Opus 4.7 on several agent-style coding tasks, especially the ones that look like actual terminal work rather than isolated puzzle solving. (xda-developers.com)

### What changed this week?

Two useful comparisons landed on May 1 and May 2, 2026. ARC Prize published a breakdown of GPT‑5.5 and Opus 4.7 on ARC‑AGI‑3, and a separate real-repo coding writeup compared the models across 56 tasks pulled from two open-source repositories. Around the same time, GPT‑5.5’s Terminal‑Bench 2.0 score became the number everybody latched onto because it maps closely to how terminal coding agents are actually used. (arcprize.org)

### Why is Terminal‑Bench the eye-catcher?

Because it tests command-line workflows, the bread and butter of tools like Codex CLI, Claude Code, and OpenCode. In the comparison cited this week, GPT‑5.5 scored 82.7% and Opus 4.7 scored 69.4%. That 13.3-point gap is big enough to matter in practice, not just on a leaderboard, because these tools spend their lives reading files, editing code, running commands, and reacting to failures. (xda-developers.com)

### What about the 56-task repo test?

That one is more grounded than a public benchmark, which is why people are paying attention to it. The setup used 56 real coding tasks from two open-source repos, 27 from Zod and 29 from another repository, and ran each model in its native harness, with Opus 4.7 in Claude Code and GPT‑5.5 in Codex CLI. In that study, GPT‑5.5’s patches survived code review more often. (blog.donweb.com)

### So did Opus lose everywhere?

No, and that’s the part people flatten too quickly. In the same repo study, Opus 4.7 wrote patches that were 30% to 40% smaller.
Sometimes that’s exactly what you want. Smaller diffs can mean less review overhead and less risk of collateral damage. But in at least one repo, that compactness turned into under-implementation: the patch was neat but incomplete. (blog.donweb.com)

### What does ARC‑AGI‑3 add here?

ARC‑AGI‑3 is not a coding benchmark. It’s a novelty-and-adaptation test built around unfamiliar interactive environments. Humans can solve all of its environments, while frontier models are still below 1%, so everyone is bad there in absolute terms. But the relative gap still tells you something: GPT‑5.5 scored 0.43% and Opus 4.7 scored 0.18%, suggesting GPT‑5.5 did somewhat better on adaptation in that setup. (arcprize.org)

### Why doesn’t one benchmark settle it?

Because benchmarks compress behavior into one number. Two models can look close overall and still fail in very different ways on your codebase. The repo study makes this point directly: public scores like SWE-bench can miss the difference between a model that writes bigger but shippable patches and one that writes elegant but occasionally incomplete ones. […] for this workflow.” (blog.donweb.com)

### What should teams actually do?

Route by task. Use GPT‑5.5 when the job is terminal-heavy, multi-step, and judged by whether code really lands. Keep testing Opus 4.7 where concise edits and a lower code footprint matter more. And run evals on your own repos, because neither public leaderboards nor vendor demos can tell you how a model behaves inside your actual review culture. (blog.donweb.com)

### Bottom line

The story is not that one lab permanently won. It’s that GPT‑5.5 just posted a more convincing case for real-world coding work than Opus 4.7 did this week, and the era of one-model-fits-all is ending. (arcprize.org)
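
### What could routing look like in code?

To make the "route by task, then measure on your own repos" advice concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the model labels are plain strings rather than real API identifiers, the `Task` and `EvalResult` fields are hypothetical, and the fallback model is an arbitrary choice. None of it comes from the benchmarks or writeups cited above.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration; not real API model identifiers.
GPT = "gpt-5.5"
OPUS = "opus-4.7"


@dataclass
class Task:
    name: str
    terminal_heavy: bool       # runs commands and reacts to failures
    multi_step: bool           # needs several dependent steps to land
    prefers_small_diff: bool   # a concise edit matters more than breadth


def route(task: Task, default: str = GPT) -> str:
    """Pick a model label using the article's rule of thumb.

    The default is an assumption; set it from your own eval results.
    """
    if task.terminal_heavy and task.multi_step:
        return GPT
    if task.prefers_small_diff:
        return OPUS
    return default


@dataclass
class EvalResult:
    model: str
    merged: bool      # did the patch survive review in your repo?
    diff_lines: int   # size of the patch


def summarize(results: list[EvalResult]) -> dict[str, dict[str, float]]:
    """Per-model task count, review pass rate, and mean diff size."""
    summary: dict[str, dict[str, float]] = {}
    for model in sorted({r.model for r in results}):
        rows = [r for r in results if r.model == model]
        summary[model] = {
            "tasks": len(rows),
            "pass_rate": sum(r.merged for r in rows) / len(rows),
            "mean_diff_lines": sum(r.diff_lines for r in rows) / len(rows),
        }
    return summary


if __name__ == "__main__":
    # Two made-up tasks, only to show how the rule is applied.
    print(route(Task("fix flaky CI job", True, True, False)))
    print(route(Task("rename a helper", False, False, True)))
```

Tracking both pass rate and diff size in `summarize` mirrors the trade-off described above: a model can produce smaller patches and still land fewer of them, so either number alone can mislead.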