GPT-5.5 tops Terminal-Bench, teams report huge context-window gains
- OpenAI launched GPT-5.5 on April 23, then added API access on April 24, pitching it as a stronger model for coding, research, and computer use.
- The headline number is 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7 at 69.4%, while API docs list a 1.05M-token context window.
- That matters because teams can now hand one model much bigger, messier workflows without slowing down to constant prompt management.

OpenAI’s GPT-5.5 story is really about agentic work: models that don’t just answer, but keep going. That’s the gap everyone has been chasing. Benchmarks have improved for months, but the practical pain point stayed the same: once a task got long, messy, and tool-heavy, models lost the thread or got expensive fast. On April 23, 2026, OpenAI said GPT-5.5 pushes that boundary forward, and the interesting part is not just the scores. It’s the combination of stronger terminal performance, long context, and roughly unchanged serving speed. (openai.com)

### What actually launched?

GPT-5.5 rolled out first in ChatGPT and Codex for Plus, Pro, Business, and Enterprise users, with GPT-5.5 Pro going to Pro, Business, and Enterprise. OpenAI then updated the launch post on April 24 to say GPT-5.5 and GPT-5.5 Pro were also available in the API. That date matters because the early chatter mixed together ChatGPT availability, Codex avail[…]not. (openai.com)

### Why are people focused on Terminal-Bench?

Terminal-Bench 2.0 is a useful stress test for agentic coding because it asks a model to work inside realistic terminal environments instead of just emitting pretty code in a vacuum. GPT-5.5 scored 82.7% there, up from GPT-5.4’s 75.1% and ahead of Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%. Basically, that suggests GPT[…]ols, checking outputs, recovering from mistakes, and finishing. (openai.com)

### Is the context window really 400K?

Sort of, but only in some places. OpenAI’s product post says GPT-5.5 in Codex has a 400K context window. The current API model page lists a much larger 1,050,000-token context window, plus special pricing once prompts go past 272K input tokens. So the clean way to read this is: 400K is the Codex product number people are talking about, while the API spec is now over 1M. (openai.com)

### Why does long context matter so much?

Because prompt management is the hidden tax on agent workflows.
If a model can hold a huge repo, a long research brief, tool outputs, and its own intermediate state in one working memory, you stop chunking everything by hand. That changes the shape of the job. Instead of babysitting a sequence of tiny prompts, teams can hand over a […]y, and revise with fewer resets. (openai.com)

### Did OpenAI say anything about speed?

Yes, and this is one of the more important claims. OpenAI says GPT-5.5 matches GPT-5.4 on per-token latency in real-world serving while operating at a higher capability level. It also says the model uses significantly fewer tokens to complete the same Codex tasks. If that holds in practice, the gain is not just “smarter model.” It is “smarter without the usual slowdown penalty.” (openai.com)

### What’s the catch?

The catch is that benchmark talk can blur together. The widely repeated Terminal-Bench lead is clearly in OpenAI’s launch materials. But some other numbers floating around, especially SWE-Bench Pro comparisons, come from different scaffolds, self-reports, or third-party summaries. So the solid takeaway is narrower: GPT-5.5 looks meaningfully better f[…]s still depend on how the test is run. (openai.com)

### Why does this change how teams evaluate models?

Because the unit of comparison is shifting from “best answer to one prompt” to “best finisher of one messy task.” That is a different contest. A model with stronger terminal behavior, better tool use, and very long context can win even if a smaller benchmark delta looks modest on paper. The practical question becomes: can […] that bar than the models right before it. (openai.com)

### Bottom line?

This is less about one flashy benchmark and more about a threshold crossing. GPT-5.5 pairs a strong Terminal-Bench result with long-context handling and stable speed, which is exactly the combo teams need for autonomous coding and research workflows.
The model race is starting to look less like a chatbot race and more like a “who can run the whole job” race. (openai.com)
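
### What does the context math look like in practice?

For teams trying to work out which of the context-window numbers above applies to them, a rough budgeting check can help. The sketch below is illustrative only: the three constants are the figures quoted in this article (400K for Codex, 1,050,000 for the API, 272K input tokens as the pricing threshold), while `estimate_tokens`, `check_context`, `ContextReport`, and the four-characters-per-token heuristic are hypothetical helpers and assumptions, not part of any OpenAI SDK. It also ignores the room a model needs for its own output, so treat the result as an upper bound on what fits.

```python
from dataclasses import dataclass

# Figures quoted in the article; check the current docs before relying on them.
CODEX_CONTEXT_TOKENS = 400_000
API_CONTEXT_TOKENS = 1_050_000
LONG_PROMPT_PRICING_THRESHOLD = 272_000  # input tokens

# Assumption: ~4 characters per token is a crude heuristic, not a real tokenizer.
ASSUMED_CHARS_PER_TOKEN = 4


@dataclass
class ContextReport:
    estimated_tokens: int
    fits_codex: bool
    fits_api: bool
    past_pricing_threshold: bool


def estimate_tokens(text: str) -> int:
    """Rough token estimate (hypothetical helper); use a real tokenizer for billing."""
    return -(-len(text) // ASSUMED_CHARS_PER_TOKEN)  # ceiling division


def check_context(estimated_tokens: int) -> ContextReport:
    """Compare an input-token estimate against the windows quoted in the article."""
    return ContextReport(
        estimated_tokens=estimated_tokens,
        fits_codex=estimated_tokens <= CODEX_CONTEXT_TOKENS,
        fits_api=estimated_tokens <= API_CONTEXT_TOKENS,
        past_pricing_threshold=estimated_tokens > LONG_PROMPT_PRICING_THRESHOLD,
    )


if __name__ == "__main__":
    # A prompt estimated at 500K tokens: too big for the Codex figure,
    # within the API figure, and past the special-pricing threshold.
    print(check_context(500_000))
```

The point of keeping the three numbers separate is that "fits" and "cheap" are different questions: a prompt can sit comfortably inside the API window and still land in the higher pricing tier.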