GPT-5.5 edges Claude Mythos
- OpenAI released GPT-5.5 on April 23 and said it now leads several agentic-work benchmarks, while Anthropic kept Claude Mythos Preview limited and shipped Claude Opus 4.7 instead. - OpenAI’s chart shows GPT-5.5 at 82.7% on Terminal-Bench 2.0 versus Claude Opus 4.7 at 69.4%, with GPT-5.5 matching GPT-5.4 latency in real-world serving. - The comparison lands in a crowded April market of new coding and agent models from Google, Meta, xAI, and Z.ai. (openai.com)
OpenAI launched GPT-5.5 on April 23 and pitched it as a model built for “real work,” with benchmark gains over Claude Opus 4.7 and Gemini 3.1 Pro. (openai.com) OpenAI said GPT-5.5 scores 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, and 84.4% on BrowseComp. In the same chart, Claude Opus 4.7 posts 69.4% on Terminal-Bench 2.0 and 79.3% on BrowseComp, while Gemini 3.1 Pro posts 68.5% and 85.9%. (openai.com) The company also said GPT-5.5 matches GPT-5.4 per-token latency in real-world serving and uses fewer tokens on Codex tasks. OpenAI rolled it out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex, then added API access for GPT-5.5 and GPT-5.5 Pro on April 24. (openai.com 1) (openai.com 2) Anthropic’s closest frontier model is harder to compare directly because Claude Mythos Preview is still a limited release. Anthropic said on April 16 that Claude Opus 4.7 is “less broadly capable” than Mythos Preview and is the model it is broadly shipping while it tests cyber safeguards. (anthropic.com) (red.anthropic.com) That makes the headline less “GPT-5.5 beat Mythos” than “OpenAI broadly shipped a new flagship while Anthropic held back its most powerful preview.” Anthropic said lessons from Opus 4.7’s deployment will inform its “eventual goal” of a broad release of Mythos-class models. (anthropic.com) The underlying contest is about agentic software: models that do more than answer questions and can browse, use tools, write code, check results, and keep working across long tasks. OpenAI said GPT-5.5 was trained and evaluated for coding, computer use, online research, spreadsheets, and document workflows. (openai.com 1) (openai.com 2) April also brought competing pushes from other labs. Google said Gemini 3.1 Pro is aimed at complex tasks and built the new Deep Research agents on it, while Z.ai said GLM-5.1 is designed for long-horizon engineering work and can run on a single task for up to eight hours. (blog.google 1) (blog.google 2) (z.ai) (docs.z.ai) Meta’s most recent April model announcement was Muse Spark for Meta products, not a new Llama 4 release, while xAI’s developer docs now point customers to grok-4.20 as its recommended API model. That leaves a market where the names change quickly, but the buying questions stay fixed: which model is available now, how long can it stay on task, and what safeguards come with it. (about.fb.com) (docs.x.ai) For now, OpenAI has the cleaner claim: a newly shipped model, public benchmark tables, and broad product access. Anthropic’s stronger claim may still sit inside Mythos Preview, but Anthropic has not broadly released that model yet. (openai.com) (anthropic.com)