Composer 2.5 nearly ties Claude
- Cursor released Composer 2.5 on May 18, 2026, saying its in-house coding model improved on Composer 2 and posted near-parity public benchmark scores.
- Terminal-Bench was the closest result: Composer 2.5 scored 69.3% versus Claude Opus 4.7 at 69.4%, according to benchmark figures circulated Tuesday.
- Cursor’s changelog says Composer 2.5 is live now, while Anthropic continues offering Opus 4.7 across Claude, its API, and cloud partners.
Cursor released Composer 2.5 on May 18, 2026, and said the new version of its in-house coding model improves on Composer 2 in “intelligence and behavior.” Public benchmark figures circulated alongside the release showed Composer 2.5 running close to Anthropic’s Claude Opus 4.7 on two coding-heavy tests: 69.3% versus 69.4% on Terminal-Bench, and 79.8% versus 80.5% on SWE-Bench Multilingual.

Those numbers matter because Cursor has been building Composer as a first-party model for its agent loop rather than relying only on outside frontier providers. In March, Cursor said Composer 2 scored 61.7 on Terminal-Bench and 73.7 on SWE-Bench Multilingual, which means the 2.5 release represents another step up from the prior public baseline.

### How close is “nearly tied” here?

Terminal-Bench is where the gap is smallest. (cursor.com) Composer 2.5’s reported 69.3% trails Claude Opus 4.7’s 69.4% by 0.1 percentage point, according to the benchmark figures published around the release. SWE-Bench Multilingual is wider but still close, with Composer 2.5 at 79.8% and Opus 4.7 at 80.5%.

Anthropic launched Claude Opus 4.7 on April 16, 2026, and described it as a stronger model for advanced software engineering and long-running agentic work. (cursor.com) Anthropic said Opus 4.7 is available across Claude products, its API, Amazon Bedrock, Google Cloud Vertex AI and Microsoft Foundry.

### What exactly did Cursor say changed in Composer 2.5?

Cursor’s May 18 changelog described Composer 2.5 as “a substantial improvement in intelligence and behavior” over Composer 2. (officechai.com) The company said the model is better at sustained work on long-running tasks, follows complex instructions more reliably and is “more pleasant to collaborate with.”

March’s Composer 2 materials help show the trajectory.
Cursor said then that Composer 2’s gains came from continued pretraining and reinforcement learning on long-horizon coding tasks, and its technical report said the model was trained to operate in realistic Cursor sessions with the same tools and harness used in deployment. (anthropic.com)

### Why do Terminal-Bench and SWE-Bench Multilingual keep showing up?

Cursor identified both tests as public software-engineering benchmarks it uses to track progress. (cursor.com) In its March Composer 2 announcement, the company said Terminal-Bench 2.0 is an agent evaluation benchmark for terminal use maintained by the Laude Institute, and said its own scores were computed with the official Harbor evaluation framework. (cursor.com)

Cursor also said public benchmarks do not fully capture real developer work, which is why it built its own internal CursorBench from real coding sessions. In the technical report for Composer 2, the company said public tasks can be over-specified and narrower than the ambiguous, multi-file problems developers actually hand to coding agents. (cursor.com)

### Does this mean Cursor has caught Anthropic?

The benchmark figures show Cursor running close to Anthropic on the two cited public tests, but they do not establish across-the-board superiority. Anthropic continues to position Opus 4.7 as a premium model for professional software engineering, complex agentic workflows and enterprise tasks, while Cursor is positioning Composer as a specialized first-party coding model inside its own product. (cursor.com)

Pricing also differs. Cursor’s changelog lists Composer 2.5 at $0.50 per million input tokens and $2.50 per million output tokens on its standard tier, with a faster default option at $3.00 and $15.00 respectively. Anthropic lists Opus 4.7 starting at $5 per million input tokens and $25 per million output tokens.

### What should readers watch next?
Cursor’s next test will be whether Composer 2.5’s benchmark gains hold up in live developer use inside Cursor’s agent workflows. (anthropic.com) Anthropic’s next reference points are likely to come from additional Opus 4.7 usage data across Claude, Bedrock, Vertex AI and Microsoft Foundry, where the model is already available. (cursor.com)
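
For readers who want to check the arithmetic behind the figures quoted above, the following is a minimal Python sketch. The benchmark scores and per-million-token prices are the ones reported in this article; the helper names (`gap_points`, `cost_usd`) and the example workload of 2 million input tokens and 500,000 output tokens are illustrative assumptions, not anything published by Cursor or Anthropic. The Opus 4.7 price is treated as the quoted "starting at" list price, so real bills may differ.

```python
# Benchmark scores (percent) as quoted in the article.
BENCHMARKS = {
    "Terminal-Bench": {
        "Composer 2": 61.7,
        "Composer 2.5": 69.3,
        "Claude Opus 4.7": 69.4,
    },
    "SWE-Bench Multilingual": {
        "Composer 2": 73.7,
        "Composer 2.5": 79.8,
        "Claude Opus 4.7": 80.5,
    },
}

# List prices in USD per million tokens as quoted: (input, output).
# The Opus 4.7 entry is the "starting at" price, so it is a lower bound.
PRICES = {
    "Composer 2.5 (standard)": (0.50, 2.50),
    "Composer 2.5 (fast)": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def gap_points(scores, leader, trailer):
    """Difference in percentage points between two models on one benchmark."""
    return round(scores[leader] - scores[trailer], 1)


def cost_usd(model, input_tokens, output_tokens):
    """List-price cost of a workload for one model (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for name, scores in BENCHMARKS.items():
        behind = gap_points(scores, "Claude Opus 4.7", "Composer 2.5")
        gained = gap_points(scores, "Composer 2.5", "Composer 2")
        print(f"{name}: {behind} pts behind Opus 4.7, {gained} pts over Composer 2")

    # Hypothetical workload, chosen only to illustrate the price ratio.
    for model in PRICES:
        print(f"{model}: ${cost_usd(model, 2_000_000, 500_000):.2f}")
```

On that assumed workload, the quoted list prices work out to $2.25 for Composer 2.5 standard, $13.50 for the faster option and $22.50 for Opus 4.7, a tenfold spread between the cheapest and most expensive tier. The gaps come out at 0.1 and 0.7 percentage points behind Opus 4.7, and 7.6 and 6.1 points ahead of Composer 2.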