Opus 4.7 posts coding gains

Benchmark results released with Claude Opus 4.7 show gains on two coding evaluation suites: about 11 percentage points on SWE-Bench Pro and about 7 percentage points on SWE-Bench Verified, a sign of continued progress in AI coding assistance. The numbers were shared on social media alongside commentary about how these models affect developer workflows. (x.com)

Software engineering benchmarks are tests where an AI has to fix real bugs in real codebases, not just answer coding questions. Anthropic’s Claude Opus 4.7 posted higher scores on two of the field’s best-known suites when it launched on April 16. (anthropic.com)

Anthropic said Opus 4.7 reached 64.3% on SWE-Bench Pro and 87.6% on SWE-Bench Verified. That is up from 53.4% and 80.8% for Opus 4.6, gains of 10.9 and 6.8 percentage points, respectively. (anthropic.com)

SWE-Bench Verified is a 500-task, human-validated subset of the original SWE-Bench benchmark. SWE-Bench Pro is a newer 1,865-task benchmark built to be harder and more resistant to training-data leakage, with Scale AI reporting sub-25% scores for widely used models under its own unified scaffold. (openai.com, scale.com, scaleapi.github.io)

Those numbers are part of a race to measure “agentic” coding, where a model reads a bug report, inspects files, edits code, runs tests, and tries again. Anthropic said Opus 4.7 is tuned for long-running software tasks, automations, and continuous integration and delivery workflows rather than short code snippets alone. (anthropic.com, anthropic.com)

Anthropic kept Opus 4.7 at the same listed price as Opus 4.6: $5 per million input tokens and $25 per million output tokens. The company also said the model is generally available through the Claude API, and GitHub said it began rolling out Opus 4.7 in GitHub Copilot on April 16. (anthropic.com, support.claude.com, github.blog)

Benchmark scores in coding are heavily shaped by the test setup, including the scaffold, tool access, and how many attempts a model gets. SWE-Bench’s own site lets users compare scores by agent setup, and Scale’s Pro benchmark reports much lower absolute scores under its standardized framework than vendor-run evaluations often show. (swebench.com, scaleapi.github.io)

Anthropic paired the coding claims with product changes aimed at longer jobs. Its release said Opus 4.7 adds an “xhigh” effort setting, higher-resolution vision, and stronger file-system memory across multi-session agent work. (anthropic.com, anthropic.com)

The company also drew a line between Opus 4.7 and its more restricted Mythos Preview model. Anthropic told CNBC that Opus 4.7 is its most powerful generally available model, but “less broadly capable” than Mythos Preview, which remains more tightly controlled. (cnbc.com)

For developers, the immediate question is not whether 64.3% or 87.6% is the final word. It is whether the newer model misses fewer steps on real tickets, and Anthropic, GitHub, and users posting early reactions are all framing Opus 4.7 as a model built to stay on task longer before handing work back. (anthropic.com, github.blog, x.com)
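
For readers who want to check the arithmetic, the short Python sketch below recomputes the reported score gains and estimates what a job would cost at the listed prices. It separates percentage-point gains (the figures quoted above) from relative gains, which are larger and easy to confuse with them. The function names and the token counts in the usage note are illustrative assumptions, not anything published by Anthropic.

```python
# Scores as reported by Anthropic, in percent of tasks resolved:
# (Opus 4.6, Opus 4.7)
SCORES = {
    "SWE-Bench Pro": (53.4, 64.3),
    "SWE-Bench Verified": (80.8, 87.6),
}

# Listed prices in US dollars per million tokens.
INPUT_PRICE = 5.0
OUTPUT_PRICE = 25.0


def point_gain(old, new):
    """Absolute gain in percentage points."""
    return round(new - old, 1)


def relative_gain(old, new):
    """Gain expressed as a percent of the old score."""
    return round((new - old) / old * 100, 1)


def job_cost(input_tokens, output_tokens):
    """Estimated cost in dollars for one job at the listed prices."""
    cost = input_tokens / 1e6 * INPUT_PRICE + output_tokens / 1e6 * OUTPUT_PRICE
    return round(cost, 2)


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        print(f"{name}: +{point_gain(old, new)} points, "
              f"+{relative_gain(old, new)}% relative")
```

As a usage example, a hypothetical agent run that consumes 2 million input tokens and 200,000 output tokens would come to `job_cost(2_000_000, 200_000)`, or $15 at the listed prices. Actual token use varies widely by task and setup.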
