Agentic coding reports and benchmarks
Claude Opus 4.7 reportedly scored 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified, and Modular published a report on frontier coding agents attempting to rebuild a text-to-video pipeline from scratch. (x.com 1) (x.com 2) These posts highlight improvements in models’ ability to handle medium-scope software tasks and to serve as agentic building blocks in pipelines. (x.com)
A coding benchmark is a test in which an artificial intelligence system gets a real bug report, edits a real codebase, and must pass the tests that prove the fix works. New reports this week said those scores are rising on harder software tasks. (swebench.com) (anthropic.com)
SWE-bench uses GitHub issues and their actual fixes to measure whether a model can produce a patch that resolves the problem. The benchmark runs the repository’s tests in Docker and counts how many issues were actually fixed. (swebench.com) (openai.com)
The main public variants differ in difficulty and curation. SWE-bench Verified is a 500-task subset that engineers checked as solvable, while the original full set has 2,294 instances. (swebench.com 1) (swebench.com 2) SWE-bench Pro is a separate long-horizon benchmark built to test harder software engineering work than the original benchmark. Its GitHub page describes it as a dataset where a model gets a codebase and issue, then must generate a patch that resolves the problem. (github.com)
Anthropic said on April 16 that Claude Opus 4.7 scored 64.3% on SWE-bench Pro and 87.6% on SWE-bench Verified. Anthropic’s release also said the model is available across Claude products and its application programming interface at the same listed price as Opus 4.6. (anthropic.com)
The public SWE-bench Verified leaderboard showed earlier top entries in February at 76.8% for Claude 4.5 Opus and 75.6% for Claude Opus 4.6 under the mini-SWE-agent harness. Anthropic’s 87.6% figure is a reported result, not a score currently shown on that public leaderboard page. (swebench.com) (anthropic.com)
A separate report from Modular moved from benchmark scores to a concrete engineering task. In a post published April 16, Modular said it asked five frontier models to rebuild the full Wan 2.1 text-to-video inference pipeline on its MAX stack in 20 hours, without PyTorch or diffusers in the final submission. (modular.com)
Text-to-video is software that turns a written prompt into a short generated clip. Modular said Wan 2.1 is a 1.3-billion-parameter video diffusion model, and that the task required rebuilding text encoding, a 30-layer denoiser, video decoding, and scheduling into one working pipeline. (modular.com)
Modular said two of the five agents produced a working MAX pipeline. The company said hidden workloads had to clear a 25-decibel peak signal-to-noise ratio threshold before speed was measured, and that final submissions could not rely on PyTorch, vLLM, transformers, or diffusers. (modular.com)
The through line in both reports is that the tests are shifting from short code edits toward multi-step jobs that span several subsystems. SWE-bench’s own history shows that early baselines scored 1.96% in October 2023, and its first agent-based system reached 12.47% before newer agent setups pushed much higher. (swebench.com 1) (swebench.com 2)
The next question is whether more of these reported gains show up in shared leaderboards and outside company-run case studies. For now, the newest numbers point to models spending less time on toy fixes and more time on medium-scope software work that has to run end to end. (swebench.com) (modular.com)
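For readers unfamiliar with the 25-decibel quality gate in the Modular report, here is a minimal sketch of what a peak signal-to-noise ratio check can look like. It assumes the textbook definition, PSNR = 10 · log10(peak² / mean squared error), applied to flat lists of pixel values scaled to the range 0 to 1. The report as summarized here does not say how Modular computed the metric, so the function names, the input layout, and the peak value are all illustrative assumptions, not Modular’s harness.

```python
import math
from typing import Sequence


def psnr_db(reference: Sequence[float], candidate: Sequence[float],
            peak: float = 1.0) -> float:
    """Textbook PSNR in decibels between two equal-length signals.

    Hypothetical helper: `peak` is the largest possible sample value
    (1.0 here because the inputs are assumed to be scaled to [0, 1]).
    """
    if len(reference) != len(candidate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between the reference and the candidate output.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(peak ** 2 / mse)


def passes_quality_gate(reference: Sequence[float], candidate: Sequence[float],
                        threshold_db: float = 25.0, peak: float = 1.0) -> bool:
    """Return True when the candidate meets the assumed PSNR threshold."""
    return psnr_db(reference, candidate, peak) >= threshold_db


if __name__ == "__main__":
    reference = [0.0, 0.5, 1.0, 0.25]
    close = [0.01, 0.51, 1.01, 0.26]  # every sample off by 0.01
    far = [0.1, 0.6, 1.1, 0.35]       # every sample off by 0.1
    print(round(psnr_db(reference, close), 2), passes_quality_gate(reference, close))
    print(round(psnr_db(reference, far), 2), passes_quality_gate(reference, far))
```

Under these assumptions, an output that is off by 0.01 per sample lands near 40 decibels and clears the gate, while one that is off by 0.1 per sample lands near 20 decibels and does not. A real harness would more likely compare decoded video frames than short lists.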