New model bench gains and open-source moves

Recent social posts report that Alibaba's Qwen3.6-Plus leads SWE-focused benchmarks (SWE-Bench Pro 56.6%, Terminal-Bench 2.0 61.6%), while MiniMax M2.7 was released with open weights and posts strong coding numbers; Epoch AI's MirrorCode benchmark also tracked Claude Opus 4.6 reimplementing a 16k-line bioinformatics toolkit. These posts highlight both capability gains and the emergence of open-source contenders on coding tasks. (x.com) (x.com) (x.com)

Coding benchmarks shifted again in early April, with Alibaba’s Qwen3.6-Plus posting 61.6% on Terminal-Bench 2.0 and 56.6% on SWE-Bench Pro. (qwen.ai) Those tests measure two different kinds of software work: Terminal-Bench asks models to complete tasks in a live command-line environment, while SWE-Bench Pro scores whether they can solve harder, longer-horizon repository issues. Scale AI says SWE-Bench Pro was built to reduce contamination and to reflect more realistic enterprise-style software problems. (tbench.ai) (labs.scale.com)

Alibaba published Qwen3.6-Plus on April 1 and said the model is hosted through Alibaba Cloud Model Studio with a 1 million-token context window. In Alibaba’s table, Claude Opus 4.5 still led on SWE-Bench Verified at 80.9% to Qwen’s 78.8%, even as Qwen led Terminal-Bench 2.0. (qwen.ai) (modelstudio.alibabacloud.com)

A second move came from MiniMax, which released M2.7 on March 18 and said it scored 56.22% on SWE-Pro, 57.0% on Terminal Bench 2, and 1495 Elo on GDPval-AA. On April 11, Nvidia said the open-weights release was available through Nvidia and the broader open-source inference ecosystem. (minimax.io) (developer.nvidia.com)

That changed the mix in coding models because Qwen’s strongest numbers came from a hosted product, while MiniMax put weights on Hugging Face and GitHub. MiniMax’s Hugging Face page describes M2.7 as an open model and repeats the 1495 GDPval-AA score and 97% skill-compliance figure. (huggingface.co) (github.com)

Another benchmark released on April 10 focused less on bug fixing and more on rebuilding software from behavior alone. Epoch AI and Model Evaluation and Threat Research said MirrorCode gives agents execute-only access to a program, visible tests, and documentation, but no source code or internet access. (epoch.ai) (metr.org) In that setup, Epoch AI said Claude Opus 4.6 fully reimplemented gotree, a Go bioinformatics command-line toolkit with about 16,000 lines of code and more than 40 commands. Epoch AI said a human engineer might need 2 to 17 weeks for a comparable task, while also warning that oracle-style tests and imperfect memorization defenses limit how directly the result maps to everyday software jobs. (epoch.ai) (metr.org)

The benchmark owners are also drawing lines around what the numbers mean. Scale says SWE-Bench Pro was designed around contamination, task diversity, ambiguous real-world issues, and reproducible testing, while Epoch says MirrorCode still uses a narrow specification and a private test set is being held back as the benchmark is open-sourced. (labs.scale.com) (epoch.ai)

Taken together, the April releases show three separate tracks moving at once: stronger hosted coding agents, stronger open-weight rivals, and newer tests aimed at longer software tasks than a single patch. The next comparison will depend less on one headline score than on which models can keep reproducing these results across public leaderboards and held-out evaluations. (qwen.ai) (huggingface.co) (epoch.ai)
