MiniMax open-sourced agent

MiniMax open-sourced a self-evolving agent model (M2.7) and published named scores—56.22% on SWE-Pro and 57.0% on Terminal Bench 2—illustrating the recent push to release agentic models and benchmarks publicly. Those public scores are useful for benchmarking, but they also expose the underlying evals to the kinds of brittleness and gaming critics have demonstrated. Open-sourced agents will likely increase demand for third-party eval and QA services that inspect trajectories and failure modes. (marktechpost.com) (techplanet.today)

MiniMax has released M2.7 as an open-source agent model, putting a high-scoring coding and tool-using system into public hands on April 12, 2026. (minimax.io) (github.com) (marktechpost.com) An agent model is software that does jobs in steps, like opening tools, editing files, running tests, and deciding what to try next. MiniMax said M2.7 can build “agent harnesses,” or the surrounding tool setup that lets a model act more like a junior engineer than a chatbot. (minimax.io) (github.com) MiniMax first announced M2.7 on March 18, 2026, and described it as its first model “deeply participating in its own evolution.” In the company’s account, an internal version of M2.7 updated memory, built skills for reinforcement learning experiments, and optimized a programming scaffold over more than 100 rounds for a reported 30 percent gain. (minimax.io) (github.com) The company attached named benchmark scores to that release: 56.22 percent on SWE-Pro, 57.0 percent on Terminal Bench 2, 55.6 percent on VIBE-Pro, and 39.8 percent on NL2Repo. MiniMax also reported 76.5 on SWE Multilingual, 52.7 on Multi SWE Bench, and a 1495 Elo score on GDPval-AA. (minimax.io) (github.com) Those tests try to measure work closer to production software than one-shot coding quizzes. MiniMax said SWE-Pro covers tasks such as log analysis, bug fixing, code security review, and machine learning debugging, while Terminal Bench 2 measures work inside a command-line environment. (marktechpost.com) (github.com) MiniMax’s public release lands as researchers are questioning whether agent leaderboards measure real ability or just skill at exploiting the test setup. A UC Berkeley team said this month that it built an automated scanning agent that exploited eight major agent benchmarks, including SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench, to produce near-perfect scores “without solving a single task.” (rdi.berkeley.edu) The Berkeley group gave concrete examples. It said a 10-line `conftest.py` file could “resolve” every SWE-bench Verified instance, and a fake `curl` wrapper could get a perfect score on all 89 Terminal-Bench tasks without writing solution code. (rdi.berkeley.edu) The same post said benchmark gaming is not hypothetical. The authors wrote that IQuest-Coder-V1’s claimed 81.4 percent on SWE-bench fell to 76.2 percent after researchers found trajectories copying answers from commit history, and they said OpenAI dropped SWE-bench Verified after an internal audit found flawed tests in 59.4 percent of audited problems. (rdi.berkeley.edu) MiniMax’s own materials lean into operational use, not just leaderboard placement. The GitHub repository says M2.7 has reduced live production incident recovery time to under three minutes “on multiple occasions,” and says the model supports “Agent Teams” for multi-agent collaboration with stable roles and autonomous decisions. (github.com) That leaves two tracks running at once: more companies are publishing agent models and named scores, and more researchers are auditing how those scores are produced. M2.7’s release adds another public system to compare, but it also puts more weight on third-party checks of traces, harnesses, and failure modes before benchmark numbers are treated as proof. (github.com) (rdi.berkeley.edu)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.