vLLM supports MiniMax M2.7

vLLM announced day-0 support for MiniMax’s newly open‑sourced M2.7 model, which MiniMax highlighted for strong coding and agentic capabilities and competitive benchmark scores. The two posts surfaced together, signalling immediate tooling support for the open model in inference pipelines.

vLLM said it added same-day support for MiniMax M2.7, putting the newly released open model straight into a widely used serving stack for running large language models in production. (x.com) MiniMax published M2.7 as an open model on GitHub and described it as its first model “deeply participating in its own evolution,” with support for coding, tool use, and multi-step agent workflows. (github.com, minimax.io)

In plain terms, vLLM is the software layer many teams use to serve a model behind an application programming interface, manage graphics processing unit memory, and handle many user requests at once. vLLM’s latest stable release is version 0.19.0, published April 10, 2026, and its documentation lists native and Transformers-based support paths for model families like MiniMax. (github.com, docs.vllm.ai)

MiniMax’s own materials pitch M2.7 as a sparse mixture-of-experts model, a design that stores a very large set of parameters but activates only a small slice of them for each token. NVIDIA said M2.7 has 230 billion total parameters, 10 billion active parameters per token, 256 experts, and a 200,000-token context window. (developer.nvidia.com)

MiniMax said M2.7 scored 56.22% on SWE-Pro, 57.0% on Terminal Bench 2, and 55.6% on VIBE-Pro, with a 66.6% medal rate on MLE Bench Lite. Those benchmarks cover software engineering, terminal-based task solving, coding project delivery, and machine learning competition tasks, and they are central to MiniMax’s case that M2.7 is aimed at coding and agent use rather than general chat alone. (github.com, minimax.io)

The vLLM recipe for the MiniMax M2 series already shows deployment settings for MiniMax-specific tool calling and reasoning parsers, which are the formatting rules that let a serving engine interpret a model’s tool-use and chain-of-thought style outputs. The guide says its examples use MiniMax-M2.5, but it also says users can switch the model name among MiniMaxAI/MiniMax-M2.5, MiniMaxAI/MiniMax-M2.1, and MiniMaxAI/MiniMax-M2 during deployment, and the Ascend guide explicitly covers MiniMax-M2.7. (docs.vllm.ai, docs.vllm.ai)

That immediate support matters because open-weight releases often arrive before the surrounding tooling is ready, forcing developers to wait for patches, custom loaders, or unofficial forks. Here, MiniMax’s release landed alongside official documentation from vLLM, NVIDIA, and other inference platforms, cutting the time between announcement and deployment. (docs.vllm.ai, developer.nvidia.com, ollama.com)

The hardware demands are still large. vLLM’s recipe says MiniMax M2-series deployments need about 220 gigabytes for weights and recommends four 96-gigabyte graphics processing units for about 400,000 tokens of total key-value cache, while Unsloth says the unquantized bfloat16 M2.7 checkpoint requires 457 gigabytes. (docs.vllm.ai, unsloth.ai)

MiniMax has also tied M2.7 to a broader “self-evolution” story, saying an internal version of the model optimized a programming scaffold over more than 100 rounds and produced a 30% performance improvement. That claim comes from MiniMax’s own materials, not an independent audit, and the public release will now give outside developers a chance to test how much of that benchmark and agent performance holds up in practice. (github.com, minimax.io)

The net result is simple: MiniMax opened M2.7, and vLLM made it runnable on day one, which is often the difference between a model announcement and a model people actually deploy. (x.com, github.com)
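
The hardware figures quoted above can be sanity-checked with simple arithmetic. The sketch below is illustrative only. It assumes decimal gigabytes and two bytes per parameter for bfloat16. It also assumes one byte per parameter for an 8-bit checkpoint, which is a guess rather than something the cited sources state. The function and variable names are hypothetical.

```python
# Back-of-the-envelope sizing from the figures quoted in the article.
# Assumptions (not from the sources): decimal gigabytes, and an 8-bit
# checkpoint storing exactly one byte per parameter.

TOTAL_PARAMS = 230e9    # total parameters, per NVIDIA's description
ACTIVE_PARAMS = 10e9    # parameters active per token, per NVIDIA
BYTES_PER_GB = 1e9      # decimal gigabyte


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / BYTES_PER_GB


bf16_gb = weight_gb(TOTAL_PARAMS, 2)        # bfloat16: 2 bytes per parameter
eight_bit_gb = weight_gb(TOTAL_PARAMS, 1)   # assumed 8-bit: 1 byte per parameter
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Recommended setup in the vLLM recipe: four 96-gigabyte GPUs,
# with about 220 gigabytes taken by weights.
gpu_total_gb = 4 * 96
headroom_gb = gpu_total_gb - 220

print(f"bfloat16 weights: {bf16_gb:.0f} GB")
print(f"8-bit weights (assumed): {eight_bit_gb:.0f} GB")
print(f"active share per token: {active_fraction:.1%}")
print(f"memory left after weights on 4 x 96 GB: {headroom_gb} GB")
```

Under these assumptions, bfloat16 works out to 460 GB, close to Unsloth's 457 GB figure, and one byte per parameter gives 230 GB, in the neighborhood of the recipe's 220 GB. The roughly 164 GB left across four 96 GB GPUs is an upper bound on what remains for key-value cache and runtime overhead, since not all device memory is usable in practice.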

