vLLM adds MiniMax M2.7 support
vLLM announced day‑one support for the newly open‑sourced MiniMax M2.7 model, which is described as an agent‑oriented model for multi‑agent orchestration and coding tasks. A demo showed MiniMax M2.7 running in BF16 on 4× DGX Sparks with a 200k context window, and follow‑up benchmark tweets reported prefill and decode throughput numbers for production scaling. (x.com/vllm_project/status/2043133899593920877) (x.com/TheAhmadOsman/status/2043454763967152357) (x.com/TheAhmadOsman/status/2043517424528433385)
vLLM said it added same-day support for MiniMax M2.7, letting developers serve the new model through an inference engine already used in production. (docs.vllm.ai) Inference engines are the software layer that keeps large models running efficiently on chips, handling memory, batching, and token generation so one model can serve many requests at once. vLLM says its stack is built around techniques such as paged attention, continuous batching, and an OpenAI-compatible server interface. (pypi.org)
MiniMax M2.7 is a sparse mixture-of-experts model, a design that holds a large total parameter count but activates only a small slice of it for each token. NVIDIA’s technical write-up lists M2.7 at 230 billion total parameters, 10 billion active parameters per token (about 4% of the total), 256 experts, and a 200,000-token context window. (developer.nvidia.com) MiniMax’s GitHub page says M2.7 is aimed at “agent teams,” coding, and tool use, with benchmark claims including 56.22% on SWE-Pro and 57.0% on Terminal Bench 2. The company also says an internal version of the model improved a programming scaffold over more than 100 rounds during development. (github.com)
Pairing the release with serving support matters because a model without it can sit idle for days or weeks while infrastructure teams add kernels, parsers, and deployment recipes. NVIDIA said it worked with the open-source inference ecosystem, including vLLM and SGLang, to add optimizations for the MiniMax M2 family. (developer.nvidia.com) vLLM’s published recipe shows what “support” means in practice: a Docker launch path, tensor parallel settings, tool-call parsing for MiniMax M2, and a reasoning parser tuned for the model family. The same guide lists memory needs of about 220 gigabytes for weights and 240 gigabytes per 1 million context tokens, with a note that individual sequences top out around 196,000 tokens in the documented setup. (docs.vllm.ai)
MiniMax and NVIDIA have both framed M2.7 as an “open” release, but the license is narrower than the term usually implies in software. The GitHub license says commercial use is prohibited without prior written authorization from MiniMax and requires the label “Built with MiniMax M2.7” for commercial deployments. (github.com) That leaves vLLM’s support as a technical green light, not a blanket business green light. Developers can now run M2.7 more easily on supported hardware, but companies still need to read the license before turning that same-day integration into a product. (docs.vllm.ai)
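For teams sizing hardware before they get to the license question, the memory figures quoted from vLLM’s recipe can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the function names are hypothetical, and it assumes context memory scales linearly with token count and ignores activations, framework overhead, and fragmentation, none of which the article quantifies.

```python
# Figures as quoted from vLLM's recipe in the article above.
WEIGHTS_GB = 220                      # approximate memory for model weights
CONTEXT_GB_PER_MILLION_TOKENS = 240   # approximate memory per 1M context tokens
MAX_SEQUENCE_TOKENS = 196_000         # per-sequence ceiling in the documented setup


def estimate_memory_gb(total_context_tokens: int) -> float:
    """Rough total memory: weights plus context memory (assumed linear in tokens)."""
    if total_context_tokens < 0:
        raise ValueError("total_context_tokens must be non-negative")
    context_gb = CONTEXT_GB_PER_MILLION_TOKENS * total_context_tokens / 1_000_000
    return WEIGHTS_GB + context_gb


def max_context_tokens(total_memory_gb: float) -> int:
    """Context tokens that fit once weights are loaded; 0 if the weights do not fit."""
    free_gb = total_memory_gb - WEIGHTS_GB
    if free_gb <= 0:
        return 0
    return int(free_gb / CONTEXT_GB_PER_MILLION_TOKENS * 1_000_000)


def max_full_length_sequences(total_memory_gb: float) -> int:
    """How many maximum-length sequences could be held in memory at once."""
    return max_context_tokens(total_memory_gb) // MAX_SEQUENCE_TOKENS


if __name__ == "__main__":
    print(f"One full-length sequence: ~{estimate_memory_gb(MAX_SEQUENCE_TOKENS):.0f} GB")
```

Under these assumptions, a single sequence at the documented 196,000-token ceiling needs roughly 267 GB in total, most of it weights. Real deployments should leave headroom beyond this estimate.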