vLLM Adds Mistral Small 4 Support

vLLM added day-0 support for Mistral Small 4, a 119B Mixture-of-Experts model with 256K context and multimodal capabilities, making it straightforward to serve large reasoning and coding models in production stacks. That expands model choices for unified instruct, reasoning, and code assistance in enterprise deployments.

vLLM's Mixture-of-Experts runtime uses a FusedMoE layer that performs expert routing, parallel execution, and result aggregation inside the model executor. (docs.vllm.ai) Mistral Small 4 ships with 119B parameters, 128 experts, 4 active experts per token, and a 256K context window in the published specification. (mistral.ai) Per-token activations for Small 4 are reported at roughly 6–6.5B active parameters, a characteristic that drives higher peak activation memory and cross-GPU communication on inference nodes. (baristalabs.io)

vLLM's runtime provides features targeting large-context and MoE workloads: its "chunked prefill" plus prioritized decode batching reduces memory pressure during long prefills, while documented attention/paging optimizations lower GPU memory footprints for very large contexts. (docs.vllm.ai) Community and vendor playbooks for vLLM recommend combining tensor, data, pipeline, and expert parallelism (TP/DP/PP/EP) to distribute 128 experts across GPUs and to control per-GPU memory and interconnect bandwidth in production MoE deployments. (rocm.blogs.amd.com)

Mistral's deployment docs list vLLM as a recommended self-deploy target with concrete model-format and tokenizer guidance for loading weights from Hugging Face. The Hugging Face model page for Small 4 reports up to a 40% end-to-end latency reduction and up to 3x throughput in throughput-optimized setups versus the prior Small 3 baseline. (docs.mistral.ai)
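
To make the routing, execution, and aggregation steps concrete, here is a minimal pure-Python sketch of top-4-of-128 expert routing for a single token, plus an even split of experts across expert-parallel ranks. It is an illustration only, not vLLM's FusedMoE implementation, which fuses these steps into GPU kernels. The function names (`route`, `moe_forward`, `experts_per_gpu`), the softmax-then-renormalize weighting, and the contiguous placement are all assumptions made for the example; only the 128-expert and 4-active figures come from the specification cited above.

```python
import math
import random

NUM_EXPERTS = 128  # from the published Small 4 specification cited above
TOP_K = 4          # active experts per token, same source


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts and renormalize their weights to sum to 1.

    Assumption: softmax followed by renormalization over the selected
    experts; real models may weight experts differently.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(token, router_logits, experts, top_k=TOP_K):
    """Run only the selected experts and return their weighted sum."""
    out = [0.0] * len(token)
    for expert_id, weight in route(router_logits, top_k):
        y = experts[expert_id](token)
        out = [o + weight * v for o, v in zip(out, y)]
    return out


def experts_per_gpu(num_experts, ep_size):
    """Hypothetical contiguous placement of experts over ep_size ranks."""
    if num_experts % ep_size != 0:
        raise ValueError("this sketch assumes an even split")
    per_rank = num_experts // ep_size
    return {
        rank: list(range(rank * per_rank, (rank + 1) * per_rank))
        for rank in range(ep_size)
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy experts: expert i simply scales its input by (i + 1).
    experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    print(route(logits))                           # four (expert_id, weight) pairs
    print(moe_forward([1.0, 2.0], logits, experts))
    print(len(experts_per_gpu(NUM_EXPERTS, 8)[0]))  # 16 experts on rank 0
```

With 8 expert-parallel ranks, each rank in this sketch holds 16 of the 128 experts, so a token whose four selected experts sit on different ranks needs its activations sent to up to four GPUs. That is the cross-GPU communication cost the parallelism playbooks aim to control.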
