vllm-mlx speeds LLMs on Apple Silicon
- Wayner Barrios’s vllm-mlx project gained traction as an Apple Silicon inference server that exposes OpenAI and Anthropic APIs from one MLX-based process.
- The project says it adds continuous batching, paged key-value cache, prefix caching and multimodal serving for Llama, Qwen-VL, audio and embeddings.
- An accompanying January paper reported 21% to 87% higher text throughput than llama.cpp on Apple hardware. (arxiv.org)
Large language model serving is the software layer that keeps an artificial intelligence model answering requests; vllm-mlx aims to do that natively on Apple Silicon Macs. (github.com) Wayner Barrios describes vllm-mlx as a vLLM-style server built on Apple’s MLX stack, with OpenAI-compatible `/v1/*` endpoints and Anthropic-compatible `/v1/messages` in the same process. (github.com)

That means a Mac can present itself like a hosted model service while running models locally on Metal, Apple’s graphics and compute layer, with unified memory shared between the central processor and the graphics processor. (github.com) (arxiv.org)

The main performance technique is continuous batching, which adds incoming requests to the running batch as they arrive so the model stays busy instead of idling between users. The repository also advertises paged key-value cache, prefix caching and solid-state-drive cache tiering. (github.com) Those caches matter because chat workloads resend the same earlier context on each turn, and cached key-value state lets the server skip recomputing it.

In the January 27, 2026 paper, Barrios wrote that vllm-mlx scaled to 4.3 times aggregate throughput at 16 concurrent requests on text workloads. (arxiv.org) The same paper reported 21% to 87% higher text throughput than llama.cpp across models from Qwen3-0.6B to Nemotron-30B on Apple Silicon. It also reported up to 525 tokens per second on text models on an Apple M4 Max. (arxiv.org)

Multimodal models add another bottleneck because the same image is typically re-encoded on every turn. The paper says vllm-mlx uses content-based prefix caching so identical images can skip that work, cutting one Qwen3-VL-30B repeated-image workload from 21.7 seconds to under 1 second. (arxiv.org) (github.com)

The project’s latest GitHub release, v0.2.9, was posted in April 2026 and added an OpenAI-compatible `/v1/responses` endpoint, Prometheus metrics and a benchmarking command called `bench-serve`.
The same release also bundled a long list of server hardening changes, including local-only binding by default and stricter handling of remote media and tool execution. (github.com)

That leaves the Apple Silicon pitch looking more like local infrastructure than a hobby app: one Mac, one API surface, and support for text, vision, audio and embeddings. The open question is how far that approach scales beyond a single machine, where separate projects such as dnet are trying to spread model inference across multiple Apple devices. (github.com 1) (github.com 2)
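
The content-based caching idea the paper describes for repeated images can be illustrated with a minimal Python sketch. This is not vllm-mlx's code: the class `ContentKeyedImageCache`, the `toy_encoder` stand-in and the choice of SHA-256 as the content key are all assumptions made for illustration. The point is only that keying the cache on the image's bytes, rather than on a filename or URL, lets an identical image skip the encoder on later turns.

```python
import hashlib
from typing import Callable, Dict, List


class ContentKeyedImageCache:
    """Hypothetical cache that keys encoder output on a hash of the image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]]) -> None:
        self._encode = encode                     # the expensive encoder to avoid re-running
        self._store: Dict[str, List[float]] = {}  # content hash -> cached features
        self.hits = 0
        self.misses = 0

    def get_features(self, image_bytes: bytes) -> List[float]:
        # Identical bytes always hash to the same key, whatever their origin.
        key = hashlib.sha256(image_bytes).hexdigest()
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        # First time this content is seen: run the encoder and remember the result.
        self.misses += 1
        features = self._encode(image_bytes)
        self._store[key] = features
        return features


def toy_encoder(image_bytes: bytes) -> List[float]:
    """Stand-in for a vision encoder; it only scales the first four bytes."""
    return [b / 255.0 for b in image_bytes[:4]]


if __name__ == "__main__":
    cache = ContentKeyedImageCache(toy_encoder)
    image = bytes([0, 51, 102, 255])
    cache.get_features(image)          # first turn: encoder runs
    cache.get_features(bytes(image))   # later turn, same content: served from cache
    print(f"hits={cache.hits} misses={cache.misses}")
```

A real server would also need an eviction policy and a bound on memory, which this sketch leaves out.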