Docker Brings High-Throughput LLMs to Mac
In a major win for local AI development, Docker Model Runner now supports vllm-metal, a backend for high-throughput LLM inference on Apple Silicon. This allows developers to serve MLX models on a Mac with the same efficiency as on an Nvidia-based Linux system, reinforcing the Mac as a primary platform for professional ML workflows.
The vllm-metal backend is a plugin for the high-throughput inference engine vLLM, developed collaboratively by Docker and the vLLM project. It is designed to bring high-performance LLM inference to Apple Silicon by unifying Apple's MLX framework and PyTorch, allowing it to plug directly into vLLM's existing engine and scheduler. This project, including contributions from Lik Xun Yuan, Ricky Chen, and Ranran Haoran Zhang, has been open-sourced and contributed to the vLLM community. This integration leverages Apple Silicon's unified memory architecture, enabling zero-copy tensor operations where the GPU can directly access system RAM. This avoids the data transfer bottleneck between separate CPU and GPU memory pools common in other systems. Paired with optimizations like PagedAttention for KV cache management, it allows for serving longer sequences with less memory. Apple's MLX, a framework for machine learning on Apple silicon, provides the foundation for vllm-metal. Released in late 2023 by Apple's machine learning research, MLX features a NumPy-like Python API, lazy computation, and support for running operations on both the CPU and GPU without data transfers. The framework is designed for efficiency and flexibility on the unified memory of M-series chips. The result significantly lowers the cost of entry for high-throughput LLM development, which has traditionally required expensive NVIDIA GPUs like the RTX 4090 or A100/H100 cards. Now, a base Mac Mini can serve as a viable development environment that mirrors production setups using the same OpenAI-compatible API. This aligns with a broader trend of on-device AI, where models run locally for improved privacy, latency, and offline capability. While benchmarks indicate that llama.cpp is approximately 1.2 times faster than vllm-metal for certain configurations, the vllm-metal integration represents a strategic move to establish the Mac as a primary platform for professional ML workflows. An independent project, vllm-mlx, has shown throughput gains of 21% to 87% over llama.cpp, highlighting the performance potential of combining vLLM's serving capabilities with the MLX framework on Apple hardware.