Rapid-MLX beats Ollama on Apple silicon
- Rapid-MLX, a new local AI server for Macs, is gaining attention after posting Apple-silicon benchmarks that show it outrunning Ollama head-to-head.
- The headline claims are 4.2× faster performance than Ollama and a 0.08-second cached time-to-first-token (TTFT). The project already has about 1.2k GitHub stars.
- It matters because Ollama has just moved to MLX too, so Rapid-MLX now has to prove that engine design, not framework choice, is the real edge.
Local AI on Macs is turning into an engine fight. Not a model fight: an engine fight. The models are often the same. The hardware is the same. What changes is the software layer that loads weights, manages memory, caches prompts, and feeds tokens back fast enough to feel instant. That is why Rapid-MLX is getting attention right now: it says that layer can make a huge difference on Apple silicon, even against Ollama, the default local stack for a lot of Mac users.

### What is Rapid-MLX?

Rapid-MLX is an open-source inference server built for Apple silicon. It presents itself as a drop-in OpenAI-compatible local backend, so tools like Cursor, Claude Code, Aider, and other coding assistants can point at it without a full integration rewrite. The repo is active, the PyPI package is shipping, and the GitHub project sits at around 1.2k stars as of May 5, 2026.

### What is the big claim?

The project’s own README says Rapid-MLX is “2-4x faster” than Ollama on Apple silicon, with one headline benchmark framed as 4.2× faster than Ollama and a cached time-to-first-token (TTFT) of 0.08 seconds. It also claims full tool-calling support, 17 tool parsers, prompt caching, and routing features that make it more than just a barebones benchmark toy. The important caveat is that these are the project’s own numbers, and there is no independent consensus on them yet.

### Why is Apple silicon such a special case?

Because Apple’s stack is weird in a good way. MLX is designed around unified memory, so the CPU and GPU are not constantly copying giant model tensors back and forth. Apple has also been adding more specialized acceleration paths, including the M5 family’s GPU Neural Accelerators for matrix-heavy inference work. That means the Mac does not behave like a conventional machine with separate CPU and GPU memory, and inference engines that fit that shape better can pull ahead fast.

### Didn’t Ollama already get much faster?

Yes, and that is what makes this interesting. On March 30, 2026, Ollama announced a preview backend powered by MLX on Apple silicon.
In its own testing on Qwen3.5-35B-A3B, Ollama showed prefill rising from 1154 to 1810 tokens per second and decode from 58 to 112 tokens per second on M5 hardware, with even higher numbers in int4 mode. So Rapid-MLX is not beaten; it is claiming an edge