Ollama + MLX shreds Apple Silicon benchmarks

Ollama’s MLX stack is delivering high local LLM throughput on Apple Silicon: demos show Qwen3.5‑35B reaching roughly 1,851 tokens/s prefill and 134 tokens/s decode on M5‑class hardware with NVFP4, and the authors recommend 32GB+ of memory for coding agents. This positions Apple Silicon Macs as serious development machines for on‑device agent workflows. (x.com) (x.com)
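To put those throughput figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes latency is simply prompt tokens divided by the prefill rate plus output tokens divided by the decode rate; the function name and the example token counts are hypothetical, and real runs will also include model load time, cache effects and scheduling overhead.

```python
def estimate_latency_s(prompt_tokens: int,
                       output_tokens: int,
                       prefill_tps: float = 1851.0,
                       decode_tps: float = 134.0) -> float:
    """Rough end-to-end latency in seconds.

    Defaults are the demo figures quoted above (prefill and decode
    tokens per second). Assumption: the two phases run back to back
    with no other overhead.
    """
    prefill_s = prompt_tokens / prefill_tps   # time to ingest the prompt
    decode_s = output_tokens / decode_tps     # time to generate the reply
    return prefill_s + decode_s


if __name__ == "__main__":
    # Hypothetical coding-agent turn: 8,000-token context, 500-token answer.
    print(f"{estimate_latency_s(8000, 500):.1f} s")
```

Under these assumptions, an 8,000-token prompt with a 500-token reply would take roughly 8 seconds, split about evenly between prefill and decode.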

Ollama posted a technical preview on March 30, 2026 that explicitly frames the MLX-backed build as a platform preview and names the release channel used for the Apple Silicon update. (ollama.com) The preview ships as an Ollama 0.19 build, and the blog documents concrete runtime changes: upgraded caching behavior, “intelligent checkpoints,” and smarter eviction, all aimed at reducing prompt reprocessing. (ollama.com)

Apple’s MLX framework is the runtime Ollama targets to reduce memory movement on recent Apple silicon. Apple and third‑party coverage describe it as an open machine‑learning runtime that exposes unified‑memory and GPU neural accelerator primitives. (appleinsider.com) Community tooling has moved quickly: a GitHub bridge (dpalmqvist/mlx_ollama) already translates Ollama API calls into mlx_lm.generate()/mlx_lm.stream_generate() for MLX execution, a sign of ecosystem momentum for integrations. (github.com)

Independent benchmark roundups and lab posts compare MLX‑based stacks (vllm‑mlx and MLX backends) against longstanding macOS inference paths, and highlight that KV‑cache handling and unified‑memory bandwidth materially change multi‑request throughput across M‑class chips. (macgpu.com) Coverage from Ars Technica, MacRumors and 9to5Mac on March 31, 2026 emphasized the MLX integration, NVFP4 model format support for production parity, and Ollama’s stated roadmap to add other precisions and future model support. (arstechnica.com)
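The caching changes mentioned above (checkpoints plus eviction to avoid reprocessing prompts) can be illustrated with a toy Python sketch. This is not Ollama's implementation: the class, its method names and the least-recently-used policy are all assumptions chosen for illustration, and the stored "state" is a placeholder string standing in for a real KV cache.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy prefix cache with LRU eviction (hypothetical, illustrative only)."""

    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        # Maps a token prefix (as a tuple) to a saved state; order tracks recency.
        self._entries = OrderedDict()

    def checkpoint(self, tokens, state):
        """Save state for this token prefix, evicting the least recently used."""
        key = tuple(tokens)
        self._entries[key] = state
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix.

        Returns (0, None) when no checkpoint is a prefix of `tokens`.
        """
        tokens = tuple(tokens)
        best = None
        for key in self._entries:
            if len(key) <= len(tokens) and tokens[:len(key)] == key:
                if best is None or len(key) > len(best):
                    best = key
        if best is None:
            return 0, None
        self._entries.move_to_end(best)  # mark as recently used
        return len(best), self._entries[best]


if __name__ == "__main__":
    cache = PrefixCache(max_entries=2)
    cache.checkpoint([1, 2, 3], "state-after-3")
    matched, state = cache.longest_prefix([1, 2, 3, 4, 5])
    # Only the tokens after `matched` would need to be prefilled again.
    print(matched, state)
```

The idea is that a follow-up request sharing a prefix with an earlier one only pays prefill cost for the unmatched tail, which matters most for agent workflows that resend long, mostly unchanged contexts.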
