Ollama supercharges Apple Silicon
Ollama’s new MLX backend is dramatically speeding local LLM inference on Apple Silicon by using unified memory and optimized quantization—benchmarks show M5 chips hitting ~1,851 tokens/s prefill and 134 tokens/s decode in early tests. That turns Macs into viable local AI servers with persistent KV caching and OpenAI‑compatible APIs, making privacy‑first on‑device agents more practical for production workflows. (appleinsider.com)
Ollama published a preview post for version 0.19 on March 30, 2026 and says its March 29 test used Alibaba’s Qwen3.5-35B-A3B quantized to NVFP4 and compared against a Q4_K_M baseline. (ollama.com) Ollama’s own charts imply roughly a 57% prefill and a 93% decode speed uplift versus 0.18 when measured on the March 29 benchmark data. (ollama.com) The 0.19 preview adds NVFP4 support to match NVIDIA-optimized inference pipelines and states NVFP4 preserves higher model quality while lowering bandwidth and storage for inference. (ollama.com) MLX is Apple’s open-source array framework tuned for unified memory and the new GPU Neural Accelerators (announced via WWDC 2025 and hosted on Apple Open Source), which Ollama leverages for zero-copy handoffs and accelerated matrix work on M5-class chips. (github.com) Ollama 0.19 upgrades its cache subsystem to reuse KV state across conversations, add “intelligent checkpoints,” and extend eviction policies to keep shared prefixes longer, per the 0.19 blog notes. (ollama.com) K/V cache quantization has been discussed in the community and can be toggled in practice via environment options documented by contributors (for example OLLAMA_KV_CACHE_TYPE), enabling lower memory footprints for extended contexts. (mitjamartini.com) The Ollama project added OpenAI Chat Completions API compatibility on February 8, 2024, and MacRumors notes the 0.19 preview download requires a Mac with more than 32GB of unified memory and currently focuses initial acceleration on Qwen3.5. (ollama.com)