M5 Mac ML demos accelerate
Developers posted new M5 ML wins: Rapid‑MLX hitting ~108 tok/s on 9B models via DeltaNet snapshots, demos of 397B models at 10+ tok/s on M5 Max, and menu‑bar inference tools using SSD KV‑offload—signals that Apple Silicon tooling for edge models is moving fast. Apple’s MPP matmul APIs were also highlighted as enabling tighter hardware‑software co‑design for on‑device ML. (x.com) (x.com) (x.com) (x.com)
Rapid‑MLX’s public repo documents use of DeltaNet “state snapshots” plus speculative decoding and prompt caching to accelerate Apple‑silicon inference and reports a head‑to‑head Qwen3.5‑9B number of 79 tok/s vs Ollama’s 33 tok/s in its README.. The Rapid‑MLX project explicitly says DeltaNet snapshots enable prompt‑cache‑style speedups on hybrid RNN/DeltaNet architectures by storing intermediate state, and the repo lists speculative decoding and tool‑logits biasing as complementary optimizations.. Apple’s MLX research post shows MLX using M5’s Neural Accelerators and reports time‑to‑first‑token (TTFT) under 10 seconds for a dense 14B model and under 3 seconds for a 30B model on M5 hardware.. Flash‑MoE’s C/Metal engine demonstrates Qwen3.5‑397B running on a MacBook Pro with 48 GB RAM at about 4.4 tokens/sec while streaming the full 209 GB model from SSD via a custom Metal pipeline.. oMLX’s GitHub describes a menu‑bar LLM server that exposes continuous batching and tiered SSD KV caching for macOS, and community coverage highlights the same SSD‑backed KV offload approach for multi‑model local serving.. Apple’s MPP (Metal Performance Primitives) documentation includes a dedicated GEMM/matmul chapter and mpp::tensor_ops primitives for building threadgroup‑ and SIMD‑scoped matrix kernels, guidance that developers cite when hand‑tuning matmul for Apple GPUs.. M5 Max hardware specs being used in community benchmarks include configurations with up to 128 GB unified memory and measured bandwidth figures around 614 GB/s in early tests, while Qwen3.5’s MoE design only activates roughly 17B parameters per token—both facts the community cites when explaining how very large models can be streamed or pruned to run on Mac hardware..