MoE models hit Apple Silicon limits

Engineers are flagging hard limits when mapping mixture‑of‑experts (MoE) models to Apple Silicon caches and accelerators, highlighting nontrivial tuning and architecture mismatches. Those optimization hurdles suggest dense, hardware‑aware model design remains critical for running advanced models on unified memory architectures (x.com).

Dan Woods’ flash-moe build ran a Mixture-of-Experts Qwen3.5-397B variant at better than 5.5 tokens/second on a 48GB MacBook Pro M3 Max while streaming expert weights from NVMe; the full model occupies ~209GB on disk and ~120GB when quantized. (simonwillison.net) (github.com) The implementation relies on the OS page cache as its caching layer rather than a custom cache manager, and streams only the K=4 active experts per layer (each ≈6.75MB) using parallel pread() calls coordinated by Grand Central Dispatch. (github.com)

Apple’s “LLM in a Flash” paper directs model authors to reduce flash transfer volume and favor larger, contiguous reads, explicitly prioritizing flash and memory-management optimizations over raw compute tuning for large-model inference on devices. (machinelearning.apple.com) (arxiv.org) The mlx-od-moe community port treats NVMe as an L3 cache on Apple Silicon and reports 70+ tokens/second for a ~375GB MoE model on a 192GB-RAM host by loading experts on demand from disk. (github.com)

Benchmarks and experiments surface practical ceilings imposed by Apple’s unified memory and macOS policies. Public analyses note that the M1 Ultra’s effective GPU “VRAM” sits well below total physical memory (a practical cap of ≈96GB is reported), and research papers call out significant memory-management overhead in multi-node MoE setups on Apple stacks. (stencel.io) (arxiv.org)

Academic and open-source follow-ups are converging on SSD I/O optimizations for MoE; proposals such as FlashMoE study ML-driven cache replacement and other strategies to reduce SSD I/O bottlenecks for edge MoE inference. (dblp.org) Multiple community implementations explicitly credit Apple’s LLM-in-a-Flash techniques as the foundation for on-demand expert streaming, and quantify the tradeoffs between SSD throughput, OS page cache behavior, and accelerator offload when mapping MoE to Apple Silicon. (github.com) (machinelearning.apple.com)
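The expert-streaming pattern attributed to flash-moe above (positional reads of only the active experts, issued in parallel, with the OS page cache as the only cache) can be illustrated with a minimal Python sketch. This is not flash-moe’s code: a thread pool stands in for Grand Central Dispatch, and the file layout (fixed-size expert blobs stored back to back), the function names, and the tiny 4096-byte demo expert size are all assumptions made for illustration. `os.pread` is available on Unix-like systems, including macOS.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Hypothetical demo size; the article cites roughly 6.75MB per expert.
EXPERT_BYTES = 4096


def read_expert(fd: int, expert_id: int, expert_bytes: int = EXPERT_BYTES) -> bytes:
    """Read one expert's blob with positional reads.

    Assumes (hypothetically) that experts are fixed-size and stored
    contiguously, so expert i starts at byte offset i * expert_bytes.
    pread does not touch the shared file offset, so threads can share fd.
    """
    offset = expert_id * expert_bytes
    remaining = expert_bytes
    chunks = []
    while remaining > 0:
        chunk = os.pread(fd, remaining, offset)  # may return fewer bytes than asked
        if not chunk:
            raise EOFError(f"expert {expert_id} is truncated")
        chunks.append(chunk)
        offset += len(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def load_active_experts(fd: int, expert_ids: list, expert_bytes: int = EXPERT_BYTES) -> dict:
    """Fetch only the routed experts in parallel.

    No application-level cache is kept: repeated reads of a hot expert are
    served by the OS page cache if the pages are still resident.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(expert_ids))) as pool:
        blobs = pool.map(lambda e: read_expert(fd, e, expert_bytes), expert_ids)
        return dict(zip(expert_ids, blobs))


def streamed_bytes_per_token(num_layers: int, k: int, expert_bytes: int) -> int:
    """Upper bound on bytes read per token if every expert read misses the page cache."""
    return num_layers * k * expert_bytes
```

A caller would open the weights file once with `os.open(path, os.O_RDONLY)`, then call `load_active_experts(fd, routed_ids)` for each layer with the K expert indices chosen by the router. `streamed_bytes_per_token` gives a rough worst-case I/O budget; the layer count is left as a parameter because the source does not state it.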
