400B model on a 48GB MacBook

A developer posted an experiment running a ~400B-parameter model locally on a 48GB MacBook by streaming weights from SSD into RAM, achieving roughly 1 token/sec: a proof of concept for running very large models on personal hardware. The experiment cites techniques inspired by Apple's "LLM in a Flash" work and shows how storage-backed strategies can extend what edge hardware is able to run (x.com).

The public repo "flash-moe" (author danveloper) documents running the Qwen3.5-397B-A17B Mixture-of-Experts model by streaming the full 209GB on-disk weight set through a MacBook pipeline. (github.com) Benchmarks in that repo report 4.36 tokens/sec using a 4-bit experts FMA kernel and list full tool-calling output as part of the evaluation. (github.com) The implementation is written in C/Objective-C with hand-tuned Metal shaders and a custom Metal compute pipeline that streams parameters from SSD into DRAM slices; the repo notes describe it explicitly as "no Python, no frameworks". (github.com)

Because Qwen3.5-397B is a MoE model, inference requires routing each token and loading the selected experts' parameter sets on demand, which increases peak storage and I/O complexity compared with dense models; the repo documents per-expert streaming strategies. (github.com)

Apple's "LLM in a Flash" paper (authors include Karen Khatamifard, Minsik Cho, Carlo C. Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar) describes windowing and row-column bundling to minimize flash-to-DRAM transfers and optimize throughput for flash-backed inference. (machinelearning.apple.com) The experiment's engineering choices (contiguous large reads from SSD, compute offload to Metal, and quantized expert kernels) track the cost model and transfer-minimization techniques Apple recommends for DRAM-limited devices, demonstrating a concrete engineering path for storage-backed LLM inference on Apple Silicon. (github.com)
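To make the idea of per-expert streaming concrete, here is a minimal Python sketch. It is not the repo's code (which is C/Objective-C with Metal), and every name in it (`ExpertStreamer`, `demo`, the flat one-blob-per-expert file layout, the cache size) is a hypothetical chosen for illustration. It also uses a plain LRU cache as a stand-in for the windowing scheme described in the Apple paper. The point is only to show the general pattern: an expert is fetched from disk with one contiguous read when the router selects it, and recently used experts stay in RAM so repeat selections cost no I/O.

```python
import os
import tempfile
from collections import OrderedDict


class ExpertStreamer:
    """Hypothetical sketch: hold at most `capacity` experts in RAM, read the rest from disk."""

    def __init__(self, path, num_experts, expert_bytes, capacity):
        self.fd = os.open(path, os.O_RDONLY)
        self.num_experts = num_experts
        self.expert_bytes = expert_bytes   # assumed: every expert blob has the same size
        self.capacity = capacity
        self.cache = OrderedDict()         # expert id -> bytes, least recently used first
        self.disk_reads = 0

    def get(self, expert_id):
        if not 0 <= expert_id < self.num_experts:
            raise IndexError(expert_id)
        if expert_id in self.cache:
            # Cache hit: mark as most recently used, no disk access.
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Cache miss: one contiguous read covering the whole expert.
        # (A robust version would loop in case of a short read.)
        offset = expert_id * self.expert_bytes
        blob = os.pread(self.fd, self.expert_bytes, offset)
        self.disk_reads += 1
        self.cache[expert_id] = blob
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return blob

    def close(self):
        os.close(self.fd)


def demo():
    """Write a toy weight file, replay a made-up routing trace, return the disk read count."""
    num_experts, expert_bytes = 8, 4096
    with tempfile.NamedTemporaryFile(delete=False) as f:
        for e in range(num_experts):
            f.write(bytes([e]) * expert_bytes)  # expert e is filled with the byte value e
        path = f.name
    streamer = ExpertStreamer(path, num_experts, expert_bytes, capacity=2)
    try:
        for e in [3, 5, 3, 7, 3]:               # hypothetical router decisions
            blob = streamer.get(e)
            assert len(blob) == expert_bytes and blob[0] == e
        return streamer.disk_reads
    finally:
        streamer.close()
        os.remove(path)


if __name__ == "__main__":
    print("disk reads:", demo())
```

In this toy trace, five expert lookups trigger only three disk reads, because expert 3 stays resident between uses. A real pipeline would additionally dequantize the blob and hand it to the GPU; those steps are omitted here.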
