400B model on a 48GB MacBook

A developer posted an experiment running a ~400B-parameter model locally on a 48GB MacBook by streaming weights from SSD into RAM, reaching roughly 1 token/sec. It is a proof of concept for running very large models on personal hardware. The experiment cites techniques inspired by Apple's "LLM in a Flash" work and suggests that storage-backed inference can extend what edge devices are able to run (x.com).
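A quick back-of-envelope calculation shows why the weights have to be streamed rather than held in memory. The sketch below assumes 4-bit weights and a hypothetical 5 GB read per token at 5 GB/s; neither figure comes from the post, and the function names are illustrative only.

```python
GB = 1e9  # decimal gigabytes

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate raw weight size in bytes (ignores scales and metadata)."""
    return n_params * bits_per_weight / 8

def max_tokens_per_sec(bytes_read_per_token: float, ssd_bytes_per_sec: float) -> float:
    """Upper bound on decode speed if SSD reads were the only bottleneck."""
    return ssd_bytes_per_sec / bytes_read_per_token

# ~400B parameters at an ASSUMED 4 bits per weight
total = weight_bytes(400e9, 4)
ram = 48 * GB
print(f"weights: {total / GB:.0f} GB vs RAM: {ram / GB:.0f} GB")
print(f"share of weights that could sit in RAM at best: {ram / total:.0%}")

# HYPOTHETICAL numbers: 5 GB fetched from SSD per token, 5 GB/s sustained reads
print(f"I/O-bound ceiling: {max_tokens_per_sec(5 * GB, 5 * GB):.1f} tokens/sec")
```

Under these assumptions the weights are about four times larger than RAM, so per-token throughput is governed largely by how many bytes each token forces the SSD to deliver.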

The public repo "flash-moe" (author danveloper) documents running the Qwen3.5-397B-A17B Mixture-of-Experts model by streaming the full 209GB on-disk weight set through a MacBook pipeline. (github.com) Benchmarks in that repo report 4.36 tokens/sec using a 4-bit experts FMA kernel and include full tool-calling output as part of the evaluation. (github.com)

The implementation is written in C/Objective-C with hand-tuned Metal shaders and a custom Metal compute pipeline that streams parameters from SSD into DRAM slices. The repo notes describe it explicitly as "no Python, no frameworks". (github.com) Qwen3.5-397B is an MoE model, so inference has to route each token to a subset of experts and load those experts' parameters on demand. That makes the storage footprint and I/O pattern more complex than for a dense model; the repo documents per-expert streaming strategies. (github.com)

Apple's "LLM in a Flash" paper (authors include Karen Khatamifard, Minsik Cho, Carlo C. Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar) describes windowing and row-column bundling to minimize flash-to-DRAM transfers and improve throughput for flash-backed inference. (machinelearning.apple.com) The experiment's engineering choices (contiguous large reads from SSD, compute offload to Metal, and quantized expert kernels) track the cost model and transfer-minimization techniques Apple recommends for DRAM-limited devices. Together they demonstrate a concrete engineering path for storage-backed LLM inference on Apple Silicon. (github.com)
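The on-demand expert loading described above can be illustrated with a minimal sketch. This is not the repo's implementation (which is C/Objective-C and Metal); it is a Python toy under stated assumptions: experts are fixed-size blobs laid out back to back in one file, and a small LRU cache stands in for the DRAM budget. `ExpertStore`, `top_k`, and `make_demo_file` are hypothetical names introduced for illustration.

```python
import os
import tempfile
from collections import OrderedDict

class ExpertStore:
    """Reads fixed-size expert blobs from one file, with a small LRU cache in RAM."""

    def __init__(self, path, expert_bytes, cache_slots):
        self.f = open(path, "rb")
        self.expert_bytes = expert_bytes
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # expert_id -> bytes, least recently used first
        self.disk_reads = 0

    def get(self, expert_id):
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)  # mark as most recently used
            return self.cache[expert_id]
        # One contiguous read per expert; the offset follows from the fixed layout.
        self.f.seek(expert_id * self.expert_bytes)
        blob = self.f.read(self.expert_bytes)
        self.disk_reads += 1
        self.cache[expert_id] = blob
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)  # evict the least recently used expert
        return blob

    def close(self):
        self.f.close()

def top_k(scores, k):
    """Indices of the k highest router scores (ties go to the lower index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]

def make_demo_file(num_experts, expert_bytes):
    """Write a toy weight file in which expert i is filled with byte value i."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        for i in range(num_experts):
            f.write(bytes([i]) * expert_bytes)
    return path

# Toy run: 8 experts of 1 KiB each, room for 3 in RAM, 2 experts chosen per token.
path = make_demo_file(num_experts=8, expert_bytes=1024)
store = ExpertStore(path, expert_bytes=1024, cache_slots=3)
router_scores_per_token = [
    [0.1, 0.9, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0],  # selects experts 1 and 3
    [0.0, 0.7, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0],  # selects 3 and 1 again (cache hits)
    [0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.5],  # selects 2 and 7 (new disk reads)
]
for scores in router_scores_per_token:
    for expert_id in top_k(scores, k=2):
        blob = store.get(expert_id)
        assert blob[0] == expert_id  # the blob really belongs to that expert
print("disk reads:", store.disk_reads, "for 6 expert lookups")
store.close()
os.remove(path)
```

The point of the design is that only the experts the router selects are read, each with a single contiguous read, and experts reused by consecutive tokens are served from memory. A real pipeline would also overlap reads with compute and hand the buffers to the GPU, which this sketch leaves out.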
