397B model runs on one MacBook Pro
A demo ran a 397B-parameter model on a single MacBook Pro by streaming 209 GB of weights, using Flash-MoE and hand-tuned Metal shaders. It reached ~4.4 tokens/sec in pure C and Metal, with no Python frameworks. Getting there took 58 experiments, a vivid proof that extreme weight-streaming and MoE tricks can push large models onto consumer hardware. (x.com)
The code and technical writeup live in a public GitHub repository maintained by danveloper (Dan Woods). The README links to a paper and to "90+ experiments" that document the build and benchmarking process. (github.com) The repository targets the Qwen3.5-397B-A17B model and describes the transformer as 60 layers (45 GatedDeltaNet layers and 15 full-attention layers) with 512 experts per layer, K=4 experts activated per token, and a hidden dimension of 4096. (github.com)
I/O works by streaming quantized expert weights from NVMe with parallel pread() calls and deliberately "trusting the OS" page cache, so only the active experts (~6.75 MB each) are loaded on demand. (github.com) The performance-critical kernels are hand-written Metal compute shaders. One of them is an FMA-optimized dequantization kernel that rearranges the math so the GPU's fused multiply-add unit performs dequantization and multiplication in a single instruction, which makes the inner loop about 12% faster than the naive formulation. (github.com)
The project's results table highlights the quantization trade-offs: 4-bit experts are marked as the production configuration for reliable JSON/tool-calling, while 2-bit repacking shrinks the on-disk expert size to roughly 120 GB and raises peak throughput at the cost of unstable tool outputs. (github.com) Hardware telemetry in the repo lists an Apple M3 Max–class configuration with 48 GB of unified memory, a measured sequential SSD read speed of about 17.5 GB/s, and unified-memory bandwidth of ~400 GB/s during profiling. (github.com)
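To make the streaming idea concrete, here is a minimal C sketch of loading only the active experts with parallel pread() calls. It is an illustration, not the repository's code: the file layout (equal-sized experts stored back to back in one file), the names load_experts, read_job and read_worker, and the MAX_ACTIVE cap are all assumptions made for this example.

```c
#define _XOPEN_SOURCE 700   /* expose pread() under strict -std=c11 */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_ACTIVE 16       /* hypothetical cap on experts read at once */

typedef struct {
    int      fd;
    off_t    offset;        /* where this expert starts in the file */
    size_t   len;           /* bytes to read */
    uint8_t *dst;           /* destination inside the shared buffer */
    int      ok;            /* set to 1 when the read completed */
} read_job;

/* One thread per expert: loop until the whole expert is read. */
static void *read_worker(void *arg)
{
    read_job *job = arg;
    size_t done = 0;
    job->ok = 0;
    while (done < job->len) {
        ssize_t n = pread(job->fd, job->dst + done, job->len - done,
                          job->offset + (off_t)done);
        if (n < 0 && errno == EINTR) continue;  /* interrupted, retry */
        if (n <= 0) return NULL;                /* error or unexpected EOF */
        done += (size_t)n;
    }
    job->ok = 1;
    return NULL;
}

/* Reads the k experts named in ids[] concurrently. Assumes expert i lives at
 * byte offset i * expert_bytes (a made-up layout). Uses plain buffered reads
 * with no custom cache, so the OS page cache decides what stays resident.
 * Returns a malloc'd buffer of k * expert_bytes, or NULL on failure. */
uint8_t *load_experts(int fd, const int *ids, size_t k, size_t expert_bytes)
{
    assert(ids != NULL);
    if (k == 0 || k > MAX_ACTIVE || expert_bytes == 0) return NULL;

    uint8_t *buf = malloc(k * expert_bytes);
    if (buf == NULL) return NULL;

    pthread_t tid[MAX_ACTIVE];
    read_job  job[MAX_ACTIVE];
    size_t started = 0;
    for (; started < k; started++) {
        read_job *j = &job[started];
        j->fd     = fd;
        j->offset = (off_t)ids[started] * (off_t)expert_bytes;
        j->len    = expert_bytes;
        j->dst    = buf + started * expert_bytes;
        j->ok     = 0;
        if (pthread_create(&tid[started], NULL, read_worker, j) != 0) break;
    }

    int ok = (started == k);
    for (size_t i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);
        ok = ok && job[i].ok;
    }
    if (!ok) { free(buf); return NULL; }
    return buf;
}
```

A caller would pass the K=4 expert indices chosen by the router for the current layer and free() the buffer afterwards. As a rough back-of-envelope figure, if every one of the 60 layers routes to 4 experts of ~6.75 MB, a token with nothing cached would need about 1.6 GB of reads, which is why the page cache and fast NVMe matter so much here.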