35B agent runs on $600 M4 Mac mini
A demo shows an agent built on a 35B-parameter model running on a $600 M4 Mac mini (16 GB RAM) at ~30 tokens/sec via SSD streaming. The post claims this outperforms comparable NVIDIA setups by ~18x, with no cloud costs. If accurate, it underscores Apple Silicon's edge for cost-effective local inference in engineering workflows. (x.com)
The demonstration is published as an open-source agent project named "mac-code" (author: walter-grace) that combines an MLX backend, a quantized KV cache, and SSD-paged Mixture-of-Experts inference to run Qwen3.5-35B-A3B on Apple Silicon. (github.com) Independent codebases implementing the same idea include flash-moe, ssdmoe, and kandiga, each reporting materially different sustained decode rates depending on implementation and tuning (flash-moe: ~11.5 tok/s with K=6; ssdmoe: 7–12 tok/s across cold and warm runs; kandiga: ~3.5–6.5 tok/s depending on K). (github.com)
Those per-repo differences trace to three measurable runtime variables: the routed top-K (active experts per token), which directly multiplies SSD reads; SSD sustained read bandwidth (ssdmoe measured ~5.6 GB/s on an internal drive); and the implementation stack (pure C/Metal kernels in flash-moe vs Python+MLX wrappers), which changes kernel-launch and memory-traffic overheads. (github.com) Model disk and memory footprints are listed in the projects: expert shards and repacked files occupy roughly 17–19 GB on SSD, while active runtime memory for shared layers plus KV cache ranges from ~1 GB to ~2.5 GB in reported runs, enabling operation on Macs with 8–16 GB of unified memory. (github.com)
Several repos publish end-to-end agent integrations rather than microbenchmarks: mac-code bundles MLX/llama.cpp backends for tool calling and persistent context, and flash-moe documents a 2.5 s time-to-first-token in its optimized pipeline for production-style flows. (github.com) Reproducibility caveats are explicit in the sources: warm runs (server already up, page cache hot) consistently show higher throughput than cold starts, model snapshot format and expert repacking affect I/O patterns, and quantization choices shift work between CPU and GPU. (github.com)
The ecosystem signal is operational: teams wanting to replicate these results must instrument SSD sustained-read testing, standardize expert-repacking pipelines, and invest in Metal/NEON kernel tuning and MLX integration to close the gap between prototype demos and reliable production agent deployments. (github.com)
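As a rough illustration of the SSD sustained-read testing mentioned above, the Python sketch below times a sequential read of a file and then turns the measured bandwidth into an I/O-only ceiling on decode rate. It is a minimal sketch under stated assumptions, not code from any of the cited repos: the function names `sustained_read_gbps` and `io_bound_tokens_per_sec` are hypothetical, and so are the layer count, per-expert byte size, and cache hit rate passed to them. The model assumes every routed expert that misses the cache is read from SSD once per token and ignores compute, kernel-launch, and read-overlap effects.

```python
import sys
import time


def sustained_read_gbps(path, chunk_bytes=8 * 1024 * 1024):
    """Sequentially read `path` in large chunks and return throughput in GB/s.

    A file already held in the OS page cache reads much faster than the SSD
    can sustain, so a cold measurement needs a file not read recently.
    """
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:  # no Python-level buffering
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return total / elapsed / 1e9


def io_bound_tokens_per_sec(read_gbps, top_k, moe_layers, expert_bytes,
                            cache_hit_rate=0.0):
    """I/O-only upper bound on decode rate for SSD-streamed experts.

    Assumes each token reads `top_k` experts in each of `moe_layers` layers,
    and that a fraction `cache_hit_rate` of those reads is served from RAM.
    """
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    bytes_per_token = top_k * moe_layers * expert_bytes * (1.0 - cache_hit_rate)
    return read_gbps * 1e9 / bytes_per_token


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ssd_bound.py /path/to/large_file
    gbps = sustained_read_gbps(sys.argv[1])
    print(f"measured sequential read: {gbps:.2f} GB/s")
    # Placeholder model shape (hypothetical values, not taken from any repo).
    for k in (4, 8):
        ceiling = io_bound_tokens_per_sec(gbps, top_k=k, moe_layers=40,
                                          expert_bytes=2_000_000)
        print(f"K={k}: I/O-bound ceiling of about {ceiling:.1f} tok/s")
```

Because bytes read per token scale linearly with K in this model, halving K doubles the ceiling, and a warm cache raises it by a factor of 1 / (1 - hit rate). Real decode rates will sit below the ceiling once compute and kernel overheads are included.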