Gemma 4 runs natively on M5
A demo showed Gemma 4 (26B) running 100% natively on an M5 Max MacBook Pro with Swift inference at over 100 tokens/sec, using Apple's MLX stack for uncensored local reasoning. The demonstration highlights high-throughput local model inference on Apple Silicon without offloading to servers. For teams building local-first features, the result is a concrete data point about what current client hardware can achieve. (x.com)
A language model is software that predicts the next token — a chunk of text — and speed is usually measured in tokens per second. In a recent demo, Gemma 4’s 26B A4B model ran entirely on an M5 Max MacBook Pro at more than 100 tokens a second in native Swift, without sending work to a server. (youtube.com) (blog.google) Google introduced Gemma 4 on April 2, 2026 as an open-weight model family with four sizes, including a 26B Mixture-of-Experts model and a 31B dense model. Google says the models are built for reasoning and agentic workflows, and the 26B variant is aimed at consumer GPUs and workstations rather than phones. (blog.google) (ai.google.dev) Mixture-of-Experts is a design that activates only part of a model at a time, like calling a few specialists instead of the whole office. Google’s model card labels the 26B release as “26B A4B,” meaning 26 billion total parameters with about 4 billion active for each token, which lowers the amount of compute needed during generation. (ai.google.dev) (artificialanalysis.ai) The demo used Apple’s MLX stack, which Apple describes as a machine-learning framework optimized for Apple silicon and its unified memory design. Apple’s MLX project includes Swift, C++, and C bindings, which is why a native Swift inference path is possible without routing through Python. (opensource.apple.com) (github.com) Apple shipped MacBook Pro models with M5 Pro and M5 Max on March 3, 2026, and the M5 Max configuration supports up to 614GB/s of memory bandwidth and up to 128GB of unified memory. Those specs matter for local inference because large models move huge amounts of weights and cache data through memory, and Apple’s design keeps CPU and GPU on the same memory pool. (apple.com) (support.apple.com) (github.com) The SwiftLM demo describes the setup as “100% native Metal & Swift” and says it avoids Python overhead while exposing an OpenAI-compatible application programming interface. The same demo claims techniques including key-value cache compression and solid-state-drive expert streaming to fit larger contexts and models on Apple hardware with less active memory use. (youtube.com) Google has spent the past year pushing Gemma toward edge and on-device use, including support for laptops, mobile devices, and Android’s AI Core developer preview. The model card says Gemma 4 supports up to a 256K-token context window, multimodal input across the family, and more than 140 languages. (developers.googleblog.com) (ai.google.dev) Not every local stack is hitting the same numbers on the same class of hardware. In an Ollama issue filed in April 2026, a user reported roughly 75 tokens a second on Gemma 4 26B MoE on an M5 Max with a different runner, suggesting that software choices, quantization, and memory handling are still moving the ceiling. (github.com) The practical point is simple: a laptop-class Mac is now being shown running a reasoning model in the 26B range at interactive speed without cloud offload. For developers weighing local-first features, the question is shifting from whether it can run on-device to which stack gets the most out of the hardware. (youtube.com) (apple.com)