M4 Mac Mini runs 35B Qwen

Benchmark posts show an M4 Mac mini running a 35-billion-parameter Qwen model at about 17.3 tokens per second on 16 GB of RAM, a sign that compact hardware can now handle heavier local LLM workloads (x.com). The same thread argues that the Apple Neural Engine is still underused in some workflows, part of a broader debate over where local model acceleration yields the most benefit (x.com).

A large language model is a text predictor, and the larger it gets, the more memory and compute it usually needs. Posts this week showed a base Mac mini with Apple’s M4 chip generating about 17.3 tokens a second with a roughly 35-billion-parameter Qwen model on 16 gigabytes of unified memory. (x.com)

Apple sells the M4 Mac mini starting at $599 with a 10-core central processing unit, 10-core graphics processing unit, 16-core Neural Engine, and 16 gigabytes of memory in the base configuration. That is the same memory size cited in the benchmark post. (apple.com)

The Qwen model family includes 32.5-billion-parameter releases such as Qwen2.5-Coder-32B-Instruct and QwQ-32B, both published by the Qwen team on Hugging Face. Those model cards list 32.5 billion parameters and 131,072-token context windows. (huggingface.co, huggingface.co)

Apple’s MLX software is the main reason these demos are possible on small Macs. Apple describes MLX as an array framework optimized for Apple silicon’s unified memory, which lets the central processor and graphics processor work from the same memory pool instead of copying data back and forth. (opensource.apple.com, mlx-framework.org)

MLX can run model operations on the central processor or graphics processor, and Apple says it includes tools for text generation and fine-tuning on Apple silicon. Apple’s public MLX materials do not say it runs large language model inference on the Apple Neural Engine by default. (mlx-framework.org, github.com)

Apple makes a separate case for the Neural Engine through Core ML, its app deployment framework for on-device machine learning. Apple says Core ML is designed to use the central processor, graphics processor, and Neural Engine “in the most efficient way,” and Apple has separately published research on deploying transformer models on the Neural Engine. (developer.apple.com, machinelearning.apple.com)

That split helps explain the argument in the benchmark thread. A developer using MLX for local chat may see strong results from unified-memory graphics workloads, while a developer converting a model to Core ML may focus on Neural Engine utilization, power draw, and app integration instead. (mlx-framework.org, developer.apple.com)

Apple has been pushing both tracks at once. In November 2024, Apple published a guide on running Llama 3.1 on-device with Core ML, and in 2025 it said its own on-device Apple Intelligence language model was about 3 billion parameters and optimized with techniques including 2-bit quantization-aware training. (machinelearning.apple.com, machinelearning.apple.com)

The immediate takeaway from the Mac mini posts is narrower than the hype around them: a $599 desktop with 16 gigabytes of shared memory can now run a quantized Qwen model in the 32-billion-parameter class at interactive speed. The next round of testing will decide whether developers care more about raw tokens per second, lower power use, or better Neural Engine support. (apple.com, x.com)
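
A rough sketch in Python of the arithmetic behind that takeaway. The posts as summarized here do not state which quantization level was used, so the bit widths below are assumptions chosen for illustration, and the function names are hypothetical. The estimate covers model weights only; the key-value cache, activations, and the operating system all need memory on top of it.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady decoding rate."""
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    PARAMS_BILLION = 32.5      # parameter count listed on the Qwen model cards
    UNIFIED_MEMORY_GB = 16.0   # base M4 Mac mini configuration
    REPORTED_TOK_PER_SEC = 17.3

    # Bit widths are illustrative assumptions, not figures from the posts.
    for bits in (16, 8, 4, 3, 2):
        size = weight_memory_gb(PARAMS_BILLION, bits)
        verdict = "under" if size < UNIFIED_MEMORY_GB else "over"
        print(f"{bits:>2}-bit weights: {size:6.2f} GB ({verdict} {UNIFIED_MEMORY_GB:.0f} GB)")

    # Hypothetical 500-token reply at the reported rate.
    wait = seconds_to_generate(500, REPORTED_TOK_PER_SEC)
    print(f"500 tokens at {REPORTED_TOK_PER_SEC} tok/s: {wait:.1f} s")
```

Under these assumptions, the weights of a 32.5-billion-parameter model come to about 16.25 GB at 4 bits per weight, already more than the 16 GB the base machine has in total. That suggests why the level of quantization matters so much for a result like the one in the posts. At 17.3 tokens per second, a 500-token reply would take about 29 seconds.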
