M1 Max inference wins

- A recent benchmark shows a 32GB M1 Max running Qwen 3.6 35B at roughly 66 tokens per second on inference workloads.
- Reported unified memory bandwidth for the M1 Max is about 400 GB/s, which aids inference performance.
- Comparisons note that small clusters of M4s can match A100 efficiency, but the A100 still leads on raw training throughput.

Apple’s 2021 M1 Max is posting inference numbers that put a 32GB Mac back into the local artificial intelligence conversation: one recent benchmark showed Qwen 3.6 35B running at about 66 tokens per second. (huggingface.co, x.com)

Inference is the part where a model generates text after training is finished, and memory speed often sets the pace because the chip has to keep fetching model weights. Apple lists the M1 Max at 400 gigabytes per second of unified memory bandwidth in 32GB and 64GB configurations. (support.apple.com)

Qwen 3.6-35B-A3B is a 35-billion-parameter model with about 3 billion parameters activated at a time, a mixture-of-experts design that reduces how much of the network runs on each token. The model card says it is compatible with common inference stacks including Hugging Face Transformers, vLLM, SGLang, and KTransformers. (huggingface.co)

Apple’s advantage in this setup is unified memory, which means the processor and graphics cores read from the same pool instead of copying data back and forth. That matters for local model serving, where moving tens of gigabytes can become the bottleneck before raw math does. (apple.com, support.apple.com)

Newer Apple silicon pushes that idea further. Apple says the 2024 M4 Pro reaches 273GB/s of memory bandwidth, while M4 Max configurations reach 410GB/s or 546GB/s, numbers that have prompted developers to test small Mac clusters for inference efficiency. (support.apple.com)

Nvidia’s A100 still sits in a different class for large-scale training. Nvidia’s datasheet lists up to 80GB of high-bandwidth memory and about 2TB/s of memory bandwidth on the 80GB SXM version, alongside tensor performance aimed at data-center training and inference workloads. (nvidia.com)

That leaves two separate comparisons in play.
A Mac can look strong on local inference per watt or per dollar for a developer running one model at a desk, while an A100 still offers far more headroom for training runs, larger batches, and multi-GPU scaling in servers. (nvidia.com, support.apple.com)

The timing also lines up with a shift in open models. Qwen released the first open-weight Qwen 3.6 variant in April 2026, positioning it for coding and agent-style work, which are exactly the workloads hobbyists and startups often test on local machines first. (huggingface.co, github.com)

The result is a narrower question than “which chip is fastest.” For a user who wants to run a modern 35B-class model locally, the data point getting attention is that a four-and-a-half-year-old M1 Max can still produce output fast enough to feel interactive. (x.com, support.apple.com)
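
The memory-bandwidth argument can be checked with back-of-envelope arithmetic. The sketch below is only a rough estimate under stated assumptions: it assumes 4-bit quantized weights (the benchmark's quantization is not given here), assumes each generated token reads every active weight exactly once, and ignores the KV cache, attention, routing, and compute overhead. The function names are hypothetical, chosen for illustration.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


def bandwidth_bound_tps(bandwidth_gb_s: float,
                        active_params_billion: float,
                        bits_per_weight: float) -> float:
    """Ceiling on tokens/s if each token streams all active weights once."""
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    BANDWIDTH_GB_S = 400   # M1 Max figure cited in the article
    TOTAL_PARAMS_B = 35    # Qwen 3.6-35B-A3B total parameters
    ACTIVE_PARAMS_B = 3    # parameters activated per token
    BITS = 4               # ASSUMPTION: 4-bit quantization, not stated in the article

    print(f"weights in memory: {weights_gb(TOTAL_PARAMS_B, BITS):.1f} GB")
    print(f"ceiling, 3B active: "
          f"{bandwidth_bound_tps(BANDWIDTH_GB_S, ACTIVE_PARAMS_B, BITS):.1f} tok/s")
    print(f"ceiling, dense 35B: "
          f"{bandwidth_bound_tps(BANDWIDTH_GB_S, TOTAL_PARAMS_B, BITS):.1f} tok/s")
```

Under these assumptions the full weights take about 17.5 GB, which fits in 32GB, and the bandwidth ceiling is roughly 267 tokens per second with 3 billion active parameters versus about 23 if all 35 billion had to be read for every token. The reported 66 tokens per second falls between those two figures, which is consistent with the mixture-of-experts design doing much of the work, though real throughput depends on factors this estimate leaves out.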
