M5 Max vs NVIDIA benchmark

- A community benchmark ran Qwen3.6‑35B‑A3B on Apple M5 Max and RTX 3090/4090/5090 using llama.cpp.
- The test compared on‑device Apple Silicon against multiple NVIDIA GPUs across a 35‑billion‑parameter decoder model.
- Results highlight Apple Silicon’s competitiveness on selected inference workloads versus discrete GPUs. (x.com)

Running a 35-billion-parameter language model on a laptop looked like a desktop GPU job until a community test put Apple’s M5 Max in the same llama.cpp benchmark as Nvidia’s RTX 3090, 4090, and 5090. (youtube.com) The test used Qwen3.6-35B-A3B, a newly released open model from Alibaba’s Qwen team, and the same inference engine, prompts, and settings across all four machines. The benchmark video was posted on April 20, 2026 by the Tech-Practice channel. (github.com) (youtube.com)

A model like Qwen3.6-35B-A3B does not use all 35 billion parameters for every token. Qwen’s repository describes it as a sparse “Mixture of Experts” system, and independent testing says only about 3.6 billion parameters are active per forward pass, which cuts the compute load enough for consumer hardware. (github.com) (aminrj.com)

llama.cpp is the software layer that makes this comparison possible. Its GitHub project describes it as a C and C++ inference engine built to run large language models locally across a wide range of hardware, including Apple Metal and Nvidia CUDA backends. (github.com)

Apple’s side of the comparison rests on memory design, not just raw graphics speed. Apple says the top M5 Max configuration ships with up to 128 gigabytes of unified memory and 614 gigabytes per second of memory bandwidth in the 16-inch MacBook Pro introduced in March 2026. (support.apple.com) (apple.com)

Nvidia’s cards still bring far more specialized graphics horsepower, but their local-model limits are often set by video memory. A separate April 17, 2026 write-up on Qwen3.6 running in llama.cpp reported the model using about 24.2 gigabytes of video memory at a 65,000-token context window, which fits a 24GB-class card only with careful settings. (aminrj.com)

That is why this kind of benchmark gets attention from developers who run models on their own machines instead of rented cloud servers. The question is no longer only which chip is fastest, but which machine can hold the model, keep a long context window, and generate tokens at usable speed without extra hardware. (support.apple.com) (aminrj.com)

The comparison also lands a month after Apple launched the M5 Max MacBook Pro and four days after Qwen’s GitHub repository added Qwen3.6-35B-A3B. That compressed timeline means the benchmark is an early read on a new chip and a new model, not a settled industry standard. (apple.com) (github.com)

The headline from the test is not that a laptop replaces every high-end GPU. It is that, on one fresh local-inference workload with matched llama.cpp settings, Apple’s M5 Max belongs in the same conversation as Nvidia’s enthusiast cards. (youtube.com)
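For readers who want to check the memory argument themselves, here is a rough back-of-envelope sketch in Python. It is not part of the benchmark. It reuses the figures quoted above (24.2 GB reported footprint, about 3.6 billion active parameters, 614 GB/s of M5 Max bandwidth, 128 GB of unified memory) and adds two things the sources do not supply: the published VRAM sizes of the three RTX cards, and an assumed quantization of 4.5 bits per weight. The function names are hypothetical, and the decode figure is only a theoretical ceiling that ignores KV-cache reads, compute limits, and software overhead.

```python
# Back-of-envelope memory and bandwidth check for a sparse MoE model.
# Values marked "source" are quoted in the article; the rest are assumptions.

REPORTED_FOOTPRINT_GB = 24.2    # source: Qwen3.6 in llama.cpp at a ~65,000-token context
ACTIVE_PARAMS = 3.6e9           # source: parameters active per forward pass
M5_MAX_BANDWIDTH_GBPS = 614     # source: Apple's top M5 Max configuration

# Assumption: roughly 4-bit quantized weights. The sources do not state
# which quantization the benchmark used.
ASSUMED_BITS_PER_WEIGHT = 4.5

# Memory each machine offers the model, in GB. The M5 Max figure is the
# source's 128 GB unified memory (shared with the OS, so not all usable).
# RTX values are the cards' published VRAM sizes.
MEMORY_GB = {
    "M5 Max (128 GB)": 128,
    "RTX 3090": 24,
    "RTX 4090": 24,
    "RTX 5090": 32,
}


def headroom_gb(memory_gb, footprint_gb=REPORTED_FOOTPRINT_GB):
    """Memory left after loading the model; negative means it does not fit as-is."""
    return memory_gb - footprint_gb


def decode_ceiling_tokens_per_s(bandwidth_gbps,
                                active_params=ACTIVE_PARAMS,
                                bits_per_weight=ASSUMED_BITS_PER_WEIGHT):
    """Upper bound on tokens/s if every token must read all active weights once."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token


if __name__ == "__main__":
    for name, mem in MEMORY_GB.items():
        print(f"{name}: headroom {headroom_gb(mem):+.1f} GB")
    ceiling = decode_ceiling_tokens_per_s(M5_MAX_BANDWIDTH_GBPS)
    print(f"M5 Max bandwidth-bound decode ceiling: about {ceiling:.0f} tokens/s")
```

A negative headroom on the 24GB cards matches the point above: the reported footprint fits only if the context window or quantization is trimmed. The ceiling depends entirely on the assumed bits per weight, so it should be read as a sanity bound, not a prediction of measured speed.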
