128GB Mac runs DeepSeek at 50+ tokens/s

- Jun Song posted on X on May 17 that a 128GB Apple M5 Max Mac ran DeepSeek V4 Flash locally at more than 50 tokens per second.
- The clearest benchmark now publicly visible is 49.9 tokens per second at 4x batch on an M5 Max 128GB oMLX run.
- More benchmark pages are live on oMLX, while NVIDIA forum and GitHub posts continue tracking DGX Spark DeepSeek V4 Flash runs.

Jun Song said in a May 17 post on X that a single 128GB Mac with Apple’s M5 Max chip could run DeepSeek V4 Flash at more than 50 tokens per second. The claim circulated alongside a comparison to NVIDIA’s DGX Spark systems, which Song said were slower in his tests. Public benchmark pages and vendor documentation reviewed on May 18 support parts of that picture, though the exact X post metrics could not be independently retrieved from X’s public search results. Apple introduced MacBook Pro models with M5 Max on March 3, and NVIDIA markets DGX Spark as a 128GB desktop AI system built around its GB10 Grace Blackwell chip.

### What is the performance number that can be verified right now?

oMLX benchmark pages crawled on May 17 and May 18 show DeepSeek-V4-Flash-2bit-DQ running on an M5 Max with 128GB of memory at generation speeds in the high 30s to low 40s tokens per second for single-stream runs, depending on context length. One page shows 43.1 tokens per second at 1,024 context, another shows 39.1 at 4,096 context, and another shows 38.3 at 16,384 context. (apple.com)

The same May 7 oMLX page shows a batching table with 49.9 tokens per second at 4x batch, which is the closest independently visible public number to the “50+ tokens/s” claim. Another oMLX entry crawled yesterday showed 39.9 tokens per second at 4,096 context and 43.7 at 1,024 context on the same M5 Max 128GB class of system. (omlx.ai)

### What exactly is the Mac hardware in question?

Apple said on March 3 that its new 14-inch and 16-inch MacBook Pro models ship with M5 Pro and M5 Max chips and higher unified memory bandwidth. Apple also said the machines were available for pre-order on March 4 and began reaching customers on March 11. (omlx.ai)

The benchmark pages identify the tested machine as an “M5 Max (40c)” with 128 GB of memory running macOS 26.4.1 and oMLX development builds. Those pages list peak memory use around 91 GB for the DeepSeek V4 Flash quantized runs shown.
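
For a rough sense of scale, the single-stream and memory figures cited above can be put through some quick arithmetic. The sketch below is illustrative only: the variable and function names are hypothetical, the inputs are the three single-stream rates and the roughly 91 GB peak memory figure quoted from the oMLX pages, and the rates come from separate benchmark pages, so the percentages are approximate.

```python
# Single-stream generation speeds (tokens/s) as cited from the oMLX pages,
# keyed by context length in tokens. The figures come from separate pages.
single_stream = {1024: 43.1, 4096: 39.1, 16384: 38.3}


def percent_change(baseline: float, value: float) -> float:
    """Return the percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100.0


# Change in generation speed relative to the 1,024-context run.
baseline = single_stream[1024]
slowdown = {
    ctx: round(percent_change(baseline, tps), 1)
    for ctx, tps in single_stream.items()
}

# Share of the machine's 128 GB taken by the roughly 91 GB peak usage.
memory_share = round(91 / 128 * 100, 1)

for ctx, pct in slowdown.items():
    print(f"{ctx:>6} context: {single_stream[ctx]} tok/s ({pct:+.1f}% vs 1,024)")
print(f"Peak memory share of 128 GB: {memory_share}%")
```

On these numbers, generation speed drops by roughly a tenth between the shortest and longest cited contexts, and the quantized model occupies a little over two-thirds of the machine's memory.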
### What do the DGX Spark comparison numbers show?

NVIDIA says DGX Spark delivers up to one petaFLOP of FP4 AI performance and includes 128 GB of coherent unified system memory. (apple.com)

NVIDIA also says the system is designed to run AI models up to 200 billion parameters on the desktop. A GitHub repository updated on May 4 for running DeepSeek V4 Flash on dual DGX Spark nodes reported about 14 tokens per second for Chinese and English question-answering and code workloads, with the English run showing occasional bad starting tokens. (omlx.ai)

That project specifies two DGX Spark units, each with 128GB of memory, linked over 200Gbps RDMA. A Docker image page tied to the same effort describes the setup as running at about 12 tokens per second. (nvidia.com)

A separate NVIDIA developer forum post from last week described a “2-bit hybrid quantization” of DeepSeek V4 Flash that fits on a single DGX Spark and asked for help validating results across more units and workloads. The forum post framed the comparison directly against a 128GB Apple Silicon Mac running MLX-based recipes. (github.com)

### Are these benchmarks measuring the same thing?

The benchmark pages use multiple metrics, including prompt-processing throughput, token-generation throughput and time-to-first-token. The M5 Max oMLX pages list prompt-processing rates above 500 tokens per second in some 4,096-token tests, while token-generation rates remain around 39 to 43 tokens per second for single-stream runs. (forums.developer.nvidia.com)

The DGX Spark GitHub project reports a simpler “~14 tok/s” figure tied to end-to-end question-answering and code generation on dual nodes. Because the public sources do not present a matched methodology, matched quantization, or matched batching setup, the available evidence supports that the Mac results are faster in the cited public runs, but not that the systems were benchmarked under a single standardized test. (omlx.ai) That is an inference from the published benchmark formats and descriptions.

### Where can readers watch this story develop?

oMLX benchmark pages for DeepSeek V4 Flash on M5 Max were still updating as of May 17, with entries spanning 1,024 through 64,000 context lengths. NVIDIA’s DGX Spark page remains live with product specifications, and the GitHub and NVIDIA forum threads tied to DeepSeek V4 Flash on Spark were both active in May. (omlx.ai) (github.com)
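
Readers who want to line up the cited figures themselves can do so with a few lines of Python. This is a minimal sketch with hypothetical names, using only numbers quoted in this article. It treats the GitHub project's "about 14 tokens per second" as exactly 14.0, and it inherits the caveat above: the runs differ in methodology, quantization and batching, so the ratios describe the published figures and are not a controlled comparison.

```python
# Token-generation figures cited in the article, in tokens per second.
mac_single_stream = {"1,024 ctx": 43.1, "4,096 ctx": 39.1, "16,384 ctx": 38.3}
mac_batch_4x = 49.9      # oMLX batching table, 4x batch
spark_dual_node = 14.0   # assumption: "about 14 tok/s" taken as exactly 14.0
claimed_threshold = 50.0  # the "50+ tokens/s" claim


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator rounded to two decimal places."""
    return round(numerator / denominator, 2)


# Ratio of each cited Mac figure to the cited dual-node DGX Spark figure.
ratios = {label: ratio(tps, spark_dual_node) for label, tps in mac_single_stream.items()}
ratios["4x batch"] = ratio(mac_batch_4x, spark_dual_node)

# Distance between the best publicly visible figure and the 50 tok/s claim.
gap_to_claim = round(claimed_threshold - mac_batch_4x, 1)

for label, value in ratios.items():
    print(f"M5 Max {label}: {value}x the cited dual-Spark figure")
print(f"Gap between 4x-batch figure and 50 tok/s: {gap_to_claim} tok/s")
```

The calculation keeps the batch figure separate from the single-stream figures because the two are reported in different tables and are not interchangeable.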
