LLM speedups on Apple Silicon via Metal
A demo of dflash-mlx shows roughly 4x speed gains for LLM inference on Macs by using Metal with block-diffusion speculative decoding and hand-optimized components. The work is presented as an implementation that leverages Apple Silicon GPU paths for faster model runs. (x.com)
Large language models on Macs are getting much faster: a new Apple Silicon port of DFlash reports about four times the output speed of standard MLX runs on some Qwen models. (github.com)

Language models usually write one token at a time, which keeps inference sequential and leaves graphics processors waiting between steps. DFlash changes that by having a smaller “draft” model propose a block of 16 tokens at once, then having the larger target model verify that block in one pass. (arxiv.org)

The Apple Silicon implementation is built on MLX, Apple’s machine learning framework for its own chips, and uses Metal, Apple’s low-level graphics and compute layer, to reach the graphics processor directly. Apple says Metal is designed for machine learning workloads on Apple silicon, including Macs with M-series chips. (mlx-framework.org) (developer.apple.com)

In the repository published today, developer bstnxbt reported a speedup of 4.10 times on Qwen3.5-4B at 2,048 tokens, with throughput rising from 53.74 tokens per second to 219.83. The same benchmark shows Qwen3.5-9B at 2,048 tokens rising from 30.96 tokens per second to 127.07, a gain of 4.13 times. (github.com)

The repository describes the output as “lossless,” meaning every token that is finally emitted has been checked by the target model before it is accepted. In the posted benchmarks, acceptance rates cluster around 88 to 89 percent for the 4-billion- and 9-billion-parameter Qwen3.5 models. (github.com)

The gains are smaller on the larger quantized models in the same table. Qwen3.5-27B-4bit shows a speedup of 1.90 times at 2,048 tokens, and Qwen3.5-35B-A3B-4bit shows 1.69 times, suggesting the method helps most when the draft-and-verify loop can stay efficient relative to model size. (github.com)

The underlying DFlash paper was posted to arXiv on February 5, 2026 by Jian Chen, Yesheng Liang and Zhijian Liu. The authors said their block-diffusion drafting method delivered more than 6 times lossless acceleration in experiments and up to 2.5 times the speedup of EAGLE-3, a prior speculative decoding method. (arxiv.org)

Getting that idea onto Macs required extra engineering because MLX does not ship with speculative-decoding building blocks. A separate Apple Silicon DFlash repository says its developers had to build the draft-and-verify loop on top of Metal, including hidden-state extraction from the target model and custom cache rollback for partial acceptance. (github.com)

The new repository also says it uses targeted kernels for rollback and long-context verification, including a custom Metal attention kernel for contexts of 1,024 tokens or more. That is the part of the work aimed at turning a research decoding trick into something that runs cleanly on consumer Macs. (github.com)

For people running models locally, the pitch is simple: keep the target model’s verified output, but spend fewer full forward passes to get there. The early Apple Silicon numbers suggest that, on the right Qwen setups, that trade can move a Mac from roughly 31 to 54 tokens a second into the 127 to 220 range. (github.com)
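
For readers who want to see why draft-and-verify can be "lossless" while using fewer target passes, here is a minimal Python sketch of generic greedy speculative decoding. It is not the dflash-mlx code and does not use MLX or Metal. The toy functions `target_next` and `draft_next`, the vocabulary size and the rule for when the draft is wrong are all assumptions made up for illustration. The draft block is also built one token at a time here, rather than all at once as a block-diffusion drafter would.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large target model: a deterministic next token."""
    return (sum(ctx) * 31 + len(ctx)) % VOCAB


def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for the draft model: agrees with the target except
    when the context length is a multiple of 7 (an arbitrary assumption)."""
    guess = target_next(ctx)
    return (guess + 1) % VOCAB if len(ctx) % 7 == 0 else guess


def draft_block(ctx: List[int], k: int) -> List[int]:
    """Propose k tokens. A real block drafter would emit them together."""
    block: List[int] = []
    for _ in range(k):
        block.append(draft_next(ctx + block))
    return block


def verify_block(ctx: List[int], block: List[int]) -> List[int]:
    """Check the block against the target. In a real model this is one
    forward pass over all positions; here it is simulated position by position.
    Returns the accepted prefix plus one token chosen by the target itself."""
    accepted: List[int] = []
    for tok in block:
        expected = target_next(ctx + accepted)
        if tok != expected:
            # Partial acceptance: drop the rest and emit the target's token.
            # A real implementation would also roll back its cache here.
            return accepted + [expected]
        accepted.append(tok)
    # Whole block accepted: the same pass also yields one extra target token.
    return accepted + [target_next(ctx + accepted)]


def greedy_generate(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target pass per emitted token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_generate(
    prompt: List[int], n_new: int, block_size: int = 16
) -> Tuple[List[int], int]:
    """Draft-and-verify loop. Returns the new tokens and the pass count."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_new:
        out.extend(verify_block(out, draft_block(out, block_size)))
        passes += 1
    return out[len(prompt):len(prompt) + n_new], passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_generate(prompt, 64)
    tokens, passes = speculative_generate(prompt, 64)
    print("identical output:", tokens == baseline)
    print("target passes:", passes, "vs", len(baseline), "for the baseline")
```

Because every emitted token is either confirmed or chosen by the target, the output matches plain greedy decoding token for token. The saving comes only from how many tokens each verification pass accepts, which is why acceptance rate matters so much in the benchmarks above.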