SwiftLM speeds M5 Pro 2.5x

- Simba Zhang on May 15, 2026 posted SwiftLM benchmark results showing MTP speculative decoding and TurboQuant running on Apple’s M5 Pro.
- The headline figure was a reported 2.5x speedup at 100,000-token context, alongside a 56% GPU-memory reduction for Gemma 4-26B.
- SwiftLM’s public GitHub repository added MTP benchmark documentation this week, and Zhang’s X post contains the cited benchmark screenshots.

Simba Zhang posted benchmark results on May 15 showing SwiftLM running two inference optimizations on Apple’s M5 Pro: MTP speculative decoding to raise generation throughput and TurboQuant to shrink key-value cache memory use. The results, published in an X post and reflected in recent updates to SwiftLM’s public GitHub repository, centered on long-context local inference on Apple Silicon.

Zhang said the setup delivered a 2.5x speedup at 100,000-token context lengths. He also showed a 56% reduction in GPU memory use for Gemma 4-26B. Those figures matter because long context windows often turn memory, not raw compute, into the main limit on running large models locally.

Google Research said in a March 24 post that TurboQuant is designed to compress vectors used in model inference, including key-value cache data, to lower memory costs without hurting model quality. SwiftLM describes itself as a native Swift inference server for Apple Silicon built on MLX, with an OpenAI-compatible API and TurboQuant key-value cache compression.

### What exactly did Zhang say improved on the M5 Pro?

Zhang’s May 15 post said SwiftLM combined MTP speculative decoding with TurboQuant on an M5 Pro system and reached a 2.5x speedup at 100k context. The post also said the same optimization path cut GPU memory use by 56% for Gemma 4-26B.

The GitHub repository for SharpAI’s SwiftLM showed a README update “yesterday” adding “MTP speculative decoding benchmark results (M5 Pro 64GB).” That update does not, by itself, independently reproduce every number in Zhang’s post, but it does show the project is actively documenting the benchmark path Zhang referenced.

### What are MTP speculative decoding and TurboQuant doing here?

Speculative decoding is a runtime technique that uses a smaller or faster draft path to propose tokens before a larger model verifies them. The goal is to reduce the serial cost of generation, which is especially painful during long responses.

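
The propose-then-verify loop described above can be illustrated with a minimal greedy sketch. This is not SwiftLM’s implementation or its MTP draft mechanism: `target_next` and `draft_next` are hypothetical toy stand-ins for a large model and a cheap draft path, and the verification step calls the target once per position, where a real system would score all proposed positions in a single batched pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(context: List[int]) -> int:
    # Hypothetical "large model": a deterministic toy rule, not a real LLM.
    return (context[-1] * 3 + len(context)) % 11


def draft_next(context: List[int]) -> int:
    # Hypothetical cheap draft: agrees with the target except when the
    # context length is a multiple of 4, where it deliberately guesses wrong.
    guess = target_next(context)
    return (guess + 1) % 11 if len(context) % 4 == 0 else guess


def greedy_decode(next_token: NextToken, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0  # number of verification rounds (stand-in for target passes)
    while len(out) < goal:
        rounds += 1
        # 1. The draft path proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every proposal matched; the target also supplies one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rounds = speculative_decode(target_next, draft_next, prompt, n_new=20)
    print(tokens == greedy_decode(target_next, prompt, 20), rounds)
```

Under greedy acceptance the output is identical to decoding with the target alone; the potential saving comes from needing fewer verification rounds than generated tokens. How large that saving is in practice depends on how often the draft agrees with the target, which this toy rule fixes artificially.
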
TurboQuant is a compression method aimed at reducing the memory footprint of high-dimensional vectors, including the key-value cache used during transformer inference. (github.com) Google Research said on March 24 that TurboQuant targets the “memory overhead” problem in vector quantization and is intended to lower memory costs in AI systems. In Zhang’s demonstration, that translates into less GPU memory pressure during long-context runs on Apple Silicon.

### Why does the 100,000-token context number matter?

A 100,000-token context window is large enough that cache growth can dominate the memory budget on a local machine. SwiftLM’s own project materials and related demos emphasize long-context operation on Apple Silicon, including 100k-scale windows on Macs with unified memory.

That is why the two reported gains fit together. The 2.5x figure addresses throughput at long context, while the 56% figure addresses the memory strain that usually comes with keeping a large cache resident. (research.google) Zhang’s post presented those as paired optimizations rather than separate wins.

### Why is Apple Silicon the setting for this benchmark?

SwiftLM is built specifically for Apple Silicon and markets itself as a native Swift and MLX inference server rather than a Python-based stack. The repository says the software is compiled to a single binary and is designed for local serving on Macs and iOS devices.

Apple’s unified memory architecture changes the trade-offs for local inference because model weights, cache state and the rest of the system share one memory pool. (github.com) That makes memory compression techniques especially relevant when context windows grow. Zhang’s benchmark used an M5 Pro, and the repository’s recent README update specifically mentions M5 Pro 64GB benchmark results.

### How much should readers treat these numbers as settled?

The reported figures come from a researcher’s public benchmark post and project documentation, not from a peer-reviewed benchmark suite or an independent lab comparison. Zhang named the model, hardware class and techniques, but the available public materials do not yet provide a full third-party replication package in the way a formal benchmark paper would.

Still, the underlying pieces are public. (github.com) Google Research has described TurboQuant and said it will present the method at ICLR 2026, while SwiftLM’s GitHub repository is public and was updated this week with MTP benchmark documentation.

Zhang’s X post remains the primary source for the 2.5x and 56% claims, and the next concrete checkpoint is whether the project publishes fuller benchmark scripts or reproduction notes in the repository. (research.google)
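
Readers who want to sanity-check memory claims like these can start from a back-of-the-envelope estimate of key-value cache size. The sketch below is only that: the layer count, head count and head dimension are hypothetical placeholders, not Gemma 4-26B’s published configuration, and the formula assumes full attention on every layer and ignores any per-block metadata a real quantizer stores. It does not reproduce the 56% figure, which covers overall GPU memory rather than the cache alone.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bits_per_value: int) -> int:
    # Each token stores one key vector and one value vector (hence the 2)
    # per layer and per key-value head.
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value // 8


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_tokens=100_000)
    for bits in (16, 8, 4):
        gib = kv_cache_bytes(bits_per_value=bits, **shape) / 2**30
        print(f"{bits}-bit cache: {gib:.2f} GiB")
```

Because the cache grows linearly with token count, any reduction in bits per stored value scales the whole cache by the same factor, which is why compression matters more as context windows lengthen.
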
