Apple-silicon benchmarks show big on-device gains
New on-device benchmark posts say MLX delivered roughly 2.6x faster inference than PyTorch on Apple hardware by taking advantage of unified memory, and separate tests report 75 to 82 tokens/sec decode on an M2 Pro using optimized Swift runtimes. Those results suggest that the substantial wins come from stack optimizations (memory use, runtime, and compiler), not just raw silicon cycles. The numbers underline how full-stack software engineering can unlock practical performance gains on Apple silicon. (x.com, x.com)
A large language model on a laptop does not spend most of its time “thinking.” It spends most of its time dragging billions of weight values out of memory, one token at a time, so the speed limit is often the memory system, not the math unit. (apple.com 1) (apple.com 2)

Apple’s chips use unified memory, which means the central processor and graphics processor read the same pool of RAM rather than copying tensors back and forth, like two cooks sharing one counter instead of walking dishes across the kitchen. Apple says MLX was built specifically for that unified-memory design on Apple silicon. (apple.com) (github.com)

PyTorch on Mac uses the Metal Performance Shaders backend, which maps PyTorch graphs onto Apple’s graphics stack, but Apple still describes that backend as beta and points developers to nightly builds for the newest support. That makes the comparison less about one chip versus another and more about one software path versus another on the same hardware. (apple.com)

MLX also does lazy computation, which means it waits to materialize arrays until the result is actually needed, the way a restaurant delays firing a dish until the table is ready. Apple lists lazy computation and graph optimization as core MLX features. (github.com) (apple.com)

That is why recent Apple-silicon benchmark posts are getting attention: the headline numbers are not coming from a brand-new chip, but from changing the framework, runtime, and compiler around the same chip. In one public benchmark repository, MLX cut Whisper inference from 31.99 seconds to 8.50 seconds on an M1 Pro and cut TinyLlama inference from 59.27 seconds to 33.38 seconds on the same machine. (github.com)

The same repository shows the gap holding up on faster chips instead of disappearing. On an M3 Max, Whisper inference fell from 17.90 seconds in PyTorch to 4.85 seconds in MLX, and TinyLlama inference fell from 36.18 seconds to 15.41 seconds. Whisper’s speedup stays near 3.7x on both machines, while TinyLlama’s grows from about 1.8x to about 2.3x. (github.com)

The newer posts go one step further by focusing on token decode speed, which is the rate at which a model emits text after the prompt is already loaded. That number is what you feel when a local chatbot streams words onto the screen. (x.com)

One of those tests reports roughly 75 to 82 tokens per second on an M2 Pro using an optimized Swift runtime. Apple’s own MLX materials matter here because Apple says MLX has first-party Swift bindings, so the optimization is happening close to the platform’s native toolchain rather than through a generic cross-platform layer. (x.com) (apple.com)

The M2 Pro itself is not a tiny chip: Apple says it offers 200 gigabytes per second of unified-memory bandwidth and up to 32 gigabytes of unified memory. When a workload is bottlenecked by moving weights, those bandwidth numbers are the pipe size, and better software decides how much of that pipe you actually use. (apple.com)

Apple has been pushing that full-stack idea in public for a while. Its 2025 developer session on MLX highlighted unified memory, lazy computation, function transforms, and Swift APIs, and its later research on the M5 family showed MLX tapping Metal 4 tensor operations and new neural-accelerator features to speed up inference further. (apple.com 1) (apple.com 2)

So the surprise in these benchmark posts is not that Apple laptops can run models locally. The surprise is that the same silicon can look much faster when the framework stops fighting the memory layout and the runtime is tuned for the machine it is sitting on. (x.com) (apple.com)
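
For readers who want to check the arithmetic, here is a minimal Python sketch of the two calculations behind the figures above: the PyTorch-to-MLX speedup ratios from the quoted timings, and a rough memory-bandwidth ceiling on decode speed. The function names are illustrative, not part of MLX or PyTorch. The ceiling formula is a deliberate simplification that assumes every generated token streams a fixed number of gigabytes from memory and ignores compute time, caching, and KV-cache traffic. The 2 GB model size is a hypothetical value chosen only for illustration.

```python
"""Back-of-envelope arithmetic for the benchmark figures quoted in the article."""

# Inference times in seconds as quoted in the article: (PyTorch, MLX).
TIMINGS = {
    ("M1 Pro", "Whisper"): (31.99, 8.50),
    ("M1 Pro", "TinyLlama"): (59.27, 33.38),
    ("M3 Max", "Whisper"): (17.90, 4.85),
    ("M3 Max", "TinyLlama"): (36.18, 15.41),
}

# Unified-memory bandwidth Apple states for the M2 Pro, in GB/s.
M2_PRO_BANDWIDTH_GB_S = 200.0

# HYPOTHETICAL weight footprint, used only to illustrate the ceiling formula.
HYPOTHETICAL_MODEL_GB = 2.0


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_s / optimized_s


def decode_ceiling(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on tokens/sec if each token must stream gb_read_per_token GB.

    Simplified model: memory traffic is the only cost considered.
    """
    return bandwidth_gb_s / gb_read_per_token


def max_gb_per_token(bandwidth_gb_s: float, tokens_per_s: float) -> float:
    """Largest per-token memory traffic (GB) compatible with a decode rate."""
    return bandwidth_gb_s / tokens_per_s


if __name__ == "__main__":
    # Speedup ratios implied by the quoted timings.
    for (chip, model), (torch_s, mlx_s) in TIMINGS.items():
        print(f"{chip} {model}: {speedup(torch_s, mlx_s):.2f}x")

    # Memory traffic per token implied by the reported M2 Pro decode rates.
    for rate in (75, 82):
        budget = max_gb_per_token(M2_PRO_BANDWIDTH_GB_S, rate)
        print(f"{rate} tok/s at 200 GB/s -> at most {budget:.2f} GB per token")

    # Ceiling for the hypothetical 2 GB model.
    ceiling = decode_ceiling(M2_PRO_BANDWIDTH_GB_S, HYPOTHETICAL_MODEL_GB)
    print(f"Hypothetical 2 GB model: ceiling of {ceiling:.0f} tok/s")
```

Under this simplified model, a reported 75 to 82 tokens per second on a 200 GB/s machine implies at most about 2.67 to 2.44 GB of memory traffic per token. That is the sense in which bandwidth is the pipe size: a runtime cannot beat the ceiling, only get closer to it.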