Benchmarks show 8-bit quantized large models running efficiently on Apple Silicon using MLX
- Apple Silicon AI tinkerers spent the past two weeks showing that MLX can run surprisingly large local models, including Qwen 3.6 and DeepSeek V4 variants.
- The sharpest claim came from Rapid-MLX, which says its OpenAI-compatible server is 4.2x faster than Ollama and can hit 0.08-second cached TTFT.
- The bigger shift is practical, not theoretical: runtime and quantization choices now decide whether a Mac feels toy-like or genuinely useful.
Local AI on Macs is getting weirdly good. Not because Apple suddenly shipped some secret new chip, but because people are getting much better at squeezing big models through MLX, Apple’s machine-learning stack for Apple Silicon. That changes the real question from “can a Mac run this model?” to “which runtime, which quantization, and how much unified memory do you have?” Over the last two weeks, the most interesting demos and repos all pointed the same way: 8-bit and mixed-precision setups are turning Apple laptops and desktops into credible local inference boxes.

### What is MLX actually doing?

MLX is Apple’s array framework for machine learning on Apple Silicon. The important part is not the branding; it’s the memory model. CPU and GPU can work from the same unified memory pool, so large models don’t pay the same copy penalties they often do on other setups. Apple has been leaning into this for LLM inference, and its own research showed MLX running native and quantized Qwen models on M4 and M5 Macs with meaningful gains from quantization and GPU-side optimizations. (mlx-framework.org)

### Why does 8-bit matter so much?

Because memory is the wall. A model that is too big in BF16 can become runnable in 8-bit, and sometimes comfortably runnable in 4-bit or mixed formats. That is the whole trick. Qwen’s official Hugging Face releases now include MLX 8-bit variants, and community builders are doing the same for larger vision and language models. The point is not just “smaller files.” Fewer bits means less bandwidth pressure, which matters a lot on Apple Silicon, where large-model inference often ends up memory-bandwidth-bound. (mlx-framework.org)

### Why are people talking about Qwen 3.6?

Because it is a clean demo of the new ceiling. Recent YouTube benchmarks focused on Qwen 3.6 27B in MLX-friendly or Mac-oriented setups, framing it as a model that feels much more “flagship” than older local dense models while still being realistic on high-end Apple hardware.
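To make the "memory is the wall" point concrete for a 27B-parameter model, here is a back-of-envelope sketch. It is only an estimate under stated assumptions: it counts weights alone (no KV cache, activations, or the per-group scale overhead that real quantized formats carry), and the 400 GB/s bandwidth figure is an illustrative placeholder, not a measured number for any particular Mac. The function names are made up for this example.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10**9 bytes).

    Assumption: weights only; KV cache, activations, and quantization
    scale/bias overhead are ignored.
    """
    return n_params * bits_per_weight / 8 / 1e9


def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a dense model.

    Assumption: generation is memory-bandwidth-bound and every weight
    is read once per generated token.
    """
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    n_params = 27e9          # the 27B dense model size discussed above
    bandwidth_gb_s = 400.0   # hypothetical unified-memory bandwidth

    for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
        gb = weight_memory_gb(n_params, bits)
        ceiling = decode_ceiling_tok_s(gb, bandwidth_gb_s)
        print(f"{label:>5}: ~{gb:.1f} GB weights, <= {ceiling:.1f} tok/s ceiling")
```

Under these assumptions the weights come to roughly 54 GB in BF16, 27 GB in 8-bit, and 13.5 GB in 4-bit, and the bandwidth-bound ceiling on decode speed scales inversely with those sizes. Real throughput will sit below the ceiling.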
There is also a growing benchmark ecosystem around the Qwen family on MLX, including repos that test the full range from tiny to 35B-class models on Macs. (huggingface.co) That gives developers something better than vibes: they can compare speed, memory use, and quality tradeoffs directly.

### What about the Rapid-MLX claim?

Rapid-MLX is the loudest runtime story right now. Its GitHub repo pitches it as an OpenAI-compatible local server for Apple Silicon, with claims of 4.2x speed over Ollama, 0.08-second cached time-to-first-token, prompt caching, and tool-calling support. The catch is that these are project benchmarks, not a neutral industry bake-off. But even taken cautiously, the repo captures the real shift: developers are no longer just comparing chips. (youtube.com) They are comparing serving layers, cache behavior, and streaming pipelines.

### Why does DeepSeek V4 on a Mac matter?

Because it shows how far quantization can stretch the category. Experimental projects and model ports are now targeting DeepSeek V4 Flash on 128GB Macs, including 2-bit and mixed-precision approaches aimed specifically at fitting routed experts into Mac memory budgets. One llama.cpp fork says that target out loud: 128GB MacBooks with 2-bit quantization of routed experts. That is not mainstream, and it is definitely hacky. (github.com) But it is a real signal that “big MoE model on a Mac” has moved from joke to engineering problem.

### So is Apple Silicon winning?

Not exactly. Nvidia boxes still dominate raw throughput and flexibility. But that is not the contest these demos are trying to win. The Mac pitch is simpler: quiet machine, one memory pool, no cloud bill, and enough local performance for coding, agents, search, and multimodal work if you choose the right model format. Basically, Apple Silicon is becoming the easiest place to do serious local inference without turning your desk into a space heater. (github.com)

### What is the real lesson here?
The lesson is that “model size” is no longer the headline number. Runtime, quantization, cache behavior, and memory layout matter just as much. Two Macs with the same chip can feel very different depending on whether the model is BF16, 8-bit, 4-bit, or a mixed expert format, and whether the runtime is MLX-native or just passing through. That is why these benchmarks matter. They are mapping the difference between technically possible and actually pleasant. (machinelearning.apple.com)

### Bottom line?

Apple Silicon did not magically become a datacenter. But MLX plus aggressive quantization is making large local models feel normal on Macs much faster than most people expected. (mlx-framework.org) (machinelearning.apple.com)
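If you want to check time-to-first-token and cache-behavior claims on your own machine rather than take project numbers on faith, a small timing helper is enough. The sketch below is a minimal, runtime-agnostic illustration: `time_stream` and `fake_stream` are hypothetical names invented here, and the fake stream only simulates a server with `time.sleep`. It does not call Rapid-MLX, Ollama, or any real API; with a real OpenAI-compatible server you would pass in whatever iterator of streamed text chunks your client gives you.

```python
import math
import time
from typing import Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # seconds until the first non-empty chunk (NaN if none)
    total_s: float  # seconds until the stream is exhausted
    chunks: int     # number of non-empty chunks received


def time_stream(chunks: Iterable[str]) -> StreamTiming:
    """Consume a stream of text chunks and record basic latency numbers."""
    start = time.perf_counter()
    first = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else math.nan
    return StreamTiming(ttft, end - start, count)


def fake_stream(prefill_s: float, n_tokens: int, per_token_s: float) -> Iterator[str]:
    """Stand-in for a streaming server: a prefill delay, then evenly spaced tokens."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        yield f"tok{i} "
        time.sleep(per_token_s)


if __name__ == "__main__":
    # Simulated "cold" vs "cached" prompt; the delays are arbitrary examples.
    cold = time_stream(fake_stream(prefill_s=0.30, n_tokens=20, per_token_s=0.01))
    warm = time_stream(fake_stream(prefill_s=0.05, n_tokens=20, per_token_s=0.01))
    print(f"cold TTFT {cold.ttft_s:.3f}s, warm TTFT {warm.ttft_s:.3f}s")
```

Running the same prompt twice against a real server and comparing the two TTFT values is a simple way to see whether prompt caching is actually kicking in; the timer starts before the first chunk is requested, so prefill time is included.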