Cider unlocks INT8 on Apple GPUs

- Mininglamp-AI open-sourced Cider this week, a macOS SDK built on MLX that adds true W8A8 and W4A8 inference paths on Apple Silicon.
- The key claim is 1.2–1.9× faster LLM prefill on M5 by tapping Apple’s INT8 TensorOps, which MLX itself doesn’t expose directly.
- That matters because MLX already handled weight-only quantization; Cider adds activation quantization too, making more local inference actually practical.

Apple’s MLX stack has been good at running models on Macs, but it had a hole in the middle. You could quantize weights, shrink memory use, and get decent local inference. But you could not really do the full low-precision trick people want for speed: running both weights and activations in INT8 on the GPU. Cider is the new piece that fills that gap on Apple Silicon, and the interesting part is that it does it by exposing hardware Apple already shipped. (github.com)

### What did Cider actually ship?

Cider is an open-source SDK from Mininglamp-AI for macOS. It sits on top of MLX and adds custom primitives for W8A8 and W4A8 inference, shorthand for low-bit weights plus low-bit activations. In plain English, it gives Apple GPU inference a real INT8 execution path instead of treating quantization mostly as a storage trick. (github.com)

### What was missing in MLX?

MLX already supported weight-only quantization. That helps because smaller weights mean less memory pressure. But the compute path still dequantized those weights to FP16 and ran FP16 matrix multiplies. So you got some of the memory upside without the full compute-speed upside. Cider’s whole pitch is fused quantize-matmul-dequant kernels, so t[…] the model is packed on disk. (github.com)

### Why does INT8 matter so much?

Most of LLM inference is matrix multiplication. If hardware has fast low-precision matrix units, using them can cut latency and memory traffic at the same time. That is especially useful for prefill, the expensive first pass where the model digests the prompt. Prefill is where local workflows often feel sluggish, because every test prompt has to pay that upfront cost before generation even starts. (github.com)

### So what changed on Apple hardware?

Apple’s own ML research post from November 19, 2025 pointed to dedicated matrix-multiplication hardware in the M5 GPU and described MLX support for the new chip’s neural accelerators.
Cider’s repo says the INT8 TensorOps extension is built only on Apple M5 and newer, which strongly suggests the project is targeting capabilities that […] path in upstream MLX. Basically, the hardware was there, but the software plumbing was incomplete. (machinelearning.apple.com)

### How big are the gains?

Mininglamp-AI’s headline number is 1.2–1.9× faster LLM prefill on M5. The repo also says per-channel quantization can hit about 1.8× prefill speedup, while a public demo using its Mano-P local GUI agent on an M5 Pro showed a smaller but still real gain: 2.839 seconds prefill in W8A16 versus 2.519 seconds in W8A8, o[…] which tells you where the win is concentrated. (github.com)

### Why is the gain mostly in prefill?

Because prefill is the dense, highly parallel part of inference. The GPU gets to chew through a big block of matrix math all at once. Decode is more incremental: one token at a time, more bottlenecked by memory movement and control overhead. So a faster INT8 matmul path helps the front-loaded part more than the steady token-by-token […]ng. (github.com)

### Does this change what Macs are good for?

A bit, yes. Macs were already attractive for local model work because of unified memory and MLX’s ergonomics. But they were still missing some of the low-precision inference paths developers expect on CUDA systems. Cider makes Apple Silicon more useful for fast local iteration, especially for prompt-heavy testing, agents, and […]n peak decode bragging rights. (machinelearning.apple.com)

### Bottom line?

Cider is not “Apple invented INT8 inference” news. It is more specific than that. An open-source team found a way to expose an M5-era Apple GPU capability that MLX did not package into a true W8A8 workflow, and the payoff is faster local prefill on Macs. For people building on-device AI features, that is the difference between quantization as a nice compression trick and quantization as an actual speed tool. (github.com)
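
To make the distinction between weight-only quantization and a W8A8 path concrete, here is a minimal sketch in plain Python. It is not Cider's or MLX's code: the function names are hypothetical, the symmetric per-row scaling is an assumption chosen for simplicity, and real kernels run fused on the GPU rather than as Python loops. The only point is where the multiply happens: in floating point after dequantizing the weights, or on integers with a single rescale at the end.

```python
import random

def quantize_rows(mat, qmax=127):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in mat:
        scale = max(abs(v) for v in row) / qmax or 1.0  # fall back to 1.0 for all-zero rows
        q_rows.append([max(-qmax, min(qmax, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def matmul_nt(a, b):
    """Compute a @ b.T for lists of rows (each row of b is one output channel)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def weight_only_linear(x, w_q, w_scales):
    """W8A16-style: INT8 weights are dequantized first, then multiplied in float."""
    w_deq = [[q * s for q in row] for row, s in zip(w_q, w_scales)]
    return matmul_nt(x, w_deq)

def w8a8_linear(x, w_q, w_scales):
    """W8A8-style: quantize activations, multiply integers, rescale once."""
    x_q, x_scales = quantize_rows(x)   # one scale per token (row of x)
    acc = matmul_nt(x_q, w_q)          # integer-only multiply-accumulate
    return [[acc[i][j] * x_scales[i] * w_scales[j] for j in range(len(w_q))]
            for i in range(len(x))]

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(4)]  # 4 tokens, 16 features
w = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(8)]  # 8 output channels
w_q, w_scales = quantize_rows(w)

ref = matmul_nt(x, w)                        # full-precision reference
y_w8a16 = weight_only_linear(x, w_q, w_scales)
y_w8a8 = w8a8_linear(x, w_q, w_scales)

def max_abs_diff(a, b):
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

print("weight-only max error:", max_abs_diff(y_w8a16, ref))
print("W8A8 max error:       ", max_abs_diff(y_w8a8, ref))
```

Both paths approximate the same full-precision result. The difference is that the W8A8 version keeps the inner loop on small integers, which is the part dedicated INT8 matrix hardware can accelerate.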
