Cider inference framework ships for Apple Silicon
- Mininglamp-AI published Cider on GitHub, an Apple Silicon inference project that plugs into MLX and targets faster local LLM prefill on newer Macs.
- The repo’s benchmark claims 1.15–1.21× speedups on Qwen3-VL-2B on an M5 Pro, while an experimental path shows up to 1.17× on M4.
- It matters because MLX is Apple’s main open framework here, and Cider is trying to unlock hardware paths MLX does not expose yet.

Apple Silicon inference is getting weirdly interesting. Not because Apple shipped a new official framework today (it didn’t), but because a third-party project called Cider landed with a very specific pitch: make local LLM inference on Macs faster by using hardware paths that MLX doesn’t fully tap yet. The stakes are simple. If this works, developers can do more of the build-test loop on one Mac instead of bouncing jobs to the cloud. The gap is also clear: Apple has good local ML tooling, but some low-level acceleration paths still look underused. (github.com)

### What is Cider, exactly?

Cider is an open-source project from Mininglamp-AI. It is not an Apple release, and that distinction matters because some early chatter blurred the line. The repo describes two main ideas: a native W8A8 (8-bit weights, 8-bit activations) inference path for Apple Silicon, and an experimental split path for prefill that combines the Apple Neural Engine (ANE) with the GPU. I[…] Mac hardware. (github.com)

### Why does “prefill” matter so much?

LLM inference has two very different phases. Prefill is when the model chews through the whole prompt and builds its internal state. Decode is the token-by-token generation after that. Prefill is the expensive part for coding agents, long context, and multimodal prompts, basically t[…] squarely at that bottleneck. (github.com)

### What changed this week?

The concrete news is the public GitHub release and benchmark write-up. Cider’s benchmark page says it reaches 1.15–1.21× prefill speedups over a W8A16 baseline on Qwen3-VL-2B running on an M5 Pro, while keeping perplexity roughly flat in its Llama-3-8B check. A separate experimental README shows up to 1.17× on M4. These are modest gains, but they are real enough to matter in iterative local workflows. (github.com)

### What is the trick?

The core trick is not “better prompts” or a new model. It is quantization plus hardware utilization.
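
To make the difference between a W8A16 baseline and a W8A8 path concrete, here is a minimal sketch in plain Python. It is not Cider or MLX code. The function names are hypothetical, and symmetric per-tensor scaling is an assumption chosen for brevity. The point is only where the arithmetic happens: the W8A16-style path turns the 8-bit weights back into floats before multiplying, while the W8A8-style path also quantizes the activations and keeps the multiply-accumulate in integers, rescaling once at the end.

```python
# Illustrative sketch only: plain Python, not Cider's or MLX's implementation.
# Assumes symmetric per-tensor INT8 quantization (one scale per vector).

def quantize_int8(values):
    """Map floats to integer codes in [-127, 127]; return (codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dot_w8a16(w_codes, w_scale, activations):
    """W8A16-style: dequantize the weights, then multiply in floating point."""
    weights = [c * w_scale for c in w_codes]
    return sum(w * a for w, a in zip(weights, activations))

def dot_w8a8(w_codes, w_scale, activations):
    """W8A8-style: quantize activations too, accumulate in integers, rescale once."""
    a_codes, a_scale = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(w_codes, a_codes))  # integer-only accumulate
    return acc * w_scale * a_scale

# Made-up example values, not taken from any benchmark.
weights = [0.50, -0.25, 0.125, 1.0]
activations = [1.0, 2.0, -3.0, 0.5]

w_codes, w_scale = quantize_int8(weights)
reference = sum(w * a for w, a in zip(weights, activations))
y_w8a16 = dot_w8a16(w_codes, w_scale, activations)
y_w8a8 = dot_w8a8(w_codes, w_scale, activations)

print("float reference:", reference)
print("W8A16-style    :", y_w8a16)
print("W8A8-style     :", y_w8a8)
```

Both results land close to the floating-point reference. The W8A8-style path picks up a little extra rounding error from quantizing the activations, which is why a perplexity check matters alongside any speed claim. The speed argument depends on hardware that runs the integer multiply-accumulate faster than the floating-point one, which a Python sketch cannot show.
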
Cider says Apple’s M5 added INT8 TensorOps with 2× the TOPS of FP16, but that MLX’s public quantized path still dequantizes weights and runs the matmul in FP16 rather than end-to-end INT8. So Cider a[…]e that faster path. Think of it like finding a second lane on the highway that exists physically but isn’t open to traffic in the default stack. (github.com)

### Is this an MLX killer?

No. It is more like an MLX sidecar. Cider’s own docs say the contribution is not a novel raw matmul kernel, and its experimental ANE path is built as MLX custom primitives. That means MLX still looks like the base layer for a lot of this work. Cider is trying to expose performance that MLX either does not expose publicly yet or does not route end-to-end in the same way. (github.com)

### What’s the catch?

There are a few. The ANE path is explicitly labeled an experimental research prototype. It uses private APIs, is tested on M4, and the repo warns that M5 ANE API changes may break it. Even the stronger M5 story is benchmark-driven and narrow: mostly prefill, specific models, specific quantization settings. So this is promising engineering, not a settled new standard. (github.com)

### Why should developers care?

Because local AI development is increasingly about iteration speed, not just peak tokens per second. If a Mac can run prompt-heavy coding or multimodal loops faster without shipping data to a server, teams get cheaper experiments, tighter privacy, and fewer cloud dependencies. Apple has already been pushing on-device an[…]ck. Cider shows there is still room for outsiders to squeeze more out of the same machines. (github.com)

### Bottom line?

The real story is not that Apple “shipped Cider.” It’s that a third-party team just demonstrated there may be unused inference headroom on Apple Silicon, and local AI developers are going to chase it. (github.com)