Qwen3.6 quant runs on M‑series

- Developers are packaging Alibaba’s Qwen3.6 models into Apple MLX formats, letting recent M-series Macs run new local chat and vision workloads.
- One Qwen3.6-35B-A3B MLX build shrinks to about 9 gigabytes at 2-bit, while vllm-mlx reports up to 4.3x throughput scaling under load.
- The work extends Apple Silicon’s role in local inference and on-device filtering for private text pipelines. (arxiv.org)

Large language models are being squeezed to fit on Mac laptops by compressing their weights and reworking the software stack around Apple’s chips. (mlx-framework.org) (huggingface.co)

The model at the center of this wave is Qwen3.6, Alibaba’s open-weight family that now includes a 35 billion-parameter mixture-of-experts version with about 3 billion parameters active per token. That design means the model is large on paper but lighter at runtime than a dense model of the same size. (docs.vllm.ai) (huggingface.co)

To make that practical on a Mac, developers are quantizing the model, which means storing numbers with fewer bits so it uses less memory. One recent MLX release of Qwen3.6-35B-A3B lists an approximate size of 9 gigabytes in 2-bit form. (huggingface.co)

That same model card says its RotorQuant cache method delivers 5.3 times faster prompt prefill and 28 percent faster decode than TurboQuant, another compression scheme for the running memory state. The tradeoff is accuracy risk at the most aggressive settings, which the card says is best suited to experimentation and hardware-constrained setups. (huggingface.co)

The other half of the story is serving software. Wayner Barrios’s open-source vllm-mlx project uses Apple’s MLX framework to run text, vision, audio and embedding models behind OpenAI-compatible and Anthropic-compatible interfaces on M1 through M4 Macs. (github.com) (pypi.org)

In a January 2026 paper, the vllm-mlx team reported text throughput 21 percent to 87 percent higher than llama.cpp on tested models, plus continuous batching that scaled aggregate throughput by as much as 4.3 times at 16 concurrent requests. The same paper reported up to 525 tokens per second on Apple M4 Max for text workloads. (arxiv.org)

For image-heavy assistants, the paper adds a cache that reuses work when the same image appears again. The authors reported a 28 times speedup on repeated image queries and a drop in multimodal latency from 21.7 seconds to under 1 second on Apple M4 Max. (arxiv.org)

Privacy tools are moving onto the same hardware. OpenAI published Privacy Filter on April 23, 2026 as a 1.5 billion-parameter token-classification model with 50 million active parameters, built to detect and mask eight categories of personal or secret data in a single pass. (github.com)

A separate Apple Silicon port, opf-privacy-filter-mlx, wraps that model in a local FastAPI server for M1 through M4 Macs. Its sample output shows a redaction response in 42.5 milliseconds for a short test string, though that figure comes from the project README rather than an independent benchmark. (github.com)

Independent testing on a 16-gigabyte M2 MacBook Pro found OpenAI Privacy Filter used about 2.8 gigabytes in BF16 mode and posted roughly 1 second median inference on Apple’s Metal Performance Shaders backend. That benchmark also found misses on some AWS keys, MongoDB and Redis URIs, and some names and phone formats. (instavar.com) (github.com)

Put together, the shift is less about one viral benchmark than about a stack coming into focus: compressed Qwen3.6 weights, MLX-native runtimes, and local privacy filters that can sit in front of cloud tools. On Apple Silicon, the laptop is starting to look more like the first server. (mlx-framework.org) (arxiv.org)
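
For readers who want to see why fewer bits means less memory, the quantization step described above can be sketched in a few lines of Python. This is a generic uniform (round-to-nearest) quantizer written for illustration only; it is not the scheme used by the MLX build or by the cache methods named on the model card, and the function names are hypothetical. The size estimate counts raw weight bits only, so real files, which also carry scales and other metadata, need not match it exactly.

```python
def approx_weight_gigabytes(num_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in decimal gigabytes (no scales, metadata, or runtime overhead)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize(values, bits):
    """Map floats onto 2**bits evenly spaced levels between min and max (illustrative only)."""
    levels = 2 ** bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero step when all values are equal
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]


# Back-of-the-envelope footprint for a 35 billion-parameter model.
for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: {approx_weight_gigabytes(35e9, bits):.2f} GB")

# Toy example: five made-up weights squeezed into 2 bits each.
weights = [-0.9, -0.2, 0.05, 0.4, 1.1]
codes, scale, lo = quantize(weights, bits=2)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max rounding error {max_error:.3f}")
```

At 2 bits the raw estimate for 35 billion weights is 8.75 GB, in the same range as the roughly 9 gigabytes the model card lists. The toy example also shows the cost: each weight is snapped to one of only four values, which is where the accuracy risk at aggressive settings comes from.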
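
The image cache mentioned above can be pictured as a lookup keyed on the image's content. The sketch below assumes the cache is keyed by a SHA-256 hash of the raw image bytes and stores the encoder's output; the source does not describe how vllm-mlx actually keys or stores its entries, so the class, the stand-in encoder, and every name here are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


class ImageFeatureCache:
    """Reuse encoder output when the exact same image bytes are seen again (illustrative)."""

    def __init__(self, encode: Callable[[bytes], List[float]]):
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, image_bytes: bytes) -> List[float]:
        # Identical bytes always hash to the same key, regardless of filename.
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        features = self._encode(image_bytes)  # the expensive step being avoided on repeats
        self._store[key] = features
        return features


encoder_calls = 0


def fake_encoder(image_bytes: bytes) -> List[float]:
    """Stand-in for a vision encoder; counts how often it is actually run."""
    global encoder_calls
    encoder_calls += 1
    return [len(image_bytes) / 100.0]


cache = ImageFeatureCache(fake_encoder)
photo = b"pretend these are JPEG bytes"
first = cache.get(photo)
second = cache.get(photo)            # served from the cache
other = cache.get(b"a different image")
print(f"encoder runs: {encoder_calls}, hits: {cache.hits}, misses: {cache.misses}")
```

Under these assumptions the second request for the same picture skips the encoder entirely, which is the kind of saving that would explain a large speedup on repeated image queries.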
