Community build tunes gpt-oss-20b Mixture-of-Experts model for Apple silicon (gpt-oss-20b-tq3)
- A new community release, manjunathshiva’s gpt-oss-20b-tq3, repackages OpenAI’s open-weight gpt-oss-20b into a 3-bit TurboQuant build aimed squarely at Macs.
- The key number is size: about 9.5 GB on disk and roughly 11 GB peak wired RAM, with reported 60–80 tokens per second.
- That matters because 20B-class MoE models usually feel too heavy for laptops, but Apple-tuned quantization is making local inference look practical.

A 20B model on a laptop used to mean compromise, or a lot of patience. This release changes the trade a bit. A community builder, manjunathshiva, posted a new quantized version of OpenAI’s open-weight gpt-oss-20b called gpt-oss-20b-tq3, and the whole point is simple: make a fairly serious Mixture-of-Experts model run locally on Apple silicon without needing workstation-class hardware.

### What actually got released?

This is not a brand-new foundation model from OpenAI. The base is OpenAI’s existing gpt-oss-20b, listed at 21B total parameters with 32 experts and about 3.6B parameters active at a time, repackaged into a TurboQuant 3-bit format for MLX-based local inference on Macs. The Hugging Face repo for gpt-oss-20b-tq3 went up this week, with model files totaling just under 10 GB.

### Why does “Mixture-of-Experts” matter here?

A Mixture-of-Experts model does not light up the whole network for every token. It routes each token through a subset of experts, which is why a model with 21B total parameters can behave more like a much smaller active model at runtime. You get access to a larger knowledge-and-capacity budget without paying the full computational cost on every token. The full set of experts still has to sit in memory, though, which is why the model is a good target for aggressive local quantization.

### What’s special about the tq3 version?

The trick is TurboQuant at 3 bits. The model card describes a data-free quantization setup using Hadamard rotation and a Lloyd-Max codebook, with group size 64. In plain English, the weights get compressed much harder than in standard 4-bit or 8-bit community builds, but in a way meant to preserve enough structure to stay usable. That compression reduces peak wired RAM during decode, to roughly 11 GB by the repo’s numbers. That is the difference between “maybe on a high-end desktop” and “actually on a 16 GB Mac.”

### Why is Apple silicon the angle?

Because MLX is Apple’s machine learning framework for Apple silicon, and the release is tuned around that environment.
The repo says it runs on M1, M2, M3, and M4 systems with 16 GB or more of unified memory, using `turboquant-mlx-full` and `mlx-lm`. That matters more than it sounds. A lot of open model releases are technically local but practically GPU-first. This one is trying to meet the hardware people already own.

### How fast is it really?

The posted numbers are surprisingly solid for the size class: 60–80 tokens per second on M-series Macs, with up to 73 tok/s specifically called out on an M4 Max using an fp16 KV cache. Those are the repo’s own numbers, so treat them as close to best case. But even if real-world chat use lands lower, the headline is still that a compressed 20B-class MoE can feel interactive on consumer Apple hardware.

### Is this an official OpenAI release?

No. The base model is OpenAI’s, but this tq3 build is a community quantization on Hugging Face. That distinction matters. OpenAI shipped the original gpt-oss-20b, while people in the ecosystem are now racing to make it smaller, faster, or easier to run in specific runtimes and formats like MLX, GGUF, and LM Studio. This release is part of that second wave, the optimization wave.

### What’s the catch?

Compression always costs something. The model card itself hints at the boundary by saying gpt-oss-20b sits near the edge for multi-step reasoning. So this is not a free lunch. Think of it like folding a big paper map into your pocket: you keep most of the utility, but some detail gets harder to read. The win is accessibility, not perfection.

### Bottom line?

The interesting part is not just one Hugging Face upload. It’s the direction. Open-weight MoE models plus Apple-specific quantization are making “run it locally on a Mac” feel less like a demo and more like a real deployment choice.
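
### Does the size math roughly check out?

The size figures quoted earlier can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate: it takes the 21B total parameters, 3-bit width, and group size 64 from the article, and assumes (hypothetically) one 16-bit scale stored per group. The function name `quantized_size_gb` and the `scale_bits` parameter are illustrative and do not come from the repo; the actual tq3 storage layout may differ.

```python
# Back-of-envelope size check for a group-wise quantized model.
# ASSUMPTION: one 16-bit scale per group of 64 weights. This is an
# illustrative guess, not a documented detail of the tq3 format.

def quantized_size_gb(total_params, bits_per_weight, group_size=64, scale_bits=16):
    """Estimate weight storage in decimal GB for group-wise quantization."""
    # Per-group metadata is amortized over every weight in the group.
    effective_bits = bits_per_weight + scale_bits / group_size
    return total_params * effective_bits / 8 / 1e9


TOTAL_PARAMS = 21e9    # total parameters, as listed in the article
ACTIVE_PARAMS = 3.6e9  # parameters active per token, as listed in the article

if __name__ == "__main__":
    for bits in (3, 4, 8):
        size = quantized_size_gb(TOTAL_PARAMS, bits)
        print(f"{bits}-bit estimate: ~{size:.1f} GB")
    # Share of the network that does work on any single token.
    print(f"active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.0%}")
```

Under these assumptions the 3-bit estimate comes to about 8.5 GB, somewhat below the roughly 9.5 GB reported on disk. The sketch does not explain the gap; it may come from tensors kept at higher precision or from other metadata, which the article does not detail.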