SuperQwen3.6‑35B Tuning

- A tuned, uncensored variant called SuperQwen3.6‑35B‑DFlash‑MLX was announced with strong benchmark results.
- Developers indicated an MLX 4‑bit version is incoming, promising efficient local inference with minimal quality loss.
- That progress suggests tuned, high‑performing models are moving toward low‑bit quantization for edge and desktop use. (x.com)

A language model predicts the next token, one step at a time. Quantization stores the model's weights in fewer bits, like shrinking a photo file so it fits on a laptop. Apple’s MLX framework is one of the main ways people run those compressed models locally on Apple Silicon Macs. (huggingface.co 1) (huggingface.co 2)

Qwen3.6-35B-A3B, released on Hugging Face last week, is a 35-billion-parameter open-weight model with about 3 billion parameters activated per token and a native context length of 262,144 tokens. Qwen said the model was built as the first open-weight Qwen3.6 release after the February Qwen3.5 series. (huggingface.co)

A separate add-on called Qwen3.6-35B-A3B-DFlash appeared on Hugging Face six days ago from Z Lab. Its model card says DFlash is a speculative decoding method that uses a lightweight block-diffusion draft model to propose multiple tokens in parallel. The draft model is designed to be paired with the base Qwen3.6-35B-A3B model. (huggingface.co)

Z Lab’s posted tests used a single NVIDIA B200 and SGLang with thinking enabled and 4,096-token outputs. In those results, DFlash raised Math500 throughput from 234 to 682 tokens a second at concurrency 1 and from 1,266 to 3,138 at concurrency 8, while HumanEval throughput rose from 238 to 603 tokens a second at concurrency 1. (huggingface.co)

The tuning story sits on top of that base model and speed stack: developers are taking Qwen3.6, changing its behavior with fine-tuning or “abliteration,” and then repackaging it for MLX so it can run on Macs without a cloud server. Hugging Face already shows several Qwen3.6 MLX variants, including uncensored or “Heretic” builds and a generic 4-bit conversion. (huggingface.co 1) (huggingface.co 2) (huggingface.co 3)

That matters because the storage gap is large. The MLX Community’s Qwen3.6-35B-A3B 4-bit conversion lists a 20.4 GB package, while the base Qwen3.6 release is a full 35B-parameter checkpoint intended for frameworks such as Transformers, vLLM, SGLang, and KTransformers. (huggingface.co 1) (huggingface.co 2)

The same pattern showed up one generation earlier. A Qwen3.5-35B-A3B 4-bit MLX conversion was already available last month at 20.4 GB, and independent benchmark projects were comparing local MLX engines for that class of model on Apple hardware. (huggingface.co) (github.com)

Apple-side tooling has also been catching up. A Rust MLX inference engine for Qwen3.5-35B-A3B reported 112 tokens a second for single-user decode and more than 200 tokens a second aggregate on an M3 Ultra, showing the local-serving stack is being tuned alongside the models themselves. (github.com)

The immediate next step is plain from the model listings: more Qwen3.6 derivatives are being quantized into 4-bit, mixed-bit, and other compressed formats within days of the base release. That is turning a newly released 35B-class model into something developers can test on a desktop instead of waiting for rented GPUs. (huggingface.co)
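
The storage and throughput figures quoted above can be sanity-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions that are not taken from the model cards: exactly 35 billion parameters, an unquantized checkpoint stored at 16 bits per weight, and 1 GB counted as 10^9 bytes. The function names are hypothetical helpers, not part of MLX or any other library.

```python
# Back-of-the-envelope checks for the figures quoted in the article.
# Assumptions (not from the model cards): exactly 35e9 parameters,
# a 16-bit unquantized checkpoint, and 1 GB = 1e9 bytes.

PARAMS = 35e9  # assumed total parameter count
GB = 1e9       # assumed bytes per gigabyte (decimal)


def weight_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage, in GB, at a given bit width."""
    return params * bits_per_weight / 8 / GB


def effective_bits(size_gb: float, params: float) -> float:
    """Average bits per weight implied by a package size."""
    return size_gb * GB * 8 / params


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Throughput ratio between an accelerated and a baseline run."""
    return accelerated_tps / baseline_tps


if __name__ == "__main__":
    print(f"16-bit weights:  {weight_size_gb(PARAMS, 16):.1f} GB")
    print(f"flat 4-bit:      {weight_size_gb(PARAMS, 4):.1f} GB")
    print(f"20.4 GB package: {effective_bits(20.4, PARAMS):.2f} bits/weight")

    # Throughput pairs (tokens per second) as quoted from Z Lab's posted tests.
    runs = {
        "Math500, concurrency 1": (234, 682),
        "Math500, concurrency 8": (1266, 3138),
        "HumanEval, concurrency 1": (238, 603),
    }
    for name, (base, fast) in runs.items():
        print(f"{name}: {speedup(base, fast):.2f}x")
```

Under these assumptions, 16-bit weights come to roughly 70 GB and a flat 4-bit encoding to 17.5 GB, so the 20.4 GB listing implies about 4.66 bits per weight on average. The extra over 4 bits is consistent with quantized formats storing per-group scaling metadata and possibly keeping some tensors at higher precision, though the listing itself does not break this down. The quoted DFlash throughput gains work out to roughly 2.9x, 2.5x, and 2.5x.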
