Qwen3.6 & Apple Inference Wins

- The Qwen3.6-35B-A3B mixture-of-experts model (35B total parameters, 3B active) was open-sourced for agentic workloads.
- Qwen's published benchmarks show strong coding and reasoning performance, while community MLX tests reported Apple's M5 Max outperforming the M3 Ultra for Qwen inference.
- These developments suggest that model routing plus efficient client silicon can materially reduce inference cost and latency for production use.

A new Qwen model is putting a cheaper way to run coding agents in reach: 35 billion parameters on paper, 3 billion active at a time in practice. (qwen.ai)

Qwen said on April 14 that it open-sourced Qwen3.6-35B-A3B, the first open-weight release in the Qwen3.6 family, with weights on Hugging Face and code in the Qwen3.6 GitHub repository. (qwen.ai) (huggingface.co) (github.com)

The model uses a mixture-of-experts design, which works like routing each token to a small subset of specialists instead of waking up the whole network. Qwen’s model card says Qwen3.6-35B-A3B has 35B total parameters, 3B activated, 256 experts, and a native context length of 262,144 tokens. (huggingface.co)

Qwen’s published benchmarks show 73.4 on SWE-bench Verified, 51.5 on Terminal-Bench 2.0, 37.0 on MCPMark, 86.0 on Graduate-Level Google-Proof Question Answering, and 92.7 on the 2026 American Invitational Mathematics Examination set. (qwen.ai) (huggingface.co)

Those scores matter because the current market for “agentic” models is less about chat fluency than about finishing multi-step work inside code repositories, terminals, and tool-using workflows. Qwen said the update focused on front-end workflows, repository-level reasoning, and a feature it calls “thinking preservation” across earlier messages. (github.com) (huggingface.co)

The hardware side of the story is about memory traffic, not just raw chip size. Apple says MLX, its open-source framework for Apple silicon, uses unified memory and can run model operations on CPU or GPU without shuttling data between separate memory pools. (machinelearning.apple.com) Apple said in a November 19, 2025 research post that the M5 chip added “Neural Accelerators” for matrix multiplication, a core operation in large language model inference, and that MLX now uses them on the new M5 systems. (machinelearning.apple.com)

Community benchmark posts circulating this month reported an M5 Max beating an M3 Ultra on MLX inference for a 4-bit Qwen mixture-of-experts model, an outcome that points to software path and memory behavior as much as headline core counts. The public benchmark repository tied to those tests is a community project, not an Apple or Qwen release. (gist.github.com) (github.com)

Alibaba has not open-sourced every Qwen3.6 model. The Information reported on April 17 that the company released a smaller open-source version while shifting more attention toward proprietary models that could bring in revenue. (theinformation.com)

Put together, the release and the Apple-side tests sketch the same operating model: route each token through fewer parameters, keep memory close to the chip, and run more of the workload on a developer’s own machine. (qwen.ai) (machinelearning.apple.com)
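For readers who want to see what "routing each token through fewer parameters" means mechanically, the following is a minimal toy sketch of top-k expert routing in plain Python. It is not Qwen's implementation: the expert count (8), the top-k value (2), the router scores, and the function names `route_token` and `moe_forward` are all illustrative assumptions, and the "experts" here are trivial scaling functions rather than neural network layers. Only the 35B and 3B parameter figures come from the article.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_scores, top_k):
    """Keep the top_k highest-probability experts and renormalize their weights."""
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

def moe_forward(x, experts, router_scores, top_k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    weights = route_token(router_scores, top_k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Toy setup (assumed values): 8 experts, each scaling its input by a different factor.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]

# Hypothetical router scores for one token; a real router computes these from the token.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.7]

weights = route_token(scores, TOP_K)
output = moe_forward(1.0, experts, scores, TOP_K)

# Headline figures from the article: 3B of 35B parameters active per token.
active_fraction = 3 / 35

print("experts used:", sorted(weights))
print(f"share of parameters active: {active_fraction:.1%}")
```

In this toy, only two of the eight experts do any work for the token, which is the same idea behind the article's 3B-of-35B figure: the full set of weights must still be held in memory, but each token only pays the compute cost of the experts it is routed to.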
