Ollama Optimizes Qwen 3.5

Ollama announced an update that optimizes Qwen 3.5 for Apple Silicon via Metal acceleration, improving local inference performance on Macs and potentially on mobile‑class devices. The move underscores continued tooling work to get frontier models running efficiently on device. (x.com)

Ollama’s model library lists Qwen 3.5 builds from 0.8B through 122B, with many variants published with a 256K context window designation. (ollama.com) Independent speed tests report that a Mac mini M4 (16 GB) running Ollama produces roughly 8–15 tokens per second on Qwen 3.5 7B-class models, outperforming comparable M2 results in the same benchmark set. (runaiguide.com)

Community tooling and Ollama’s Metal work route inference to Apple’s GPU via Metal/MPS. MLX explicitly uses a Metal backend and unified memory to eliminate explicit CPU↔GPU transfers, and has been shown to yield roughly 2× improvements in some Apple‑Silicon setups. (dev.to)

Architectural notes on Qwen 3.5 show that the Medium and A‑series variants reduce the active parameter footprint (the 35B‑A3B reports ~3B active parameters), with published working‑set RAM figures in the ~16–22 GB range for several 27B–35B builds. Those figures materially lower the bar for running larger models on desktop Apple Silicon. (modelfit.io)

During the Qwen 3.5 rush, observers reported Ollama Cloud outages while the local Ollama client kept running models for users, highlighting the value of optimized on‑device inference for availability. (youtube.com)

Ollama’s documentation and how‑to guides show CLI model tags such as qwen3.5:latest and qwen3.5:9b, plus example API endpoints, enabling scripted local deployment, model switching, and integration with local RAG/agent pipelines. (ollama.com)
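As a rough illustration of that kind of scripted use, the sketch below sends a prompt to a locally running Ollama server and estimates tokens per second from the response metadata. It assumes Ollama's default local endpoint (http://localhost:11434/api/generate) and the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming response. The helper names (`build_payload`, `tokens_per_second`, `generate`, `main`) and the prompt are made up for this example, and the model tags are the ones mentioned above, which must already be pulled locally.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """Build a non-streaming generate request for the given model tag."""
    return {"model": model, "prompt": prompt, "stream": False}


def tokens_per_second(result):
    """Estimate generation speed from a non-streaming response.

    eval_count is the number of generated tokens and eval_duration is
    the generation time in nanoseconds.
    """
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / duration_ns * 1e9


def generate(model, prompt, url=OLLAMA_URL, timeout=300):
    """POST the prompt to the local server and return the parsed JSON reply."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def main():
    """Switch between two model tags and report the speed of each."""
    for tag in ("qwen3.5:latest", "qwen3.5:9b"):
        result = generate(tag, "Explain unified memory in one sentence.")
        print(f"{tag}: {tokens_per_second(result):.1f} tokens/s")
```

Calling `main()` with the Ollama server running prints one line per tag. Measured speeds depend on the hardware, the quantization, and the prompt, so they will not necessarily match the benchmark range quoted above.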
