Alibaba's Qwen 3.5 Models Optimized for MLX

Alibaba's compact Qwen 3.5 AI models are now optimized for Apple Silicon via the MLX framework. The models, at up to 9 billion parameters, can run locally on Macs, iPhones, and iPads. This enables advanced AI tasks without sending data to the cloud and expands the range of powerful LLMs available for on-device applications.

Apple's MLX framework is engineered to directly leverage the Unified Memory Architecture of its M-series chips, eliminating the need for data to be copied between CPU and GPU memory. This zero-copy operation significantly reduces latency, a key advantage for the iterative development and testing workflows common in iOS and macOS development. The framework's design, inspired by PyTorch and JAX, also includes a NumPy-like API, making it familiar for machine learning researchers and developers.

The Qwen 3.5 small model series is a family of large language models from Alibaba, with sizes ranging from 0.8B to 9B parameters, designed specifically for on-device applications. The 9-billion parameter model, in particular, has been tuned to offer reasoning and logic capabilities comparable to much larger 30B+ parameter models. This focus on "More Intelligence, Less Compute" makes it a strong candidate for integration into applications on the Apple ecosystem.

Performance benchmarks for models running on MLX show a significant throughput advantage compared to other frameworks on Apple Silicon. For instance, the Qwen3.5-35B-A3B model has been reported to run approximately 1.8 times faster on MLX than on Ollama, which uses the llama.cpp backend. Community-reported benchmarks for Qwen 3 models on M-series chips with MLX have shown speeds exceeding 100 tokens per second on an M4 Max.

The optimization of Qwen 3.5 for MLX allows for practical, high-performance on-device AI without constant cloud dependency. Community tests have shown the 2B model running smoothly on an iPhone 17 Pro, achieving 30-50 tokens per second, which is on par with cloud API response times but without the network latency. This enables a range of offline-first, privacy-centric features in mobile applications, from real-time text analysis to more advanced, visually-aware agentic workflows with the 4B multimodal variant.
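
The tokens-per-second figures quoted above are throughput measurements: tokens generated divided by wall-clock time. The article does not say how the community benchmarks were run, so the following is only a minimal sketch of how such a number could be computed. The names `measure_throughput` and `fake_token_stream` are hypothetical, and the stand-in generator simulates a model with a fixed delay rather than calling MLX or any real model.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_throughput(token_stream: Iterable[str]) -> Tuple[int, float]:
    """Consume a stream of tokens and return (token_count, tokens_per_second)."""
    start = time.perf_counter()
    count = 0
    for _ in token_stream:
        count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very fast, empty streams.
    tokens_per_second = count / elapsed if elapsed > 0 else float("inf")
    return count, tokens_per_second


def fake_token_stream(num_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one token every delay_s seconds."""
    for i in range(num_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


if __name__ == "__main__":
    # Simulate a model that emits roughly one token every 25 ms.
    count, tps = measure_throughput(fake_token_stream(40, 0.025))
    print(f"{count} tokens at about {tps:.1f} tokens/s")
```

In practice the stand-in generator would be replaced by whatever streaming interface the inference library exposes. Reported numbers also depend on prompt length, quantization, and whether prompt processing time is included.
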
For developers, the combination of Qwen 3.5 and MLX opens up possibilities for sophisticated on-device fine-tuning and inference. The framework supports quantization, which can reduce a model's memory footprint by up to 75% with 4-bit quantization, making it feasible to run these powerful models on a wider range of Apple devices. This allows for the creation of more intelligent and responsive applications that respect user privacy by keeping data on the device.

The underlying technology that powers MLX's performance on Apple Silicon is the Metal framework. By building on Metal, MLX can efficiently utilize the GPU for machine learning tasks, mapping computational graphs to Metal Performance Shaders. This deep integration with Apple's hardware is what gives MLX its performance edge for models like Qwen 3.5.
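
The "up to 75%" figure for 4-bit quantization follows from simple arithmetic if the baseline is 16-bit weights: 4 bits is one quarter of 16. The sketch below works this out for a 9-billion-parameter model. It rests on assumptions the article does not state: a 16-bit baseline, weights only (no activations or KV cache), and, for the second estimate, a group-wise scheme that stores a 16-bit scale and a 16-bit bias for every 64 weights. The function names are hypothetical.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate bytes needed to store the weights alone."""
    return num_params * bits_per_weight / 8


def footprint_reduction(baseline_bits: float, quantized_bits: float) -> float:
    """Fraction of weight memory saved relative to the baseline precision."""
    return 1 - quantized_bits / baseline_bits


NUM_PARAMS = 9_000_000_000  # the 9B model mentioned in the article

# Ideal case: exactly 4 bits per weight versus an assumed 16-bit baseline.
ideal_saving = footprint_reduction(16, 4)

# Assumed group-wise overhead: one 16-bit scale and one 16-bit bias
# per group of 64 weights adds 32 / 64 = 0.5 bits per weight.
effective_bits = 4 + (16 + 16) / 64
realistic_saving = footprint_reduction(16, effective_bits)

if __name__ == "__main__":
    gb = 1e9  # decimal gigabytes
    print(f"16-bit weights: {weight_bytes(NUM_PARAMS, 16) / gb:.2f} GB")
    print(f"4-bit ideal:    {weight_bytes(NUM_PARAMS, 4) / gb:.2f} GB "
          f"({ideal_saving:.1%} smaller)")
    print(f"4-bit grouped:  {weight_bytes(NUM_PARAMS, effective_bits) / gb:.2f} GB "
          f"({realistic_saving:.1%} smaller)")
```

Under these assumptions the per-group metadata is why the saving is "up to" 75% rather than exactly 75%. Runtime memory will also be higher than the weight total once activations and the context cache are counted.
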

