Alibaba LLM Now Runs On-Device on iPhone
Alibaba's Qwen 3.5 model is now running fully on-device on the iPhone 17 Pro. The optimized 2-billion-parameter model reportedly outperforms larger models in visual understanding, a sign of significant progress in on-device AI.
Alibaba's Qwen model family, also known as Tongyi Qianwen, includes a wide range of open-weight large language models. The Qwen 3.5 small model series, with parameter counts from 0.8B to 9B, is designed specifically for on-device applications and prioritizes computational efficiency. This push toward smaller, more efficient models reflects an industry trend summed up as "More Intelligence, Less Compute."
The 2-billion-parameter model is part of a series optimized for high-throughput, low-latency tasks on edge devices like smartphones. These smaller models are reportedly made compatible with mobile chips by optimizing the dense token training process to reduce their memory footprint. The earlier Qwen2 series in the same family uses a Transformer architecture with efficiency-oriented enhancements such as SwiGLU activation and grouped-query attention.
On-device benchmarks for the iPhone 17 Pro and its A19 Pro chip show significant speed improvements for local LLMs. Specific numbers for the Qwen 3.5 2B model aren't public, but a 4-bit quantized Qwen3 0.6B model has been clocked at nearly 70 tokens per second on the device. The neural accelerators built into the A19 Pro's GPU cores are a key factor in this performance leap.
The visual understanding capabilities stem from a move toward native multimodality in models like Qwen-VL. Instead of using adapters to connect separate vision and language models, the native approach processes visual and text tokens together from the early stages of training. This results in better spatial reasoning and optical character recognition than adapter-based systems achieve.
Alibaba's broader strategy for the Qwen series involves aggressive scaling along several dimensions, including plans for models with up to ten trillion parameters and context lengths of 100 million tokens. Many Qwen models have been open-sourced on platforms like Hugging Face, where they have passed 40 million downloads, fostering widespread adoption and development.
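For a rough sense of why parameter count and 4-bit quantization matter on a phone, the memory needed for a model's weights can be estimated as parameters times bits per weight. The sketch below applies that back-of-the-envelope arithmetic to the model sizes mentioned in this article. The function name `weight_footprint_gb` is illustrative, and the estimate is an assumption-laden lower bound: it ignores the KV cache, activations, and the per-block scale factors that real quantization formats add.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts taken from the article; labels are for display only.
models = {
    "Qwen3 0.6B": 0.6e9,
    "Qwen 3.5 2B": 2e9,
    "Qwen 3.5 9B": 9e9,
}

for name, params in models.items():
    fp16 = weight_footprint_gb(params, 16)  # unquantized half precision
    int4 = weight_footprint_gb(params, 4)   # 4-bit quantized
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

By this estimate, a 2B model drops from about 4 GB of weights at 16-bit precision to about 1 GB at 4-bit, which is the kind of reduction that makes running it within a smartphone's memory budget plausible.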