LLM Hits 56 Tokens/Sec on Neural Engine
A developer has successfully run Alibaba's Qwen 3.5 0.8B model directly on an M4 Pro's Neural Engine, achieving ~56 tokens/second. The demo uses the open-source ANEMLL library with Core ML and Swift, showcasing a significant step forward for on-device AI performance.
The M4 Pro's 16-core Neural Engine provides the dedicated hardware acceleration behind this result. The Neural Engine is rated at 38 trillion operations per second on the base M4 chip. This specialized silicon is designed to run machine learning models more power-efficiently than the CPU or GPU.
The open-source ANEMLL library serves as the bridge: it converts Hugging Face models into the Core ML format. The project's goal is a complete open-source pipeline that optimizes common LLM architectures for on-device inference on the Apple Neural Engine (ANE).
Small models in Alibaba's Qwen family are a natural fit for this kind of deployment. Qwen2 0.5B, for example, is a compact, Transformer-based model with about 500 million parameters, which suits resource-constrained environments. It was pretrained on a large multilingual dataset and supports a context length of up to 32K tokens.
Running such a model on the ANE takes some clever engineering, because the Neural Engine supports only a fixed set of operations. RMS Normalization, which is common in modern LLMs, is not among them, so ANEMLL employs "hacks" that manipulate the input data so that the ANE's native LayerNorm operation produces the correct result.
Because the Neural Engine is accessed through the Core ML framework, processing happens on-device, which enhances privacy and lets applications stay responsive without a network connection. Running on the ANE also reduces memory footprint and power consumption, both critical on battery-powered hardware.
Core ML Tools can further optimize models for the ANE through techniques such as weight palettization and 8-bit quantization. The M4 hardware also includes a faster int8-int8 compute path, which can yield significant latency benefits for properly quantized models.
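
The RMS Normalization workaround mentioned above can be illustrated with a small numerical sketch. The article does not show ANEMLL's actual code, so the following is an assumption about one way such a trick can work: concatenating the input with its own negation gives a vector whose mean is zero and whose variance equals the mean square of the original input, so a plain LayerNorm over the mirrored vector reproduces RMSNorm in its first half. The function names are hypothetical, learned scale weights are omitted, and pure Python stands in for the Core ML operations.

```python
import math


def layer_norm(values, eps=1e-6):
    """Plain LayerNorm over a vector, without learned scale or bias."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


def rms_norm(values, eps=1e-6):
    """Reference RMSNorm, without a learned scale."""
    mean_square = sum(v * v for v in values) / len(values)
    return [v / math.sqrt(mean_square + eps) for v in values]


def rms_norm_via_layer_norm(values, eps=1e-6):
    """Emulate RMSNorm using only LayerNorm (assumed form of the trick).

    The mirrored vector [x, -x] has zero mean, and its variance is the
    mean of x squared, so LayerNorm divides by the RMS of x.
    """
    mirrored = list(values) + [-v for v in values]
    normalized = layer_norm(mirrored, eps)
    # Keep only the half that corresponds to the original input.
    return normalized[: len(values)]


if __name__ == "__main__":
    x = [0.5, -1.25, 3.0, 2.0]
    reference = rms_norm(x)
    emulated = rms_norm_via_layer_norm(x)
    max_diff = max(abs(a - b) for a, b in zip(reference, emulated))
    print("max abs difference:", max_diff)
```

The trade-off in this sketch is that the normalization runs over a vector twice as wide as the input, in exchange for staying on an operation the hardware supports natively.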