Apple Neural Engine Demo Shows Fast On-Device LLM Speeds

A new demo shows a small language model running at roughly 56 tokens per second on the Neural Engine of an Apple M4 Pro. The demonstration used LUT6 quantization, in which weights are stored as 6-bit look-up table indices, and hints at future on-device models that could use a hybrid ANE and GPU approach for more complex tasks.

The use of 6-bit look-up table (LUT) quantization is a key technical detail, sitting between the aggressive compression of 4-bit formats and the higher accuracy of 8-bit formats. Research indicates that 6-bit quantization can achieve near-lossless model quality while significantly reducing memory footprint and computational cost, which suits the thermal and power constraints of on-device processing. In a LUT scheme, each weight is stored as a short index into a small table of shared values (64 entries at 6 bits), and some implementations go further by replacing multiplications with table look-ups, a hardware-friendly approach that can boost performance on specialized silicon.

The reported ~56 tokens per second on an M4 Pro is noteworthy next to other on-device benchmarks. For instance, a 64GB M4 Pro running a 32B-parameter model with 4-bit quantization can achieve around 11-14 tokens per second. The model size in this demo is unspecified, so the two figures are not directly comparable; the higher token rate suggests either a smaller, highly optimized model or significant gains from the LUT6 method on the Neural Engine.

A hybrid ANE and GPU approach would aim to balance power and performance by distributing the workload between the two. In LLM inference, this could involve the ANE managing the key-value (KV) cache, which stores the attention keys and values of earlier tokens, while the GPU handles the more parallelizable computations of the attention mechanism. The goal of such a split is to keep memory bandwidth, a common bottleneck in transformer-based models, from limiting throughput.

This demonstration aligns with Apple's broader strategy of vertical integration and on-device AI, prioritizing privacy, low latency, and offline capability. By developing both the custom silicon and software frameworks such as Core ML and MLX, Apple can tightly couple model optimization with the specific architecture of the Neural Engine. This contrasts with competitors who often rely more heavily on cloud-based processing for complex AI tasks.

Recent reverse-engineering efforts by the developer community have started to shed light on the ANE's architecture, revealing a multi-stage pipeline design that benefits from chained operations. Understanding these hardware-level details allows models to be optimized more effectively for on-device execution, bypassing some of Core ML's abstraction layers for maximum performance.

Looking ahead, this hybrid processing model is likely a foundational element of Apple's roadmap for more complex on-device AI. Future silicon, such as the rumored M5, is expected to further enhance AI processing with features like Neural Accelerators within each GPU core. This indicates a continued focus on building out the hardware to support more sophisticated generative AI features directly on Apple devices.
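
As a rough illustration of the 6-bit look-up table idea discussed earlier, the sketch below clusters a list of weights into 64 shared values and stores each weight as an index into that table. It is not the demo's method: the 1-D k-means clustering, the function names (build_lut, quantize, dequantize, index_bytes), the synthetic Gaussian weights, and the 1-billion-parameter figure are all assumptions chosen for illustration, and real toolchains typically build tables per layer or per block rather than one table for everything.

```python
import random


def nearest(lut, w):
    """Return the index of the table entry closest to weight w."""
    return min(range(len(lut)), key=lambda i: abs(lut[i] - w))


def build_lut(weights, bits=6, iters=8):
    """Build a table of 2**bits shared values with 1-D k-means (an assumed method)."""
    k = 2 ** bits
    lo, hi = min(weights), max(weights)
    # Start with entries spread evenly across the observed weight range.
    lut = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for w in weights:
            idx = nearest(lut, w)
            sums[idx] += w
            counts[idx] += 1
        # Move each entry to the mean of its members; unused entries stay put.
        lut = [sums[i] / counts[i] if counts[i] else lut[i] for i in range(k)]
    return lut


def quantize(weights, lut):
    """Replace every weight with the index of its nearest table entry."""
    return [nearest(lut, w) for w in weights]


def dequantize(indices, lut):
    """Recover approximate weights by looking each index up in the table."""
    return [lut[i] for i in indices]


def index_bytes(n_weights, bits):
    """Bytes needed for the stored values alone (table overhead not included)."""
    return n_weights * bits // 8


# Synthetic stand-in for one layer's weights (hypothetical data).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(2048)]

lut = build_lut(weights)
indices = quantize(weights, lut)
restored = dequantize(indices, lut)
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)

print("table entries:", len(lut))
print("mean squared error:", mse)

# Hypothetical 1-billion-parameter model: 16-bit floats vs 6-bit indices.
print("fp16 bytes:", index_bytes(1_000_000_000, 16))
print("LUT6 bytes:", index_bytes(1_000_000_000, 6))
```

The storage comparison at the end shows where the memory saving comes from: each weight shrinks from 16 bits to a 6-bit index, with the small tables adding only minor overhead on top.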
