Apple M5 GPUs Add Tensor Cores, Boosting On-Device AI
A closer look at Apple's new M5 silicon reveals that its GPUs include tensor cores for the first time. The upgrade is expected to deliver a 4x increase in FP16 throughput, with 8x for FP8 and a potential 16x for FP4 on the horizon. The added throughput is expected to let new open-source LLMs running locally on 128GB M5 Max configurations rival the performance of cloud-based models, a significant advance for on-device AI.
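To put the 128GB figure in perspective, here is a rough back-of-the-envelope sketch of how much memory model weights alone would need at each of the precisions named above. The parameter counts are hypothetical examples rather than specific models, and the estimate deliberately ignores the KV cache, activations, quantization scale factors, and memory reserved by the operating system, so real headroom would be smaller.

```python
# Rough weight-memory estimate. All model sizes here are hypothetical examples.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "FP4": 4}
UNIFIED_MEMORY_GB = 128  # the M5 Max configuration mentioned in the article


def weight_memory_gb(params_billions: float, bits: int) -> float:
    """Memory for the weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 1e9


def fits(params_billions: float, fmt: str, budget_gb: float = UNIFIED_MEMORY_GB) -> bool:
    """True if the weights alone fit within the memory budget."""
    return weight_memory_gb(params_billions, BITS_PER_WEIGHT[fmt]) <= budget_gb


if __name__ == "__main__":
    for params in (8, 70, 120):  # hypothetical sizes, in billions of parameters
        for fmt, bits in BITS_PER_WEIGHT.items():
            gb = weight_memory_gb(params, bits)
            verdict = "fits" if fits(params, fmt) else "does not fit"
            print(f"{params}B @ {fmt}: {gb:.0f} GB of weights, {verdict} in {UNIFIED_MEMORY_GB} GB")
```

Under these assumptions, a 70-billion-parameter model needs 140 GB for its weights at FP16, which exceeds 128 GB, but only 70 GB at FP8 and 35 GB at FP4.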
The introduction of tensor cores is a strategic departure from relying solely on the Apple Neural Engine (ANE) for AI acceleration. Introduced with two cores in the A11 Bionic chip in 2017, the ANE has grown to a 16-core design in recent chips and focuses on low-power inference for features like Face ID. The new tensor cores signal a dedicated push to accelerate a wider range of machine learning workloads directly within the GPU.
This architectural shift mirrors a broader industry trend of moving beyond general-purpose GPU computing for AI. While the ANE is optimized for the power efficiency that mobile applications need, tensor cores are designed explicitly for the mixed-precision matrix multiplication that forms the foundation of transformer models and other complex AI tasks. This specialization allows significantly higher throughput on those computations.
The move to lower-precision formats like FP8 and FP4 is critical for on-device AI because it sharply reduces the memory footprint and computational load of large models: relative to FP16, FP8 halves the bits stored per weight and FP4 quarters them. Competitors like NVIDIA have already integrated support for these formats in their recent architectures, demonstrating their importance for both training and inference. Apple's adoption of tensor cores capable of these formats is a direct response to this high-performance computing trend.
By integrating tensor cores, Apple is better positioning its hardware to compete with NVIDIA's dominance in the machine learning space. While the ANE has been highly effective for specific, curated experiences, the programmability of GPU-based tensor cores gives developers more direct control and flexibility, allowing greater optimization of novel and complex AI models that may not be well suited to the ANE's architecture.
The strategic focus on on-device processing addresses key market demands for privacy, latency, and cost-efficiency. Running powerful AI models locally avoids the need for constant cloud connectivity and the associated server costs, a significant advantage in regulated industries and for applications requiring real-time responsiveness. This hardware enhancement strengthens Apple's vertical integration, creating a more powerful and secure ecosystem for developers and users.
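For readers unfamiliar with the term, the mixed-precision matrix multiplication mentioned above can be illustrated with a small software emulation. The sketch below is pure Python and says nothing about how Apple's hardware or Metal actually implements it: the function names are hypothetical, and the choice of FP16 inputs with an FP32 accumulator is an assumption borrowed from how the technique is commonly described. It rounds inputs to FP16 and lets the caller choose the precision of the running sum, which is where the "mixed" part matters.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def matmul_mixed(a, b, accumulate=to_fp32):
    """Multiply matrices (lists of rows) with FP16 inputs.

    `accumulate` rounds the running sum after every addition, emulating an
    accumulator of that precision. This is an illustrative emulation only.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                # The product of two FP16 values is exact in a Python float.
                product = to_fp16(a[i][k]) * to_fp16(b[k][j])
                acc = accumulate(acc + product)
            out[i][j] = acc
    return out


if __name__ == "__main__":
    n = 4096
    row = [[1.0] * n]                 # 1 x n matrix of ones
    col = [[1.0] for _ in range(n)]   # n x 1 matrix of ones
    print("FP32 accumulator:", matmul_mixed(row, col, accumulate=to_fp32)[0][0])
    print("FP16 accumulator:", matmul_mixed(row, col, accumulate=to_fp16)[0][0])
```

Summing 4096 ones shows why the accumulator is usually kept wider than the inputs: the FP32 accumulator reaches 4096.0, while the FP16 accumulator stalls at 2048.0, because 2049 is not representable in half precision and rounds back down.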