Apple M4 ANE Performance Uncovered

Developers have reverse-engineered Apple's M4 Neural Engine (ANE) and found that its real-world FP16 throughput is 19 TFLOPS, half the marketed "38 TOPS" figure. Its efficiency, however, is 6.6 TFLOPS/W, reportedly 80 times better than an Nvidia A100.
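The headline numbers can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in this briefing; the function names are illustrative, and the power estimate assumes the throughput and efficiency figures describe the same operating point, which the briefing does not state.

```python
# Figures as quoted in the briefing; none are measured here.
MARKETED_TOPS = 38.0             # marketed Neural Engine figure
MEASURED_FP16_TFLOPS = 19.0      # reported real-world FP16 throughput
EFFICIENCY_TFLOPS_PER_W = 6.6    # reported efficiency
REPORTED_A100_ADVANTAGE = 80.0   # reported efficiency ratio vs. an Nvidia A100


def marketing_multiplier(marketed: float, measured: float) -> float:
    """Ratio of the marketed figure to the measured one."""
    return marketed / measured


def implied_power_watts(throughput_tflops: float, tflops_per_watt: float) -> float:
    """Power draw implied by a throughput and an efficiency figure.

    Assumption: both figures describe the same operating point.
    """
    return throughput_tflops / tflops_per_watt


def implied_baseline_efficiency(tflops_per_watt: float, advantage: float) -> float:
    """Efficiency of the comparison device implied by the reported ratio."""
    return tflops_per_watt / advantage


if __name__ == "__main__":
    mult = marketing_multiplier(MARKETED_TOPS, MEASURED_FP16_TFLOPS)
    watts = implied_power_watts(MEASURED_FP16_TFLOPS, EFFICIENCY_TFLOPS_PER_W)
    base = implied_baseline_efficiency(EFFICIENCY_TFLOPS_PER_W, REPORTED_A100_ADVANTAGE)
    print(f"marketed / measured: {mult:.1f}x")
    print(f"implied ANE power:   {watts:.2f} W")
    print(f"implied baseline:    {base:.4f} TFLOPS/W")
```

Under those assumptions, the quoted figures imply a draw of roughly 2.9 W for the ANE and a comparison baseline of about 0.08 TFLOPS/W.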

The "38 TOPS" figure for the M4's Neural Engine is a marketing convention, obtained by doubling its actual 19 TFLOPS of FP16 performance. Benchmarks show that the ANE hardware does not execute INT8 operations twice as fast; it dequantizes INT8 weights to FP16 before computation, so INT8 saves memory bandwidth but gives no compute speedup.

Deep dives into the ANE's architecture show that it is fundamentally a convolution engine, which makes 1x1 convolutions significantly more efficient than standard matrix multiplications. Performance depends heavily on its ~32 MB of on-chip SRAM: workloads that exceed this size spill to DRAM and lose around 30% of their performance, a phenomenon termed the "SRAM cliff". To reach the ANE's peak throughput, developers must chain 16 to 64 operations in a single program, because a single operation can waste up to 70% of its capacity.

The M4 chip is manufactured on TSMC's second-generation 3nm process (N3E), an enhancement over the N3B process used for the M3 series. The improved node allows 28 billion transistors on the base M4, a 12% increase over the M3's 25 billion. This fabrication advance is a key enabler of the chip's gains in performance and efficiency.

The ANE's extreme power efficiency is enabled by hard power gating, which lets it shut down completely and consume zero milliwatts when idle, a significant advantage over simple clock gating. While its raw throughput does not compare to data center GPUs like the A100, its performance per watt is transformative for battery-powered, on-device AI.

Bypassing Apple's standard CoreML framework provides a significant performance boost for latency-sensitive tasks. Direct access to the ANE's private APIs eliminates the 2-4x overhead CoreML can introduce on small operations, a crucial factor for workloads like real-time inference. This highlights a software bottleneck for teams aiming to maximize hardware utilization.

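The point about 1x1 convolutions rests on a mathematical equivalence: a matrix multiplication can be rewritten as a 1x1 convolution that produces the same result. The pure-Python sketch below demonstrates only that equivalence. It does not use any Apple API and says nothing about how CoreML or the ANE performs the mapping; the function names and tensor layout (channels, height, width) are assumptions chosen for illustration.

```python
def matmul(x, w):
    """Reference matrix product. x is rows x c_in, w is c_in x c_out."""
    c_in, c_out = len(w), len(w[0])
    return [
        [sum(row[c] * w[c][o] for c in range(c_in)) for o in range(c_out)]
        for row in x
    ]


def conv1x1(feature_map, kernel):
    """1x1 convolution, stride 1, no padding.

    feature_map is indexed [c_in][height][width].
    kernel is indexed [c_out][c_in]; the 1x1 spatial dims are dropped.
    Returns a tensor indexed [c_out][height][width].
    """
    c_in = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [
            [
                sum(kernel[o][c] * feature_map[c][h][col] for c in range(c_in))
                for col in range(width)
            ]
            for h in range(height)
        ]
        for o in range(len(kernel))
    ]


def matmul_via_conv1x1(x, w):
    """Compute x @ w by reshaping the operands into a 1x1 convolution."""
    rows, c_in, c_out = len(x), len(w), len(w[0])
    # Each row of x becomes one spatial position; its columns become channels.
    feature_map = [[[x[r][c] for r in range(rows)]] for c in range(c_in)]
    # The weight matrix is transposed into [c_out][c_in] kernel layout.
    kernel = [[w[c][o] for c in range(c_in)] for o in range(c_out)]
    out = conv1x1(feature_map, kernel)  # indexed [c_out][0][rows]
    # Undo the layout change so the result is rows x c_out again.
    return [[out[o][0][r] for o in range(c_out)] for r in range(rows)]


if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]
    w = [[1, 0], [0, 1], [2, 2]]
    print(matmul(x, w))
    print(matmul_via_conv1x1(x, w))
```

Both calls print the same matrix, which is why a convolution-oriented engine can serve matrix-multiplication workloads once they are expressed in this form.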
Apple plans to integrate variants of the M4, including the more powerful Pro and Max versions, across the entire Mac product line, from the iMac and Mac mini to the MacBook Pro and Mac Studio. This signals a strategic push to standardize the new AI-focused architecture and its Neural Engine capabilities across all hardware segments.
