M4 Neural Engine Internals Reverse-Engineered
An engineer has reverse-engineered Apple's M4 Neural Engine, revealing details about its extreme efficiency for on-device ML. The analysis, which is getting significant attention, also highlights the performance overhead from CoreML. The findings offer a rare look into Apple's silicon, providing key insights for developers optimizing on-device AI applications.
The work, a collaboration between developer Manjeet Singh and the AI model Claude Opus, bypassed Apple's official CoreML framework and interfaced directly with the ANE's private APIs. This enabled what is reported as the first training of a neural network, a 109M-parameter Llama2-architecture transformer, on hardware Apple has designated exclusively for inference. To get there, the project mapped more than 40 private Objective-C classes down to the IOKit kernel driver.
Apple markets the 16-core M4 Neural Engine at 38 trillion operations per second (TOPS), but that figure is based on INT8 precision and can be misleading. The reverse-engineering analysis found that the true peak is 19 TFLOPS at FP16 precision, and that INT8 saves memory bandwidth rather than speeding up computation. This gives developers working with 16-bit floating-point models a more accurate performance ceiling.
A key finding is the performance cost of CoreML, which can add 2 to 4 times the overhead of direct API access for small operations. The dispatch overhead from XPC and IOKit alone is approximately 0.095 ms per operation, a critical figure for developers building latency-sensitive applications such as real-time LLM token decoding.
The analysis also uncovered architectural details. The ANE is fundamentally a convolution engine, so structuring computations as 1x1 convolutions can yield much higher throughput than conventional matrix multiplication. The research also pinpointed approximately 32 MB of on-chip SRAM; exceeding this capacity forces spills to DRAM and can cut throughput by 30%.
At a peak power draw of just 2.8 watts, the M4 ANE achieves 6.6 TFLOPS per watt. According to the analysis, that is roughly 80 times more power-efficient per FLOP than an NVIDIA A100 datacenter GPU, which underlines the ANE's design for sustained, battery-conscious, on-device AI.
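The 1x1-convolution point above rests on a simple identity: a matrix multiply over (tokens x channels) data gives the same numbers as a 1x1 convolution in which the tokens are laid out along a spatial axis. The following is a minimal pure-Python sketch of that equivalence only. It does not touch the ANE or any Apple API, and the function names and tensor layouts are illustrative assumptions rather than anything taken from the reverse-engineering work.

```python
# Sketch: a matrix multiply rewritten as a 1x1 convolution (pure Python).
# Names and layouts are hypothetical; nothing here calls the ANE or CoreML.

def matmul(x, w):
    """x: [T][C_in], w: [C_in][C_out] -> [T][C_out]."""
    c_in, c_out = len(w), len(w[0])
    return [[sum(row[i] * w[i][o] for i in range(c_in)) for o in range(c_out)]
            for row in x]

def conv1x1(inp, kernel):
    """inp: [C_in][H][W], kernel: [C_out][C_in] (1x1 spatial dims implicit).

    Each output pixel is a dot product over the input channels at that pixel.
    Returns [C_out][H][W].
    """
    c_in = len(inp)
    height, width = len(inp[0]), len(inp[0][0])
    return [[[sum(k_row[i] * inp[i][h][col] for i in range(c_in))
              for col in range(width)]
             for h in range(height)]
            for k_row in kernel]

def matmul_via_conv1x1(x, w):
    """Compute x @ w by reshaping both operands and calling conv1x1."""
    tokens, c_in, c_out = len(x), len(w), len(w[0])
    # Lay the tokens out along the width axis: [C_in][1][T].
    inp = [[[x[t][i] for t in range(tokens)]] for i in range(c_in)]
    # Transpose the weights into one kernel row per output channel: [C_out][C_in].
    kernel = [[w[i][o] for i in range(c_in)] for o in range(c_out)]
    out = conv1x1(inp, kernel)  # [C_out][1][T]
    # Move back to [T][C_out].
    return [[out[o][0][t] for o in range(c_out)] for t in range(tokens)]

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # 2 tokens, 3 input channels
w = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]     # 3 input channels, 2 output channels
assert matmul(x, w) == matmul_via_conv1x1(x, w)
```

The arithmetic is identical either way; the reported throughput difference comes from how the hardware schedules the convolution form, which this sketch does not model.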
While this breakthrough opens new possibilities, it relies on private APIs that any macOS update could break. The current method is also CPU-bottlenecked and has to recompile the model for every training batch, which makes it impractical for production training but invaluable for understanding the hardware's true potential.
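The figures quoted in this summary (19 TFLOPS at FP16, roughly 0.095 ms of dispatch overhead per operation, roughly 32 MB of on-chip SRAM) can be combined into a rough back-of-envelope budget. The sketch below is only arithmetic on those numbers. The helper names, the example layer sizes, the count of 12 dispatches per token, and the reading of "32 MB" as 32 MiB are all assumptions made for illustration.

```python
# Back-of-envelope budget using the figures quoted in the article.
# All helper names and example workloads are hypothetical.

PEAK_FP16_TFLOPS = 19.0            # quoted FP16 peak
DISPATCH_OVERHEAD_MS = 0.095       # quoted XPC + IOKit overhead per operation
SRAM_BYTES = 32 * 1024 * 1024      # quoted ~32 MB, assumed here to mean 32 MiB

def compute_time_ms(flops):
    """Ideal time to execute `flops` floating-point operations at FP16 peak."""
    return flops / (PEAK_FP16_TFLOPS * 1e12) * 1e3

def dispatch_overhead_ms(num_dispatches):
    """Total fixed dispatch cost for a given number of separate operations."""
    return num_dispatches * DISPATCH_OVERHEAD_MS

def fits_in_sram(num_elements, bytes_per_element=2):
    """True if a buffer of FP16 (2-byte) elements fits in the quoted SRAM size."""
    return num_elements * bytes_per_element <= SRAM_BYTES

# A small 256x256x256 matmul: 2 * 256^3 FLOPs of useful work.
small_matmul_ms = compute_time_ms(2 * 256**3)
# At peak, the math takes far less time than a single dispatch.
assert small_matmul_ms < DISPATCH_OVERHEAD_MS

# Hypothetical decoder step issuing 12 separate dispatches per token.
per_token_overhead_ms = dispatch_overhead_ms(12)
tokens_per_second_ceiling = 1000.0 / per_token_overhead_ms

print(f"small matmul at peak: {small_matmul_ms:.5f} ms")
print(f"dispatch overhead per token: {per_token_overhead_ms:.2f} ms")
print(f"overhead-only ceiling: {tokens_per_second_ceiling:.0f} tokens/s")
print("2048x2048 FP16 buffer fits in SRAM:", fits_in_sram(2048 * 2048))
print("8192x4096 FP16 buffer fits in SRAM:", fits_in_sram(8192 * 4096))
```

Under these assumptions, small operations are dominated by the fixed dispatch cost rather than by compute, which is consistent with the article's point that overhead matters most for latency-sensitive workloads.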