Apple's Neural Engine Cracked for Local AI Training
Developers have unlocked Apple's on-device Neural Engine (ANE), enabling full model training directly on ANE silicon. Reported benchmarks show the M4 chip achieving 1.78 TFLOPS, with power efficiency put at 80 times that of an NVIDIA A100 GPU. This breakthrough could turn MacBooks into portable AI supercomputers for local model development.
The work was led by developer Manjeet Singh, who reverse-engineered the ANE's private, undocumented APIs to enable full neural network training. Doing so meant bypassing Apple's CoreML framework, which traditionally restricts the Neural Engine to inference tasks. Singh's open-source project on GitHub provides direct access to the ANE hardware and demonstrates backpropagation running on the silicon for the first time.
A key finding from this work is the gap between the ANE's measured performance and its marketed specifications. Apple advertises the M4 ANE at 38 TOPS (trillions of operations per second), but its measured peak throughput is 19 TFLOPS at FP16 precision. The chip gains no speed advantage from INT8: it dequantizes INT8 values to FP16 before computation, so the 38 TOPS figure is a theoretical marketing number.
The ANE's architecture is fundamentally optimized for convolutions, not the matrix multiplications typical of GPU and TPU workloads. Singh found that reformulating operations as 1x1 convolutions can triple throughput. Benchmarking also revealed approximately 32MB of undisclosed on-chip SRAM, a figure developers need to know when optimizing performance.
Despite the breakthrough, practical training on the ANE is still in its early stages. The CPU handles a significant portion of the workload, including loss computation and optimizer updates, and this CPU-side work is a bottleneck, running about 10 times slower than the ANE's own processing. Furthermore, the reliance on private APIs means that any macOS update from Apple could break this functionality.
This development has significant implications for MLOps and on-device AI. The ANE's power efficiency of 6.6 TFLOPS per watt makes it vastly more efficient than datacenter GPUs for specific tasks. For industries like insurance and retail, this could allow more powerful, privacy-preserving models for risk assessment and personalization to run directly on edge devices.
For actuaries and underwriters, on-device AI could transform risk modeling by allowing for real-time analysis of sensitive data without it ever leaving the device, enhancing both privacy and the speed of decision-making. In consumer fashion, this technology could power sophisticated, on-device recommendation systems and virtual try-on experiences that are faster and more personalized. The New York City tech scene, a hub for applied AI, is poised to leverage such advancements. Local startups in fields from fintech to healthcare could develop novel applications that take advantage of powerful, efficient, on-device processing. Companies like Clarifai and AlphaSense are already pushing the boundaries of AI in the city.
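For readers curious about the 1x1 convolution finding mentioned earlier, the sketch below shows why such a reformulation is possible at all: a matrix multiplication and a 1x1 convolution compute the same numbers once the data is laid out as channels and pixels. It is a dependency-free illustration of the math only. It does not touch the ANE or any Apple API, it says nothing about why the hardware runs one form faster, and the function names (`conv1x1`, `matmul_via_conv1x1`) are hypothetical rather than taken from Singh's project.

```python
def conv1x1(fmap, kernel):
    """Apply a 1x1 convolution.

    fmap:   input feature map indexed as fmap[c][h][w]
    kernel: weights indexed as kernel[o][c] (the 1x1 spatial dims are implicit)
    Returns an output feature map indexed as out[o][h][w].
    """
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [
            [sum(kernel[o][c] * fmap[c][h][w] for c in range(c_in))
             for w in range(width)]
            for h in range(height)
        ]
        for o in range(len(kernel))
    ]


def matmul(x, w):
    """Plain matrix product of x (n x c_in) and w (c_in x c_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def matmul_via_conv1x1(x, w):
    """Compute the same product by recasting it as a 1x1 convolution."""
    n, c_in = len(x), len(x[0])
    # Each row of x becomes one "pixel" in a 1 x n image, and its entries
    # become that pixel's channels: fmap[c][0][j] = x[j][c].
    fmap = [[[x[j][c] for j in range(n)]] for c in range(c_in)]
    # The weight matrix is transposed into kernel[o][c] = w[c][o].
    kernel = [list(col) for col in zip(*w)]
    out = conv1x1(fmap, kernel)
    # Read the result back as rows: result[j][o] = out[o][0][j].
    return [[out[o][0][j] for o in range(len(kernel))] for j in range(n)]


x = [[1, 2, 3], [4, 5, 6]]          # 2 rows, 3 input channels
w = [[1, 0], [0, 1], [2, 2]]        # 3 input channels, 2 output channels

print(matmul(x, w))                 # [[7, 8], [16, 17]]
print(matmul_via_conv1x1(x, w))     # [[7, 8], [16, 17]]
```

Both calls print the same matrix, so a framework targeting convolution-oriented hardware is free to pick whichever form the chip executes faster without changing the result.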