Developer Trains Model on Neural Engine

A developer has reportedly trained a neural network directly on Apple's Neural Engine (ANE), hardware that Apple's public Core ML framework exposes for inference only. The work relies on reverse-engineered private frameworks rather than public APIs, and it could open the door to on-device model training, not just inference.

Developer Manjeet Singh reverse-engineered Apple's private frameworks to gain direct access to the Neural Engine hardware, bypassing the public Core ML framework, which Apple has restricted to inference-only tasks. Singh's open-source project maps more than 40 private classes, including `_ANEClient` and `_ANECompiler`, down to the IOKit kernel driver. This allows models to be compiled in memory, a critical step for training because the weights change after every step and each weight update requires recompilation.

Initial tests on an M4 chip, training a single transformer layer, achieved a sustained 1.78 TFLOPS at 11.2% ANE utilization. For comparison, the M4's ANE is marketed with a peak performance of 38 TOPS, though its real FP16 throughput is closer to 19 TFLOPS.

The research also revealed that the ANE's core computational primitive is convolution, not matrix multiplication. Expressing matrix multiplications as 1x1 convolutions can improve throughput by approximately 3x, a significant optimization for future on-device ML development.

The project runs both the forward and backward passes on the ANE, while the Adam optimizer and the weight gradient calculations are handled by the CPU. The code, available on GitHub under an MIT license, currently supports single-layer training with synthetic data and has known resource leaks that require a process restart after about 119 compilations.
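The equivalence between a matrix multiplication and a 1x1 convolution can be illustrated with a minimal pure-Python sketch. It is not the project's code and does not touch the ANE; the function names (`matmul`, `conv2d_1x1`, `matmul_via_conv`) and the data layout are assumptions chosen for illustration. It shows only that the two operations produce the same numbers, not the reported throughput gain.

```python
# Illustrative reference code only: plain Python lists, no ANE or Core ML calls.

def matmul(x, w):
    """Multiply x [n][c_in] by w [c_in][c_out], giving [n][c_out]."""
    n, c_in, c_out = len(x), len(w), len(w[0])
    return [[sum(x[i][k] * w[k][j] for k in range(c_in)) for j in range(c_out)]
            for i in range(n)]

def conv2d_1x1(inp, kernel):
    """Apply a 1x1 convolution.

    inp:    [c_in][height][width] feature map
    kernel: [c_out][c_in] weights (the 1x1 spatial dims are implicit)
    Returns a [c_out][height][width] feature map.
    """
    c_in, height, width = len(inp), len(inp[0]), len(inp[0][0])
    return [[[sum(kernel[o][c] * inp[c][y][x] for c in range(c_in))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(kernel))]

def matmul_via_conv(x, w):
    """Compute the same product as matmul(x, w) using conv2d_1x1."""
    n, c_in, c_out = len(x), len(w), len(w[0])
    # Treat each row of x as one pixel of a 1 x n image with c_in channels.
    inp = [[[x[i][c] for i in range(n)]] for c in range(c_in)]
    # The convolution kernel is w transposed: one weight vector per output channel.
    kernel = [[w[c][o] for c in range(c_in)] for o in range(c_out)]
    out = conv2d_1x1(inp, kernel)  # shape [c_out][1][n]
    # Move channels back to the last axis to recover the [n][c_out] layout.
    return [[out[o][0][i] for o in range(c_out)] for i in range(n)]

x = [[1, 2, 3], [4, 5, 6]]    # 2 rows, 3 input features
w = [[1, 0], [0, 1], [2, 2]]  # 3 input features, 2 output features

assert matmul(x, w) == matmul_via_conv(x, w)
```

Each output pixel of a 1x1 convolution is a dot product across input channels, which is exactly one row-times-column product of the matrix multiplication. The reshaping is pure bookkeeping, so in principle the same weights can be handed to convolution-oriented hardware without changing the result.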
