Apple ANE Trained

A reverse‑engineering effort has unlocked full forward and backward-pass training on Apple’s Neural Engine (M4): a 109M‑parameter model runs at 91 ms/step and a 596M (Qwen3‑0.6B) build at 412 ms/step, with INT8 delivering a 1.88× speedup. The project’s GitHub repository has 6.3K stars, and the reveal post drew 3,853 likes and 261K views, signaling real momentum for on‑device model training on Apple silicon. (x.com)

The public lead on the project is the GitHub user "maderix" (Manjeet Singh), who published the repository and has posted follow-up writeups on his Substack under the same handle. The codebase talks to previously undocumented private frameworks: it reverse-engineers _ANEClient, _ANECompiler, and the ANE's MIL (Model Intermediate Language) to emit custom compute graphs that perform both forward and backward passes on the ANE. Independent writeups and the repo’s benchmark reports cite microbenchmarks claiming sustained ANE throughput in the multi-teraflop range, plus a reported 9.3 ms microstep on an M4 machine for one specific workload. The project demonstrates training a transformer implementation derived from Llama‑style architectures and includes a multi-model dashboard with evaluation hooks for GQA datasets and Weights & Biases integration. Repository activity shows rapid iteration: commits in early March 2026 added probe/telemetry tooling for M5 optimization and a bridge that introduced INT8 W8A8 support to the codepath. The work has drawn mainstream attention and community discussion, including coverage across ML blogs and a Hacker News thread that surfaced after the release, and the GitHub project has accumulated thousands of stars and multiple forks since its first push.
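For readers unfamiliar with the term, "W8A8" means both weights and activations are held as 8-bit integers. The Python sketch below illustrates the general idea with symmetric per-tensor quantization and an integer dot product. It is not taken from the project's code: the function names `quantize` and `w8a8_dot` and the sample values are hypothetical, and the repository's actual INT8 scheme may differ.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization (illustrative assumption,
    not the project's method). Returns (integer values, scale)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed INT8 range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def w8a8_dot(weights, activations):
    """Dot product with both operands quantized to INT8 (W8A8)."""
    qw, sw = quantize(weights)
    qa, sa = quantize(activations)
    acc = sum(w * a for w, a in zip(qw, qa))  # integer accumulation
    return acc * sw * sa                      # rescale back to real units


# Hypothetical sample values, chosen only for illustration.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The quantized result tracks the floating-point one closely while the inner loop works on small integers, which is the general reason INT8 paths tend to run faster on hardware with dedicated integer units.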

