ANE pushed to training limits

Developers demonstrated a full 48.8M-parameter GPT pretrained from scratch on the Apple Neural Engine (ANE) using Rust, an Apple Silicon startup showed on-device voice AI with RAG, and independent benchmarks reported roughly 5× the speed of CoreML on the ANE. Together, those tests point to real headroom for on-device training and inference, and they have renewed calls for Apple to optimize ANE tooling and SDKs.

ncdrone's "train-my-mac" GitHub repo documents a pipeline that trains transformer models from scratch on macOS by driving the ANE through private APIs called from native Objective-C, together with MLX. Its README explicitly references bf16 support and coordination of the ANE and GPU as accelerators. (github.com) A community Rust effort, nktkt/ane-rust, rewrites the original Objective-C ANE tooling as ~9,600 lines of Rust, with a ~370-line Objective-C bridge to reach private ANE entry points. It shows that direct ANE access is under active development in Rust. (github.com) Orion, an arXiv preprint (arXiv:2603.06728, submitted Mar 6, 2026), presents an end-to-end system that bypasses CoreML by calling Apple's private _ANEClient and _ANECompiler APIs, and it shows stable multi-step training with checkpoint resume on ANE silicon. (arxiv.org) RunAnywhereAI's RCLI repo publishes a macOS on-device voice AI stack (STT + LLM + TTS) with local RAG, claiming sub-200 ms end-to-end latency and 43 voice actions. Christopher Karani's "Espresso" codebase demonstrates compiling MIL programs directly to the ANE and managing IOSurface buffers for KV cache state. Both projects illustrate practical end-to-end pipelines that bypass CoreML for higher ANE utilization. (runanywhereai.github.io)

