Private ANE APIs → 4.7× token speed
Developers experimenting with private Apple Neural Engine (ANE) APIs reported a ~4.7× inference speedup over Core ML (1.08 ms/token vs 5.09 ms/token) by optimizing buffer management and stateful model execution on the ANE. An improvement of that magnitude matters for pushing medium-sized LLMs onto iPhone-class hardware within practical latency budgets.
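
As a quick sanity check on the headline number, the sketch below recomputes the speedup from the two reported per-token latencies and converts them to tokens per second. The function and constant names are illustrative, not taken from either project, and the conversion assumes the figures describe steady-state single-stream decoding, which the posts may or may not intend.

```python
# Per-token latencies as reported in the posts, in milliseconds.
PRIVATE_API_MS_PER_TOKEN = 1.08
CORE_ML_MS_PER_TOKEN = 5.09


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency into a throughput figure."""
    return 1000.0 / ms_per_token


ratio = speedup(CORE_ML_MS_PER_TOKEN, PRIVATE_API_MS_PER_TOKEN)
print(f"speedup:     {ratio:.2f}x")
print(f"private API: {tokens_per_second(PRIVATE_API_MS_PER_TOKEN):.0f} tok/s")
print(f"Core ML:     {tokens_per_second(CORE_ML_MS_PER_TOKEN):.0f} tok/s")
```

The ratio works out to roughly 4.71, consistent with the rounded ~4.7× in the reports.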
Initial posts came from independent developer Chris Karani (x.com) and the ANEMLL project account/repo (x.com); both threads link to code and benchmarking write‑ups rather than commercial announcements. One implementation, in a repo named Espresso (github.com), compiles MIL programs straight to ANE silicon and explicitly manages IOSurface-backed KV‑cache buffers. The ANEMLL project documents IOSurface-backed ring/ping‑pong buffers plus a serial prediction queue to eliminate ANE race conditions on iOS (github.com). A concurrent systems paper, Orion, shows an end‑to‑end pipeline that bypasses Core ML by invoking the private _ANEClient and _ANECompiler interfaces for direct ANE execution and multi‑step checkpointed training (arxiv.org), and ANEMLL publishes converted Llama‑3.2 3B artifacts for ANE inference on Hugging Face (huggingface.co). Apple’s published guidance still positions Core ML as the supported on‑device ML framework (developer.apple.com). Developers have reported stateful‑API alignment failures for ANE state tensors on the Apple Developer Forums (developer.apple.com), and multiple reverse‑engineering projects claiming direct ANE access or backpropagation (e.g., maderix/ANE, LRSnowX/ane) have appeared on GitHub in recent days (github.com).
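
The ping‑pong buffer and serial prediction queue pattern can be illustrated in isolation. The sketch below is not ANEMLL's code and touches no Apple API: plain `bytearray` objects stand in for IOSurface-backed buffers, a single-worker `ThreadPoolExecutor` stands in for a serial dispatch queue, and `fake_step` is a hypothetical placeholder for one model prediction. All class and function names are assumptions made for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class PingPongBuffers:
    """Two fixed-size buffers: each step reads one and writes the other."""

    def __init__(self, size: int) -> None:
        self._bufs = [bytearray(size), bytearray(size)]
        self._write_idx = 0

    @property
    def write_buf(self) -> bytearray:
        return self._bufs[self._write_idx]

    @property
    def read_buf(self) -> bytearray:
        return self._bufs[1 - self._write_idx]

    def swap(self) -> None:
        # The buffer just written becomes the read buffer for the next step.
        self._write_idx = 1 - self._write_idx


class SerialPredictor:
    """Runs prediction steps one at a time, in submission order."""

    def __init__(self, buffers: PingPongBuffers,
                 step_fn: Callable[[int, bytearray, bytearray], None]) -> None:
        self._buffers = buffers
        self._step_fn = step_fn
        # A single worker means no two steps ever touch the buffers at once.
        self._queue = ThreadPoolExecutor(max_workers=1)

    def predict(self, token: int) -> Future:
        return self._queue.submit(self._run, token)

    def _run(self, token: int) -> bytes:
        self._step_fn(token, self._buffers.read_buf, self._buffers.write_buf)
        self._buffers.swap()
        # Copy the result out so callers never hold a live buffer reference.
        return bytes(self._buffers.read_buf)

    def close(self) -> None:
        self._queue.shutdown(wait=True)


def fake_step(token: int, src: bytearray, dst: bytearray) -> None:
    """Hypothetical stand-in for a model step: a running sum in byte 0."""
    dst[0] = (src[0] + token) % 256


predictor = SerialPredictor(PingPongBuffers(size=16), fake_step)
futures = [predictor.predict(t) for t in (1, 2, 3)]
results = [f.result()[0] for f in futures]
predictor.close()
print(results)  # [1, 3, 6]
```

Because every step goes through the single-worker queue, a step can never read a buffer that another step is still writing, which is the kind of race the serial queue is meant to rule out. A ring buffer generalizes the same idea to more than two slots.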