Apple Neural Engine speedup

A compilation approach to Apple Neural Engine (ANE) inference reportedly yields a 4.7x speedup over CoreML by compiling weights and dispatch graphs once, a practical win for shipping local AI demos and production workloads on Apple Silicon. That kind of efficiency matters for exec demos and on-device feature roadmaps where latency and privacy are selling points.

Espresso’s repo reports per-token latency of 1.08 ms and throughput of 926 tok/s on a 6-layer transformer running on an M3 Max (macOS 15), compared with CoreML’s 5.09 ms/token and 196 tok/s on the same model (github.com). The implementation compiles MIL programs directly to the ANE using reverse-engineered private APIs named _ANEClient and _ANEInMemoryModel, and relies on fused multi-layer kernels plus IOSurface zero-copy I/O as part of its performance strategy (github.com).

Independent reverse-engineering work (maderix/ANE) similarly exposes the private _ANEClient/_ANECompiler surface and documents the ANE’s real behavior and constraints; its README explicitly labels the project research rather than production code (github.com). One of those constraints is that weights are fixed at compile time, so changing them forces a recompile. Orion’s paper quantifies a practical mitigation: delta compilation cuts a full ANE recompilation from ≈4,200 ms to 494 ms (8.5×), and the paper reports a 3.8× total training speedup plus 170+ tokens/s for GPT-2 124M on an M4 Max (arxiv.org).

A concise three-slide exec frame maps directly to these technical facts. Slide 1: hard metrics (Espresso’s 1.08 ms/token and 926 tok/s; Orion’s 170+ tok/s on M4 Max) (github.com). Slide 2: risks (private API dependence and the ANE’s weight-at-compile behavior, with ~4.2 s recompiles) (github.com). Slide 3: the ask (a timeboxed 6-week pilot to validate compile-once caching and delta compilation, with success criteria of ≤1.5 ms/token or ≥200 tok/s).

Reproducibility details for an engineering demo are embedded in Espresso: the repo contains a reproduce script (scripts/reproduce_local_real_artifact_claim.sh) and a benchmark workflow that writes machine-readable results under artifacts/benchmarks for the exact 6-layer, dim=768, seqLen=256 test on M3 Max (github.com).
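As a quick sanity check on the figures above, the headline ratios and the Slide 3 pass/fail test can be recomputed in a few lines. This is a minimal sketch, not part of Espresso or Orion: the constants are simply the numbers quoted in this briefing, and the helper names (speedup, meets_pilot_criteria) are hypothetical, introduced only for illustration.

```python
# Figures as quoted in this briefing; nothing here is measured locally.
ESPRESSO_MS_PER_TOKEN = 1.08
ESPRESSO_TOK_PER_S = 926
COREML_MS_PER_TOKEN = 5.09
COREML_TOK_PER_S = 196
FULL_RECOMPILE_MS = 4200   # reported as approximately 4,200 ms
DELTA_RECOMPILE_MS = 494

# Pilot success thresholds from Slide 3.
MAX_MS_PER_TOKEN = 1.5
MIN_TOK_PER_S = 200


def speedup(baseline: float, improved: float) -> float:
    """Ratio of a baseline cost to an improved cost (higher is better)."""
    return baseline / improved


def meets_pilot_criteria(ms_per_token: float, tok_per_s: float) -> bool:
    """Hypothetical helper: at most 1.5 ms/token OR at least 200 tok/s."""
    return ms_per_token <= MAX_MS_PER_TOKEN or tok_per_s >= MIN_TOK_PER_S


latency_speedup = speedup(COREML_MS_PER_TOKEN, ESPRESSO_MS_PER_TOKEN)
throughput_gain = ESPRESSO_TOK_PER_S / COREML_TOK_PER_S
recompile_speedup = speedup(FULL_RECOMPILE_MS, DELTA_RECOMPILE_MS)

print(f"latency speedup vs CoreML:  {latency_speedup:.1f}x")
print(f"throughput gain vs CoreML:  {throughput_gain:.1f}x")
print(f"delta vs full recompile:    {recompile_speedup:.1f}x")
print("Espresso figures pass pilot:",
      meets_pilot_criteria(ESPRESSO_MS_PER_TOKEN, ESPRESSO_TOK_PER_S))
print("CoreML figures pass pilot:  ",
      meets_pilot_criteria(COREML_MS_PER_TOKEN, COREML_TOK_PER_S))
```

Both the latency and throughput ratios round to the 4.7x headline, and the recompile ratio rounds to the reported 8.5×. Because the two success criteria are joined by "or", the throughput branch alone can pass a run, and its 200 tok/s threshold sits only slightly above the 196 tok/s reported for CoreML; a pilot team may want to decide up front which branch it actually intends to hold itself to.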
