Espresso bypasses Core ML
Developer Christopher Karani demoed “Espresso,” a path that bypasses Core ML to hit the Neural Engine directly, delivering ~1.08 ms/token versus 5.09 ms/token with the standard path, a ~4.7x speedup. He posted follow-up benchmarks showing there's still headroom for CoreAI improvements on Apple Silicon [follow-up](https://x.com/i/status/2032475954317054407).
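The quoted speedup follows directly from the two latencies. A minimal sketch of the arithmetic, assuming both figures are steady-state per-token times (the variable and function names here are illustrative, not taken from Karani's code):

```python
# Per-token latencies quoted in the demo, in milliseconds per token.
espresso_ms = 1.08  # direct Neural Engine path
coreml_ms = 5.09    # standard Core ML path


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Speedup is the ratio of the slower latency to the faster one.
speedup = coreml_ms / espresso_ms

print(f"speedup:  {speedup:.2f}x")
print(f"Espresso: {tokens_per_second(espresso_ms):.0f} tokens/s")
print(f"Core ML:  {tokens_per_second(coreml_ms):.0f} tokens/s")
```

Under that assumption, the ratio works out to about 4.71x, or roughly 926 versus 196 tokens per second.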
Karani’s public repository labels Espresso as “Backpropagation and exact token generation on Apple’s Neural Engine,” implemented via reverse‑engineered private ANE APIs, and publishes code to exercise those paths. Independent academic work named Orion documented an end‑to‑end system that bypasses Core ML by invoking Apple’s private ANEClient and ANECompiler APIs to run LLM inference and resumable multi‑step training directly on the ANE. Apple’s own ML research pages state that on‑device models are tuned for Apple silicon, and the company’s M5 announcement specifies a 16‑core Neural Engine, Neural Accelerators in the GPU, and unified memory bandwidth raised to 153 GB/s, hardware changes that shift the ANE performance envelope. Apple’s coremltools documentation describes W8A8 and INT4 quantization modes and an int8‑int8 compute path for newer chips, a concrete software optimization route that complements direct‑ANE approaches like Espresso and Orion.
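To make the W8A8 idea concrete: both weights and activations are stored as 8‑bit integers, the multiply‑accumulate runs on integers, and the result is rescaled to floating point once at the end. The sketch below is a plain‑Python illustration of that principle using symmetric per‑tensor scales. It does not use coremltools or any Apple API, and the helper names `quantize_int8` and `w8a8_dot` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int8 codes, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product with both operands in int8 and an integer accumulator."""
    w_q, w_scale = quantize_int8(weights)
    a_q, a_scale = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(w_q, a_q))  # integer multiply-accumulate
    return acc * w_scale * a_scale  # rescale to float once, at the end


# Illustrative values only, not taken from any benchmark.
weights = [0.50, -1.20, 0.03, 0.80]
activations = [1.00, 0.25, -2.00, 0.40]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"float: {exact:.4f}  w8a8: {approx:.4f}  abs error: {abs(exact - approx):.4f}")
```

The quantized result differs from the float result by a small rounding error; that accuracy cost is what W8A8 trades for the integer compute path.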