Espresso smokes Core ML
A developer benchmark shows the Espresso inference framework running about 4.7× faster on Apple’s Neural Engine: roughly 1.08 ms/token versus about 5.09 ms/token for Core ML in the posted test. chris_karani.
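The headline ratio follows directly from the two per-token latencies. A quick check in Python, using only the figures quoted above; the variable names are illustrative, and the tokens-per-second values assume per-token latency stays constant, which the post does not state:

```python
# Per-token latencies quoted in the benchmark post (milliseconds per token).
espresso_ms_per_token = 1.08
coreml_ms_per_token = 5.09

# Speedup is the slower latency divided by the faster one.
speedup = coreml_ms_per_token / espresso_ms_per_token

# Implied throughput, ASSUMING latency is constant for every token.
espresso_tokens_per_s = 1000.0 / espresso_ms_per_token
coreml_tokens_per_s = 1000.0 / coreml_ms_per_token

print(f"speedup: {speedup:.2f}x")
print(f"Espresso: {espresso_tokens_per_s:.0f} tokens/s")
print(f"Core ML: {coreml_tokens_per_s:.0f} tokens/s")
```

That works out to a ratio of about 4.71, or roughly 926 tokens/s against 196 tokens/s under the constant-latency assumption.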
“Espresso” here refers to Apple’s private C++ inference runtime that underpins Core ML model execution on-device, not an independent third-party project. stackoverflow.com Core ML dispatches work to three engines implemented inside Espresso (the ANE engine for the Neural Engine, MPSEngine for the GPU, and the BNNS/CPU engines), and it can split a single model across those engines at runtime. github.com A recent systems paper, Orion, describes bypassing Core ML by calling Apple’s private _ANEClient and _ANECompiler APIs to run models directly on the ANE, and presents an end-to-end pipeline that it claims improves LLM inference performance. arxiv.org Reverse-engineering projects and community tooling (for example mdaiter/ane) document Espresso/ANE internals, note that ANE access can require entitlements such as com.apple.developer.coreml.neural-engine-access, and surface the silent failures and per-operation runtime choices that determine whether, and how fast, a model actually runs on the Neural Engine. github.com
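Because engine selection is decided at runtime and fallbacks can be silent, one rough way to see where a model is likely running is to time it under each public compute-unit setting. The sketch below is not taken from the benchmark or the Orion paper: it uses the public coremltools API rather than the private paths, the helper names `median_latency_ms` and `compare_compute_units` are hypothetical, and it assumes macOS 13 or later with a coremltools version that provides `ComputeUnit.CPU_AND_NE`.

```python
import time
from statistics import median


def median_latency_ms(predict, inputs, warmup=3, runs=20):
    """Median wall-clock latency of predict(inputs), in milliseconds."""
    # Warm-up calls absorb one-time costs such as model compilation and caching.
    for _ in range(warmup):
        predict(inputs)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(inputs)
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def compare_compute_units(model_path, inputs):
    """Time one Core ML model under each public compute-unit setting."""
    # Imported lazily so the timing helper above works without coremltools.
    import coremltools as ct

    results = {}
    for unit in (
        ct.ComputeUnit.CPU_ONLY,
        ct.ComputeUnit.CPU_AND_GPU,
        ct.ComputeUnit.CPU_AND_NE,
        ct.ComputeUnit.ALL,
    ):
        model = ct.models.MLModel(model_path, compute_units=unit)
        results[unit.name] = median_latency_ms(model.predict, inputs)
    return results
```

If the `CPU_AND_NE` timing is no better than `CPU_ONLY`, that hints, but does not prove, that some or all of the model fell back off the Neural Engine.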