Espresso bypasses Core ML

Published March 14, 2026 by The Daily Scout

What happened

Developer Christopher Karani demoed “Espresso,” a path that bypasses Core ML to hit the Neural Engine directly, delivering about 1.08 ms/token versus 5.09 ms/token with the standard path, a roughly 4.7x speedup. He posted follow-up benchmarks showing there's still headroom for CoreAI improvements on Apple Silicon ([follow-up](https://x.com/i/status/2032475954317054407)).
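As a quick sanity check on those figures, the short Python sketch below derives the speedup and the implied tokens-per-second rates from the two reported latencies. The function names are illustrative only and do not come from Karani's repository; the only inputs are the two numbers quoted above.

```python
def speedup(baseline_ms_per_token: float, fast_ms_per_token: float) -> float:
    """Ratio of baseline latency to the faster latency."""
    return baseline_ms_per_token / fast_ms_per_token


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Latencies as reported in the demo (milliseconds per token).
CORE_ML_MS = 5.09
ESPRESSO_MS = 1.08

if __name__ == "__main__":
    print(f"speedup: {speedup(CORE_ML_MS, ESPRESSO_MS):.2f}x")
    print(f"standard path: {tokens_per_second(CORE_ML_MS):.0f} tokens/s")
    print(f"Espresso path: {tokens_per_second(ESPRESSO_MS):.0f} tokens/s")
```

The ratio works out to a little over 4.7, which matches the rounded speedup in the report.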

Why it matters

Karani’s public repository labels Espresso as “Backpropagation and exact token generation on Apple’s Neural Engine,” implemented via reverse-engineered private ANE APIs, and publishes code to exercise those paths. An independent academic project, Orion, documents an end-to-end system that bypasses Core ML by invoking Apple’s private ANEClient and ANECompiler APIs to run LLM inference and resumable multi-step training directly on the ANE. Apple’s own ML research pages state that on-device models are tuned for Apple silicon, while the company’s M5 announcement specifies a 16-core Neural Engine, Neural Accelerators in the GPU, and a unified memory bandwidth increase to 153 GB/s. These hardware changes shift the ANE performance envelope. Apple’s coremltools documentation describes W8A8 and INT4 quantization modes and an int8-int8 compute path for newer chips, giving a concrete software optimization route that complements direct-ANE approaches like Espresso and Orion.
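To make the W8A8 idea concrete, here is a minimal pure-Python sketch of the general technique: weights and activations are both quantized to 8-bit integers, multiplied and accumulated as integers, then rescaled to floating point. This illustrates symmetric per-tensor quantization in general, under the assumption of a single scale per tensor. It is not the coremltools API and not how the Neural Engine implements its int8-int8 path, and all function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Returns the quantized integers and the scale needed to recover floats.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def float_dot(weights, activations):
    """Reference dot product in floating point."""
    return sum(w * a for w, a in zip(weights, activations))


def w8a8_dot(weights, activations):
    """Dot product with 8-bit weights and 8-bit activations (W8A8)."""
    q_weights, weight_scale = quantize_int8(weights)
    q_activations, activation_scale = quantize_int8(activations)
    # Integer multiply-accumulate: this is the int8-int8 part.
    accumulator = sum(w * a for w, a in zip(q_weights, q_activations))
    # A single rescale converts the integer result back to float.
    return accumulator * weight_scale * activation_scale


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    activations = [2.0, 1.0, -4.0]
    print("float result:", float_dot(weights, activations))
    print("W8A8 result: ", w8a8_dot(weights, activations))
```

The two results differ slightly because of rounding during quantization. That small accuracy cost is the trade-off for doing the bulk of the arithmetic in integers.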

Key numbers

Espresso: about 1.08 ms/token versus 5.09 ms/token on the standard Core ML path, a roughly 4.7x speedup.
Apple M5: a 16-core Neural Engine, Neural Accelerators in the GPU, and 153 GB/s of unified memory bandwidth.
coremltools: W8A8 and INT4 quantization modes, plus an int8-int8 compute path for newer chips.
