Espresso bypasses Core ML

Published by The Daily Scout

What happened

Developer Christopher Karani demoed “Espresso,” a path that bypasses Core ML and targets the Neural Engine directly, delivering ~1.08 ms/token versus 5.09 ms/token with the standard path, a ~4.7x speedup. He posted follow-up benchmarks showing there is still headroom for CoreAI improvements on Apple Silicon ([follow-up](https://x.com/i/status/2032475954317054407)).
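The headline figures can be sanity-checked with a few lines of arithmetic. The sketch below only uses the two latencies quoted above; the function and constant names are illustrative and do not come from Espresso or from Apple.

```python
# Per-token latencies quoted in the brief, in milliseconds per token.
ESPRESSO_MS_PER_TOKEN = 1.08   # direct Neural Engine path
CORE_ML_MS_PER_TOKEN = 5.09    # standard Core ML path


def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency into throughput."""
    return 1000.0 / ms_per_token


if __name__ == "__main__":
    ratio = speedup(CORE_ML_MS_PER_TOKEN, ESPRESSO_MS_PER_TOKEN)
    print(f"speedup: {ratio:.2f}x")
    print(f"Espresso: {tokens_per_second(ESPRESSO_MS_PER_TOKEN):.0f} tokens/s")
    print(f"Core ML:  {tokens_per_second(CORE_ML_MS_PER_TOKEN):.0f} tokens/s")
```

On these inputs the ratio works out to roughly 4.71, consistent with the rounded ~4.7x claim, and the latencies correspond to roughly 926 versus 196 tokens per second.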

Why it matters

Karani’s public repository labels Espresso as “Backpropagation and exact token generation on Apple’s Neural Engine,” implemented via reverse-engineered private ANE APIs, and publishes code to exercise those paths. Independent academic work named Orion documented an end-to-end system that bypasses Core ML by invoking Apple’s private ANEClient and ANECompiler APIs to run LLM inference and resumable multi-step training directly on the ANE. Apple’s own ML research pages state that on-device models are tuned for Apple silicon, while the company’s M5 announcement specifies a 16-core Neural Engine, Neural Accelerators in the GPU, and a unified memory bandwidth increase to 153 GB/s. These hardware changes shift the ANE performance envelope. Apple’s coremltools documentation describes W8A8 and INT4 quantization modes and an int8-int8 compute path for newer chips, giving a concrete software optimization route that complements direct-ANE approaches like Espresso and Orion.
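For readers unfamiliar with the term, W8A8 means both weights and activations are stored as 8-bit integers, so the multiply-accumulate runs on integers and is rescaled to floating point once at the end. The sketch below is a conceptual illustration in plain Python, assuming symmetric per-tensor scaling; it is not the coremltools API and not how Espresso or Orion are implemented, and the helper names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> (int8 values, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """W8A8 dot product: integer accumulate, one float rescale at the end."""
    q_weights, weight_scale = quantize_int8(weights)
    q_activations, activation_scale = quantize_int8(activations)
    accumulator = sum(w * a for w, a in zip(q_weights, q_activations))
    return accumulator * weight_scale * activation_scale


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    activations = [2.0, 1.0, -4.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("float result:", exact)
    print("int8 result: ", int8_dot(weights, activations))
```

The integer result tracks the float result closely while each operand needs a quarter of the storage of a 32-bit float, which is the general reason an int8-int8 compute path can help on memory-bound workloads.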

Key numbers

  • Espresso delivers ~1.08 ms/token versus 5.09 ms/token with the standard Core ML path, a ~4.7x speedup.
  • Karani’s follow-up benchmarks show there is still headroom for CoreAI improvements on Apple Silicon ([follow-up](https://x.com/i/status/2032475954317054407)).
  • Apple’s coremltools documentation describes W8A8 and INT4 quantization modes and an int8-int8 compute path for newer chips, a software optimization route that complements direct-ANE approaches like Espresso and Orion.
