Apple Silicon running LLMs locally
Several recent demos and open repos show serious on-device LLM work (fine-tuning and inference optimised for the Apple Neural Engine), with models like Gemma 4 running at high token rates on iPhone 17 Pro and M5-class Mac hardware. There are practical scale examples too: a rack of 48 Mac minis used to transcribe podcasts locally demonstrates that clustered Apple Silicon can replace some cloud workflows for certain workloads (x.com). Together these signals show on-device and local-cluster ML is moving from hobbyist proof-of-concept into usable throughput for pro workflows (x.com).
Apple Silicon is turning local artificial intelligence from a demo into a workflow

A year ago, “run it locally” usually meant a toy chatbot on a laptop. In April 2026, the picture looks different: Apple’s own machine learning stack now supports both inference and fine-tuning on Apple silicon, open projects are targeting the Apple Neural Engine directly, and recent demos show Google’s Gemma 4 running offline on hardware as small as an iPhone 17 Pro. A separate production example goes further: Overcast developer Marco Arment says a rack of 48 Mac minis is now generating podcast transcripts without relying on cloud artificial intelligence services. (machinelearning.apple.com)

To understand why that matters, start with what a large language model actually does on a computer. A model is mostly a huge pile of numbers called parameters, and every answer requires the machine to multiply and combine those numbers over and over. That work is less like opening a file and more like doing millions of tiny spreadsheet operations in parallel, which is why specialized chips matter so much. (machinelearning.apple.com)

The first bottleneck is memory. A model has to fit somewhere while it runs, and moving its weights back and forth between separate memory pools wastes time and power. Apple silicon uses unified memory, which lets the central processor and graphics processor work from the same pool instead of copying data around, and Apple says MLX is built to take advantage of that design for training and inference. (machinelearning.apple.com)

The second bottleneck is math throughput. Apple’s Neural Engine, and now the Neural Accelerators Apple added to the M5 graphics processor path, are designed to speed up matrix multiplication, the exact kind of repeated arithmetic that drives modern model inference.
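
To make "multiply and combine those numbers over and over" concrete, here is a minimal Python sketch of the matrix-vector product that sits inside every model layer, with a counter for the multiply-add operations it performs. The function names and the layer shapes are illustrative assumptions, not taken from any real model or from Apple's software; real hardware runs these loops in parallel rather than one step at a time.

```python
def matvec(weights, x):
    """Multiply a (rows x cols) weight matrix by a vector, counting multiply-adds."""
    macs = 0
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi      # one multiply-add per stored parameter
            macs += 1
        out.append(acc)
    return out, macs


def macs_per_token(layer_shapes):
    """Sum rows * cols over every weight matrix touched for one token."""
    return sum(rows * cols for rows, cols in layer_shapes)


# A tiny 2 x 3 "layer" applied to a 3-element input.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.0, -0.5]]
x = [2.0, 1.0, 0.5]
y, macs = matvec(weights, x)
print(y, macs)  # [1.0, 2.75] 6

# Hypothetical layer shapes, chosen only to show how quickly the count grows.
print(macs_per_token([(4096, 4096), (4096, 16384)]))  # 83886080
```

Every parameter is read and used once per token, which is why both the memory system and the arithmetic units show up as bottlenecks.
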
Apple says the latest macOS beta lets MLX use those Neural Accelerators on M5 systems, which is a sign that local model performance is becoming a first-class target rather than a side effect of fast consumer hardware. (machinelearning.apple.com)

The software layer matters as much as the chip. Apple describes MLX as an open-source array framework tuned for Apple silicon, with support for neural network training and inference, while MLX LM sits on top to run and fine-tune language models from repositories such as Hugging Face. Apple also highlights built-in quantization, which shrinks models by storing parameters at lower precision so they fit into less memory and run faster on local devices. (machinelearning.apple.com)

That compression step is one reason “local” is suddenly more plausible. Google’s Gemma 4 family, released March 31, 2026, spans smaller E2B and E4B models for mobile and edge use, plus larger 26B A4B and 31B versions for heavier hardware. Google says Gemma 4 supports more than 140 languages, offers up to a 256K-token context window, and is designed for on-device and edge deployment under an Apache 2.0 license. (developers.googleblog.com)

That is the backdrop for the recent iPhone and Mac demos circulating this week. One widely discussed example, referenced on Google’s developer forum on April 5, 2026, describes Gemma 4 E2B running fully offline on an iPhone 17 Pro at about 40 tokens per second with MLX optimization for Apple silicon. Even allowing for the usual caution around social-media benchmarks, that speed is well past “proof of life” territory and into the range where a phone can feel conversational instead of sluggish. (discuss.ai.google.dev)

On the Mac side, Apple’s own November 2025 research note already pointed in the same direction.
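
The quantization idea mentioned earlier, storing parameters at lower precision, can be sketched in a few lines of Python. This is a simplified symmetric scheme with a single shared scale, written for illustration only; it is not MLX's implementation, and the 2-billion-parameter figure is a hypothetical round number rather than the actual size of any Gemma model.

```python
def quantize(values, bits=4):
    """Symmetric quantization: map floats to small signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0  # avoid dividing by zero for all-zero input
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def weight_bytes(num_params, bits):
    """Storage for the weights alone, ignoring scales and other overhead."""
    return num_params * bits // 8


codes, scale = quantize([0.875, -0.375, 0.3, 0.0], bits=4)
print(codes, scale)              # [7, -3, 2, 0] 0.125
print(dequantize(codes, scale))  # [0.875, -0.375, 0.25, 0.0]

# Hypothetical 2-billion-parameter model at 16-bit versus 4-bit precision.
print(weight_bytes(2_000_000_000, 16))  # 4000000000
print(weight_bytes(2_000_000_000, 4))   # 1000000000
```

The third weight comes back as 0.25 instead of 0.3, which is the precision traded away for a four-fold reduction in storage. Production schemes usually quantize weights in small groups, each with its own scale, to keep that error down.
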
Apple said MLX on the newest macOS beta can use the M5 chip’s Neural Accelerators and explicitly framed the stack as a way for researchers to try new inference and fine-tuning techniques privately on their own hardware. That is a notable shift in tone: Apple is not just pitching local models as a consumer privacy feature, but as a developer and research workflow. (machinelearning.apple.com)

Open-source developers are pushing even harder. The GitHub project NeuralForge presents itself as a macOS app for fine-tuning transformer models directly on a Mac using the Apple Neural Engine, built on top of a reverse-engineered AppleNeuralEngine framework implementation. Its README lists local training, Low-Rank Adaptation fine-tuning, quantization, export to multiple runtimes, and even distributed training across multiple Macs. (github.com)

That last detail is important because it connects the laptop story to the small-cluster story. If a single Mac can handle useful inference and a phone can run a compact model offline, then a room full of inexpensive Apple desktops starts to look like a specialized local compute cluster rather than a pile of consumer gadgets. The economics change when the workload is steady and predictable. (forums.appleinsider.com)

Marco Arment’s Overcast setup is the clearest real-world example so far. According to AppleInsider’s April 7, 2026 report, a rack of 48 Mac minis now powers podcast transcripts for Overcast, after Arment decided cloud pricing would have cost thousands of dollars per day. Instead of sending every transcription request to a metered AI API, Overcast runs Apple speech recognition models on its own Apple silicon backend and spreads jobs across the cluster. (forums.appleinsider.com)

The details of that workload make the choice even more revealing.
Podcast feeds often include dynamic ad insertion, which means different listeners can receive slightly different audio for the same episode. AppleInsider reports that Arment used audio fingerprinting and de-duplication so Overcast can generate one transcript and map it across multiple episode variants, cutting redundant processing while keeping transcript output consistent. (forums.appleinsider.com)

This does not mean cloud artificial intelligence is going away. The biggest models still need more memory, more power, and more operational tooling than a phone or a Mac can provide, and some Apple Neural Engine projects still rely on reverse-engineered interfaces rather than official training support. NeuralForge itself says it is built on private framework access, which shows both how much progress the community has made and where the platform limits still are. (github.com)

But the line has clearly moved. Apple now has an official framework for local model experimentation and fine-tuning on Apple silicon, Google is shipping a fresh open model family aimed at edge deployment, social demos are showing phone-class offline generation speeds that feel practical, and at least one commercial app is already replacing a cloud speech pipeline with a 48-machine Apple silicon rack. Put together, those are the signs of a market moving from “can this run locally?” to “which local workloads are cheaper, faster, or more private than the cloud?” (machinelearning.apple.com)
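
The fingerprint-and-de-duplicate idea reported for Overcast can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the class and function names are hypothetical, the similarity threshold is arbitrary, and exact hashes of byte chunks stand in for real acoustic fingerprints, which must tolerate re-encoding and timing shifts. It is not a description of Overcast's actual implementation.

```python
import hashlib


def fingerprint(chunks):
    """Stand-in fingerprint: the set of short hashes of an episode's audio chunks."""
    return {hashlib.sha256(c).hexdigest()[:16] for c in chunks}


def similarity(a, b):
    """Jaccard overlap between two fingerprints (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class TranscriptCache:
    """Reuse one transcript for variants whose fingerprints mostly overlap."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold   # hypothetical cut-off for "same episode"
        self.entries = []            # list of (fingerprint, transcript)
        self.transcribe_calls = 0

    def get(self, chunks, transcribe):
        fp = fingerprint(chunks)
        for known_fp, transcript in self.entries:
            if similarity(fp, known_fp) >= self.threshold:
                return transcript    # variant of an episode already transcribed
        self.transcribe_calls += 1
        transcript = transcribe(chunks)
        self.entries.append((fp, transcript))
        return transcript


def fake_transcribe(chunks):
    """Placeholder for an expensive speech-recognition job."""
    return " ".join(c.decode() for c in chunks)


variant_a = [b"intro", b"ad-for-shoes", b"part-1", b"part-2", b"outro"]
variant_b = [b"intro", b"ad-for-coffee", b"part-1", b"part-2", b"outro"]
other_show = [b"different", b"episode"]

cache = TranscriptCache()
t_a = cache.get(variant_a, fake_transcribe)
t_b = cache.get(variant_b, fake_transcribe)   # shares 4 of 6 distinct chunks with variant_a
t_c = cache.get(other_show, fake_transcribe)
print(t_a == t_b, cache.transcribe_calls)     # True 2
```

Three requests trigger only two transcription jobs, because the two ad variants of the same episode resolve to one cached transcript.
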