On‑device AI is shipping
Developers are pushing real on‑device AI into products: MLX‑VLM runs vision‑language tasks such as image Q&A and video summarization directly on Apple Silicon, while tools like apfel expose the LLM built into macOS through a CLI so apps can call it offline (x.com). The trend pairs privacy and latency wins with technical workarounds, including offline agent loops, encrypted GPU‑accelerated inference experiments (TFHE), and local logging, that make enterprise and healthcare use cases more viable without cloud dependency.
For years, “on-device AI” mostly meant small tricks. A phone blurred a background. A laptop transcribed a voice note. The real model still lived in a data center. That boundary is starting to move. Developers are now shipping software that runs useful generative models directly on consumer hardware, especially on Apple Silicon, and they are doing it with tools built for real products rather than demos (developer.apple.com, opensource.apple.com).
Apple helped create this moment by opening up more of its local stack. Its Foundation Models framework gives apps direct access to the on-device language model behind Apple Intelligence on iPhone, iPad, Mac, and Vision Pro, with support for tasks like text generation, summarization, and tool calling without an internet connection (developer.apple.com, developer.apple.com). At the same time, Apple has kept pushing MLX, its machine learning framework for Apple Silicon, which is tuned for the chips’ unified memory architecture and newer GPU features such as the M5 family’s Neural Accelerators (opensource.apple.com, machinelearning.apple.com). That combination has made the Mac an unusually fertile place for local AI hacking.
One of the clearest examples is MLX-VLM, an open-source package for running and fine-tuning vision-language models and multimodal models with audio and video support on a Mac using MLX. The project is not a toy. Its maintainers explicitly pitch image question answering, video understanding, and fine-tuning workflows on local hardware, and the repository has grown into a substantial developer tool with thousands of stars and hundreds of forks (github.com). What matters is not just that it works. It works on a machine people already own.
The same shift is happening one layer higher, where developers are trying to turn Apple’s built-in models into ordinary software plumbing.
A project called apfel wraps Apple’s on-device foundation model in a command-line interface and local HTTP server, so other apps can call it like a service while keeping inference on the machine. The pitch is blunt: no API keys, no cloud, no per-token billing, no network calls. apfel depends on Apple’s Foundation Models framework and targets macOS 26 and newer, which means this is riding on an official platform capability rather than a jailbreak or private API (github.com, developer.apple.com).
Once models run locally, the engineering priorities change. Latency drops because there is no round trip to a server. Privacy improves because raw data can stay on the device. And cost becomes easier to predict because a product team is no longer paying for every token sent to a remote API. Apple is selling that exact package to developers, describing its local model access as a way to build experiences that are “smart, private, and work without internet connectivity” (developer.apple.com).
Microsoft is making a similar argument from the enterprise side. In a recent post on HIPAA-compliant medical transcription with local AI, it framed cloud calls as immediate compliance and audit concerns because protected health information leaves the device and lands on third-party infrastructure (techcommunity.microsoft.com).
That is why the interesting part of this trend is not the model alone. It is the surrounding machinery. Developers are building offline agent loops that can inspect files, propose changes, run tests, and keep working without sending code or documents off-machine. Many of these projects are still rough, but they show the pattern clearly: local models plus local tools plus local state (github.com, github.com). The model stops being a chatbot and starts looking more like a system component.
There is also a more experimental edge to this work.
Fully homomorphic encryption is still too expensive for most mainstream inference, but the tooling is getting faster and more practical. Zama’s TFHE-rs documentation now includes GPU execution paths for encrypted computation, where server keys can be decompressed onto GPUs for faster operations on ciphertexts (docs.zama.ai). Older projects like cuFHE showed years ago that TFHE-style schemes could get major speedups from GPUs, even if the hardware and software stack were far from product-ready (github.com).
That does not mean encrypted local LLMs are suddenly solved. It means the old objection, that privacy-preserving inference is purely academic, is getting harder to say with a straight face.
The result is a different shape of AI product. Not a thin client talking to a giant model in the cloud, but an app that carries more of its own intelligence and leaves a smaller data trail behind. On a recent Mac, that can now mean asking a local vision-language model questions about an image, summarizing a video on the device, or piping a prompt through a CLI that talks to Apple’s built-in model without ever opening a socket (github.com, github.com).
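The offline agent loops described above share one shape: a local model proposes an action, a local tool carries it out, and the result is appended to local state before the next turn. A minimal Python sketch of that pattern follows. Everything in it is an assumption made for illustration rather than code from any project mentioned here: the `READ`/`DONE` text protocol, the `run_agent` and `read_file` helpers, and `scripted_model`, a stub standing in for whatever on-device model an app would actually call.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for an on-device model: prompt text in, reply text out.
LocalModel = Callable[[str], str]


@dataclass
class AgentState:
    """Local state: the whole transcript lives in this process, nowhere else."""
    history: list[str] = field(default_factory=list)


def read_file(root: Path, name: str) -> str:
    """Local tool: read a file, refusing paths that escape the workspace."""
    path = (root / name).resolve()
    if not path.is_relative_to(root.resolve()):
        return "error: path outside workspace"
    return path.read_text() if path.is_file() else "error: no such file"


def run_agent(model: LocalModel, root: Path, task: str, max_steps: int = 5):
    """Alternate model turns and tool calls until the model replies DONE."""
    state = AgentState(history=[f"TASK {task}"])
    for _ in range(max_steps):
        reply = model("\n".join(state.history)).strip()
        state.history.append(f"MODEL {reply}")
        verb, _, arg = reply.partition(" ")
        if verb == "READ":
            state.history.append(f"RESULT {read_file(root, arg)}")
        elif verb == "DONE":
            return arg, state
        else:
            state.history.append("RESULT error: unknown action")
    return "stopped: step limit reached", state


def scripted_model(prompt: str) -> str:
    """Stub model (illustrative only): read notes.txt once, then report its length."""
    results = [line for line in prompt.splitlines() if line.startswith("RESULT ")]
    if not results:
        return "READ notes.txt"
    return f"DONE {len(results[-1]) - len('RESULT ')} characters"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        root = Path(workspace)
        (root / "notes.txt").write_text("hello local model")
        answer, _ = run_agent(scripted_model, root, "measure notes.txt")
        print(answer)  # prints "17 characters"
```

Swapping `scripted_model` for a function that calls a real on-device model would leave the loop unchanged, which is the point: the model becomes one replaceable component among local tools and local state. A fuller loop would add tools for proposing edits and running tests.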