Apple prioritizes local model inference

- Apple has spent the past year turning Apple Intelligence into a local-first system, with developers now able to call its on-device model directly.
- Apple’s own technical report says its device model is about 3 billion parameters, while newer developer tools expose offline generation and tool calling.
- The shift puts privacy and latency ahead of sheer model size, while cloud work stays on Private Cloud Compute. (apple.com)

A language model is software that predicts the next token, like autocomplete stretched into paragraphs, summaries, and app actions. Apple has spent the last year pushing more of that work onto iPhones, iPads, and Macs instead of sending every request to remote servers. (machinelearning.apple.com) (developer.apple.com)

Apple laid out that split on June 10, 2024, when it introduced Apple Intelligence with two core systems: a roughly 3 billion-parameter on-device model and a larger server model for Private Cloud Compute. Apple said the local model handles everyday tasks, while the server model runs on Apple silicon in cases that need more scale. (machinelearning.apple.com)

The company widened that strategy on June 9, 2025, when it announced the Foundation Models framework at Worldwide Developers Conference 2025. That framework lets developers call the same on-device model that powers Apple Intelligence, including guided generation and tool calling, on iOS 26, iPadOS 26, macOS 26, and visionOS 26. (apple.com) (developer.apple.com)

Tool calling is a simple idea with a technical name: the model decides when to ask another piece of software for help. Apple’s documentation says the on-device model can trigger code written by the app developer to search a database, fetch app data, or complete a task. (developer.apple.com)

That design keeps Apple’s pitch consistent: short, personal, routine requests stay close to the user, and bigger jobs move to Apple’s own cloud only when needed. Apple says the local model works only when Apple Intelligence is enabled on a supported device, and it can operate without internet connectivity. (developer.apple.com 1) (developer.apple.com 2)

Apple’s 2025 technical report added more detail on how it squeezed a multimodal model onto consumer hardware. The paper says the on-device model is optimized for Apple silicon with techniques including key-value cache sharing and 2-bit quantization-aware training, while the server model uses a mixture-of-experts design on Private Cloud Compute. (arxiv.org)

The tradeoff is size. A local model has tighter memory and power limits than a hyperscale data center model, so Apple is betting that responsiveness, battery-aware design, and data minimization will win many everyday use cases even if the largest cloud systems remain stronger on open-ended reasoning. (machinelearning.apple.com) (arxiv.org)

That same local-first pattern is spreading beyond Apple. On April 22, 2026, OpenAI released Privacy Filter, an open-weight model for detecting and redacting personally identifiable information, and said it is small enough to run locally so sensitive text can be filtered before it leaves a machine. (openai.com)

Put together, those moves shift where artificial intelligence work happens. Apple is exposing a built-in local model to app makers, OpenAI is shipping a local privacy model for preprocessing, and the remaining cloud layer is increasingly reserved for the requests that do not fit on the device. (apple.com) (openai.com)
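
For readers who want to see the tool-calling idea in concrete terms, here is a minimal Python sketch. It is not Apple's Foundation Models API. The note data, the `search_notes` tool, and the `stub_model` function are all hypothetical stand-ins, and a simple keyword check takes the place of a real model's decision. It only illustrates the control flow: the model either answers directly or asks the app's own code for help, then uses the result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical app data; a real app would query its own database.
NOTES = {
    "groceries": "milk, eggs, coffee",
    "trip": "flight leaves Friday at 9am",
}


def search_notes(query: str) -> str:
    """App-defined tool: return the first note whose title appears in the query."""
    for title, body in NOTES.items():
        if title in query.lower():
            return body
    return "no matching note"


@dataclass
class ToolCall:
    name: str       # which registered tool the model wants to run
    argument: str   # the text passed to that tool


def stub_model(prompt: str) -> Optional[ToolCall]:
    """Stand-in for a real model (assumption): request a tool when notes are mentioned."""
    if "note" in prompt.lower():
        return ToolCall(name="search_notes", argument=prompt)
    return None


# Registry of tools the app exposes to the model.
TOOLS: Dict[str, Callable[[str], str]] = {"search_notes": search_notes}


def respond(prompt: str) -> str:
    """Run one turn: let the model decide, call the tool if asked, then answer."""
    call = stub_model(prompt)
    if call is None:
        return "model answer (no tool needed)"
    result = TOOLS[call.name](call.argument)
    return f"model answer using tool result: {result}"


if __name__ == "__main__":
    print(respond("What is in my groceries note?"))
    print(respond("Say hello"))
```

In this sketch the app, not the model, owns the data and the lookup code. The model's only job is to decide that a lookup is needed, which is the split the article describes.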
