On‑device AI tooling matures

Open projects and small vendors are shipping Apple-silicon tooling: oMLX offers continuous batching and KV-cache offload to SSD for local servers, LM Studio has acquired an iPhone local-AI app, and Core ML optimizers such as Anemll are being highlighted for iPhone inference.

Running artificial intelligence models on Apple devices is shifting from hobby demos to full software stacks for Macs and iPhones. In the past week, three separate projects pushed that shift forward with new Apple-silicon tooling. (opensource.apple.com)

Apple’s MLX framework is the base layer for part of this stack: it is an array framework for machine learning on Apple silicon, designed around the chips’ unified memory. On April 8, LM Studio said it had acquired Locally AI, an app that runs models on iPhone, iPad, and Mac, and said creator Adrien Grondin will join the team to lead native experiences across devices. (opensource.apple.com, lmstudio.ai)

Another piece is the model’s key-value (KV) cache, a saved working memory that lets a model avoid recomputing the same prompt over and over. oMLX, a Mac inference server built on MLX, says it stores that cache in two tiers (hot blocks in random access memory, cold blocks on solid-state storage) and handles concurrent requests with continuous batching. (omlx.ai, github.com)

That changes what “local” means on Apple hardware. Instead of one chat window running one model, vendors are now pitching Macs as small personal servers that can feed coding tools through OpenAI-compatible and Anthropic-compatible application programming interfaces. (omlx.ai)

The phone side is moving too. Locally AI’s App Store listing says the app runs models such as Llama, Gemma, and Qwen on iPhone and iPad, offline and without login or data collection, and says it is powered by Apple MLX. (apps.apple.com)

A third layer is Apple’s own inference runtime, Core ML, which Apple says supports generative models, stateful models, and weight compression for on-device execution across CPU, GPU, and Neural Engine hardware. That is where projects like Anemll are aiming: the Anemll repository describes itself as an open-source effort to port large language models to the Apple Neural Engine, and its latest listed release is version 0.3.5 Beta. (developer.apple.com, github.com)

In plain terms, MLX is being used to run models well on Macs, while Core ML and Neural Engine tooling are being tuned to squeeze models onto phones with lower power draw. Apple’s own documentation says Core ML runs models fully on device and is built to minimize memory use and power consumption. (developer.apple.com, machinelearning.apple.com)

oMLX’s pitch is that long prompts do not have to be rebuilt from scratch every time an agent changes direction. Its site says second-turn time to first token can fall below five seconds on long contexts by restoring cache blocks from disk, and says the software supports Apple silicon Macs on macOS 15 or later, with 16 gigabytes of memory minimum and 64 gigabytes or more recommended for larger models. (omlx.ai)

LM Studio’s move suggests desktop local-AI vendors now see mobile as part of the same product, not a side app. The company said that, starting from the April 8 acquisition, it plans “new ways” to use models and agents across a user’s own devices. (lmstudio.ai)

Taken together, the new pieces describe a fuller Apple-device stack: MLX for Mac-native serving, mobile apps that keep inference on iPhone and iPad, and Core ML optimizers aimed at the Neural Engine. The result is not one launch, but a tighter path for running open models locally across Apple hardware. (opensource.apple.com, lmstudio.ai, omlx.ai, developer.apple.com, github.com)
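
For readers who want a concrete picture of the two-tier cache idea described for oMLX (hot blocks in memory, cold blocks on solid-state storage), here is a minimal Python sketch. It is an illustration only, not oMLX's code: the class name `TwoTierCache`, the prefix-hash keying scheme, the least-recently-used eviction policy, and the `.blk` file layout are all assumptions made for this example, and the "blocks" are plain Python objects instead of real attention state.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Toy two-tier cache: hot blocks in RAM (LRU order), cold blocks on disk.

    Hypothetical illustration; assumes hot_capacity >= 1.
    """

    def __init__(self, hot_capacity, cold_dir):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # key -> block, least recently used first
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key_for(prefix_tokens):
        # Assumed keying scheme: hash the token prefix that the block covers.
        return hashlib.sha256(repr(list(prefix_tokens)).encode()).hexdigest()

    def _cold_path(self, key):
        return self.cold_dir / f"{key}.blk"

    def put(self, key, block):
        self.hot[key] = block
        self.hot.move_to_end(key)
        # When the RAM tier is over capacity, spill the oldest blocks to disk.
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            self._cold_path(old_key).write_bytes(pickle.dumps(old_block))

    def get(self, key):
        """Return (block, tier), where tier is 'hot', 'cold', or 'miss'."""
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        path = self._cold_path(key)
        if path.exists():
            block = pickle.loads(path.read_bytes())
            path.unlink()         # promote: the block moves back to the RAM tier
            self.put(key, block)  # may spill a different block to disk
            return block, "cold"
        return None, "miss"       # caller would have to recompute the prompt


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache = TwoTierCache(hot_capacity=2, cold_dir=tmp)
        keys = [TwoTierCache.key_for(range(n)) for n in (4, 8, 12)]
        for i, key in enumerate(keys):
            cache.put(key, {"block": i})  # third put spills the first to disk
        print(cache.get(keys[2])[1])      # hot
        print(cache.get(keys[0])[1])      # cold
        print(cache.get("unknown")[1])    # miss
```

The point of the sketch is the lookup order: a hit in memory is cheapest, a hit on disk costs a read but still avoids recomputing the prompt, and only a full miss forces the model to start over. A real server would also have to size blocks, bound disk usage, and coordinate the cache with batched requests, none of which is modeled here.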

Shared from Scout - Be the smartest in the room.