New Tool Eases Local LLM Inference on Apple Silicon

A new open-source tool called oMLX has been launched to provide a practical menubar server for running large language models locally on Apple Silicon. It features continuous batching and an OpenAI-compatible API, making it easier for developers to work with on-device AI on M-series chips.

oMLX is built on Apple's own MLX framework, an open-source library from Apple's machine learning research designed for efficient computation on Apple silicon. MLX leverages the unified memory architecture of M-series chips, allowing operations to run on the CPU and GPU without data duplication, which optimizes performance for AI tasks. This foundation allows oMLX to directly harness the hardware's capabilities. The tool's key differentiator is its paged SSD caching mechanism. Unlike in-memory caches used by tools like Ollama and LM Studio which are invalidated when an agent's context shifts, oMLX persists Key-Value cache blocks to the SSD. This allows for near-instant restoration of previously computed contexts, dramatically reducing the Time-to-First-Token (TTFT) in complex, multi-turn interactions typical of coding agents from over 30 seconds to under 5. Performance benchmarks curated by the oMLX community show significant throughput gains from its continuous batching feature, with up to a 4.14x speedup under concurrent loads on an M3 Ultra. A systematic academic study comparing local LLM runtimes on Apple Silicon found that MLX-based solutions achieve the highest sustained generation throughput, outperforming alternatives like Ollama and PyTorch MPS for many use cases. The developer, who goes by the GitHub handle "jundot", created oMLX to solve personal frustrations with existing tools for agentic coding workflows. The project is open-source under the Apache 2.0 license and has seen rapid development, with recent versions adding support for Vision-Language Models (VLMs) that leverage the same SSD caching system. This aligns with the broader industry trend of enabling more complex, on-device AI experiences, a strategic focus for Apple's iOS and macOS development. This push for on-device AI directly impacts talent retention in Silicon Valley, where the demand for engineers with AI and machine learning expertise is causing salaries to soar, often exceeding $1 million annually. Companies are engaged in aggressive recruiting and use strategies like non-compete clauses to retain top talent, as the concentration of AI engineers remains highest in the Bay Area (35%) and Seattle (23%). The intense, fast-paced environment has also led to discussions of engineer burnout as teams race to meet aggressive AI timelines. Concurrently, new U.S. Department of Commerce regulations are tightening export controls on advanced AI hardware. These rules are expanding to require government approval for global sales of high-performance chips from companies like Nvidia and AMD, shifting from country-specific restrictions to a worldwide licensing framework. This policy aims to position the U.S. as a gatekeeper for AI infrastructure, potentially impacting supply chains and requiring purchasers of large AI accelerator quantities to invest in U.S.-based infrastructure.

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.