oMLX crushes Mac LLM lag
Developers report that oMLX delivers near-instant LLM inference on Apple Silicon by combining unified memory, SSD caching and smarter batching, a tangible win for on-device AI performance. The improvement is being discussed alongside new quantization and benchmarking tools for M-series chips, and it suggests that Apple’s vertically integrated stack can extract large efficiency gains from hardware/software co-design.
The oMLX GitHub repository shows roughly 6.5k stars, 516 forks and 556 commits. The project ships as a signed, notarized macOS .dmg and lists macOS 15+ and Python 3.10+ as requirements. (github.com)

oMLX persists KV cache blocks in safetensors format and implements a two-tier cache, hot in RAM and cold on disk, with LRU eviction. It relies on mlx-lm’s BatchGenerator to handle concurrent requests. (omlx.ai)

The project’s community benchmarks index contains 33,369 submitted runs. Examples include an M2 Max (30c) reporting 91.9 (in the units the index displays) on a Qwen3.5-27B 4bit 4k run, and M5 Max results peaking around 817.8 on 4k workloads. (omlx.ai)

oMLX exposes both OpenAI-compatible and Anthropic-compatible API endpoints and advertises drop-in compatibility with Claude Code, OpenClaw and Cursor. It can reuse an existing LM Studio model directory, and it offers a web admin UI and a one-click config generator. (omlx.ai)

In parallel with oMLX, community tooling for quantized LLMs has matured: christancho’s llm-quantization-benchmark offers cross-platform quantization comparisons, Apple’s Core ML Tools docs highlight the benefits of quantization for memory-bound models on Apple silicon, and write-ups comparing GGUF, GPTQ and AWQ examine tradeoffs for edge deployment. (github.com)

Operational guidance published by the project lists 16GB of memory as a minimum and recommends 64GB+ for heavier models. It calls M-series Pro/Max machines with 64GB the “sweet spot” for coding workflows and notes that its published benchmarks were run on an M3 Ultra 512GB configuration. (omlx.ai)
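To make the hot-in-RAM / cold-on-disk cache with LRU eviction described above more concrete, here is a minimal sketch of the general technique. It is not oMLX's code: the class name `TwoTierCache`, its methods, the `.bin` file layout and the `demo` helper are all hypothetical, and blocks are written as raw bytes with the standard library rather than as safetensors files, to keep the example self-contained. It also assumes block IDs are safe to use as file names.

```python
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Hypothetical hot-in-RAM / cold-on-disk block cache with LRU eviction."""

    def __init__(self, cold_dir, hot_capacity):
        self.cold_dir = Path(cold_dir)
        self.cold_dir.mkdir(parents=True, exist_ok=True)
        self.hot_capacity = hot_capacity
        # block_id -> bytes; least recently used entry sits at the front.
        self.hot = OrderedDict()

    def _cold_path(self, block_id):
        # Assumption: block_id is a filesystem-safe string.
        return self.cold_dir / f"{block_id}.bin"

    def put(self, block_id, data):
        """Insert a block into the hot tier, spilling LRU blocks to disk."""
        self.hot[block_id] = data
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            old_id, old_data = self.hot.popitem(last=False)
            self._cold_path(old_id).write_bytes(old_data)

    def get(self, block_id):
        """Return a block, promoting it from disk to RAM on a cold hit."""
        if block_id in self.hot:
            self.hot.move_to_end(block_id)  # mark as most recently used
            return self.hot[block_id]
        path = self._cold_path(block_id)
        if path.exists():
            data = path.read_bytes()
            self.put(block_id, data)  # promotion may evict another block
            return data
        return None  # miss in both tiers


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache = TwoTierCache(tmp, hot_capacity=2)
        cache.put("a", b"A")
        cache.put("b", b"B")
        cache.put("c", b"C")  # hot tier is full, so "a" spills to disk
        hot_after_put = list(cache.hot)
        value = cache.get("a")  # cold hit: read from disk and promote
        hot_after_get = list(cache.hot)
        return hot_after_put, value, hot_after_get
```

In this sketch a promoted block keeps its copy on disk, so a later eviction simply overwrites the same file. A real implementation would also need size-based limits, cleanup of the cold tier, and thread safety, none of which are shown here.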