oMLX: Mac‑native inference server

oMLX turns Apple Silicon laptops into local inference infra with continuous batching, tiered KV cache (RAM/SSD), multi‑model support, OpenAI/Anthropic API compatibility, and a web dashboard — the project has 6,600+ GitHub stars and targets local, cost‑efficient serving. It’s a practical option for fast prototyping and edge‑scale inference. (x.com)

The jundot/omlx GitHub repository lists roughly 7.9k stars and about 707 commits, and the project is published under an Apache‑2.0 license with a v0.3.0 release available for download. (github.com) oMLX’s core runtime implements a two‑tier KV cache that persists cache blocks to SSD in safetensors format (hot RAM + cold SSD) with LRU eviction, allowing previously seen prefixes to be restored across requests and server restarts rather than recomputed. (omlx.ai) The server uses mlx‑lm’s continuous batching (BatchGenerator) to handle concurrent requests and the project’s benchmarks report up to a 4.14× generation speedup at 8× concurrency compared to single‑request serving. (omlx.ai) oMLX exposes both OpenAI‑compatible HTTP endpoints (e.g., /v1/chat/completions) and an Anthropic‑compatible /v1/messages endpoint, explicitly calling out compatibility with Claude Code, OpenClaw, and Cursor as drop‑in backends. (github.com) Published benchmarks were run on an M3 Ultra 512GB rig and the project documentation recommends a 16GB minimum with 64GB+ for comfortable work on larger models; the site lists examples such as Qwen3.5‑122B‑A10B‑4bit in its performance notes. (omlx.ai) Packaging and ops details include a signed, notarized macOS.dmg with in‑app auto‑update plus a Homebrew tap (brew tap jundot/omlx) and a brew services mode for background operation; visible community activity includes multiple forks and a Windows port (jadumate/omlx4win) indicating active ecosystem experimentation. (github.com)

oMLX: Mac‑native inference server

Get your own daily briefing