oMLX persists KV cache to SSD
- Jun Kim’s oMLX for Apple Silicon is pushing SSD-backed KV cache into mainstream local inference, turning long-context Mac sessions into resumable work instead of repeated recompute.
- The concrete win is startup speed on reused context: oMLX says agent time-to-first-token can fall from roughly 30–90 seconds to under 5.
- That matters because coding agents constantly invalidate context; persisting cache to disk makes longer runs, restarts, and multitasking much more practical.
Local inference on a Mac usually breaks in a very specific way. The model itself fits, the tokens stream, and then a coding agent changes context mid-session and your machine has to rebuild a huge chunk of attention state from scratch. That rebuild is the drag. It is why long local sessions feel great right up until they don’t. oMLX is trying to fix exactly that by persisting KV cache blocks to SSD instead of treating them as disposable RAM-only state.

### What is KV cache, really?

KV cache is the model’s working memory for the current context window: the attention keys and values it has already computed for every token seen so far. Reusing that memory is what makes long prompts and ongoing conversations fast enough to feel interactive. Lose the cache, and the model has to reread old context the expensive way, pushing every token back through every layer.

### Why is local Mac inference bad at this?

Because agent workflows are messy. (github.com) A coding assistant edits files, injects tool results, swaps system instructions, and jumps between related prompts. That can invalidate normal in-memory cache layouts, so engines fall back to recomputing large shared prefixes. On Apple Silicon, where unified memory is precious and you may be multitasking with real apps, that recompute hurts twice: latency goes up and memory pressure gets uglier. (bentoml.com)

### What changed in oMLX?

oMLX now leans into a two-tier cache design: hot cache in memory, cold cache on SSD. The project page says past context stays cached and reusable across requests even when context changes mid-conversation, and the current Mac build listings describe KV cache blocks being restored instantly from SSD instead of recomputed. In plain English, the engine is treating your disk less like slow storage and more like a parking garage for attention state. (geeky-gadgets.com)

### Why does SSD persistence help so much?

Because recomputing old context is often the slowest part of long agent loops. If the engine can reload previously built KV blocks, it skips a pile of redundant work before the first fresh token appears. oMLX’s site frames the payoff in user terms: long-context agent time-to-first-token drops from about 30–90 seconds to under 5 seconds when reused context can come back from SSD. That is the difference between “local but annoying” and “local enough to keep using.” (github.com)

### Is this just about speed?

Not really. It is also about survivability. Disk-backed cache means a session can tolerate restarts and memory churn better than a pure RAM cache. That is especially useful on laptops and desktops doing real work at the same time, because the model no longer has to monopolize memory just to preserve every useful prefix. The whole pitch is smoother multitasking, not just benchmark bragging rights. (omlx.ai)

### What is the catch?

SSD caching adds engineering complexity and some overhead. oMLX’s release notes explicitly say cache snapshotting and BatchGenerator cost a few percent, and one known dip on long hybrid-attention workloads is still being tracked. There is also at least one open crash report tied to SSD cache on a hybrid model in 0.3.8.x and 0.3.9.dev1, so this is powerful but not magic and not bug-free yet. (geeky-gadgets.com)

### Why does this matter beyond one app?

Because local AI on Apple Silicon has been missing a good answer to long-running agent memory. oMLX is basically saying the right abstraction is not “keep everything in RAM” but “tier the cache like an operating system.” If that idea sticks, more Mac-native inference stacks will probably copy it.

### Bottom line?

The interesting part is not that oMLX got a bit faster. It is that it treats KV cache as something worth saving across interruptions, not something to rebuild every time the workflow gets complicated. (github.com) For local agents on Macs, that is a much bigger shift than a small benchmark win. (github.com)
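
### What might the two tiers look like in code?

The hot-in-memory, cold-on-SSD idea is easier to picture with a toy. What follows is a minimal sketch under loud assumptions, not oMLX's implementation or API. Every name in it (`block_keys`, `TieredBlockCache`, `flush`, `demo`, the `.kv` file suffix, the four-token block size) is hypothetical, and plain byte strings stand in for real key/value tensors. It shows three things: blocks keyed by a hash of the whole prefix up to that block, least-recently-used blocks spilling from memory to disk, and a fresh process picking up whatever prefix still matches.

```python
import hashlib
import os
import tempfile
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per block; tiny on purpose, purely illustrative


def block_keys(tokens, block_size=BLOCK_SIZE):
    """One chained hash per full block, so each key identifies the entire prefix."""
    keys, parent = [], b""
    for start in range(0, len(tokens) - block_size + 1, block_size):
        chunk = ",".join(map(str, tokens[start:start + block_size])).encode()
        parent = hashlib.sha256(parent + chunk).digest()
        keys.append(parent.hex())
    return keys


class TieredBlockCache:
    """Hypothetical two-tier cache: an LRU dict in memory, files on disk."""

    def __init__(self, cold_dir, hot_capacity=2):
        self.cold_dir = cold_dir
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # key -> payload bytes, oldest first

    def _path(self, key):
        return os.path.join(self.cold_dir, key + ".kv")

    def _write_cold(self, key, payload):
        with open(self._path(key), "wb") as f:
            f.write(payload)

    def put(self, key, payload):
        self.hot[key] = payload
        self.hot.move_to_end(key)
        # Spill least-recently-used blocks to disk instead of dropping them.
        while len(self.hot) > self.hot_capacity:
            old_key, old_payload = self.hot.popitem(last=False)
            self._write_cold(old_key, old_payload)

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                payload = f.read()
            self.put(key, payload)  # promote the block back into the hot tier
            return payload
        return None  # a real engine would recompute this block

    def flush(self):
        """Persist everything still in memory, e.g. before shutdown."""
        for key, payload in self.hot.items():
            self._write_cold(key, payload)

    def cached_prefix_blocks(self, tokens):
        """Count leading blocks of `tokens` that can be restored from either tier."""
        count = 0
        for key in block_keys(tokens):
            if self.get(key) is None:
                break
            count += 1
        return count


def demo():
    with tempfile.TemporaryDirectory() as cold_dir:
        cache = TieredBlockCache(cold_dir, hot_capacity=2)
        session = list(range(12))  # three full blocks of fake token ids
        for i, key in enumerate(block_keys(session)):
            cache.put(key, f"kv-block-{i}".encode())
        cache.flush()

        # Simulated restart: a new object with empty memory, same directory.
        restarted = TieredBlockCache(cold_dir, hot_capacity=2)
        edited = session[:8] + [99, 98, 97, 96]  # the agent rewrote the tail
        return (restarted.cached_prefix_blocks(session),
                restarted.cached_prefix_blocks(edited))


if __name__ == "__main__":
    print(demo())
```

Calling `demo()` returns `(3, 2)`: after the simulated restart all three blocks of the original session come back from disk, and the edited session still reuses the two blocks in front of the change. The chained hash is the design choice doing the work here. Because each key depends on everything before it, a block is only reused when the full prefix matches, which is the property an engine needs before it can safely skip recomputation.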