MLX‑LM enables 240K‑token context on one device
- MLX‑LM reported three optimization techniques (SpecPrefill, an asymmetric KV cache, and prompt caching) that together delivered a 3.1× faster time‑to‑first‑token and supported a 240,000‑token context on a single device.
- The approach relies on memory and cache management rather than a multi‑machine cluster, keeping large‑context inference on a single M‑class machine.
- This suggests very large context sizes can shift toward capable client devices, given careful memory and cache management.
(x.com/i/status/2047644680750285074)
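
To give a rough sense of why KV cache management matters at this context length, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, and "asymmetric" is interpreted here as storing keys and values at different bit widths, which the source does not spell out. It is not MLX‑LM code and does not reproduce the reported numbers.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, key_bits, value_bits):
    """Approximate KV cache size in bytes.

    Ignores quantization scales/offsets and allocator overhead, so real
    usage would be somewhat higher.
    """
    # Number of stored elements per token for ONE tensor (keys or values).
    elements_per_token = layers * kv_heads * head_dim
    total_bits = tokens * elements_per_token * (key_bits + value_bits)
    return total_bits // 8


# Hypothetical model shape, chosen only for illustration.
TOKENS = 240_000
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

# Baseline: keys and values both stored in 16-bit floats.
fp16_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, 16, 16)

# Assumed asymmetric layout: 8-bit keys, 4-bit values.
asym_bytes = kv_cache_bytes(TOKENS, LAYERS, KV_HEADS, HEAD_DIM, 8, 4)

GIB = 1024 ** 3
print(f"16-bit K and V : {fp16_bytes / GIB:.1f} GiB")
print(f"8-bit K, 4-bit V: {asym_bytes / GIB:.1f} GiB")
print(f"reduction       : {fp16_bytes / asym_bytes:.2f}x")
```

Under these assumed numbers the 16-bit cache alone would take roughly 29 GiB at 240,000 tokens, while the mixed-precision layout needs about 11 GiB, which is the kind of gap that decides whether a long context fits in a single machine's memory.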