Apple Silicon ML Gets Boost with mlx-lm Update

Apple's MLX library for local inference on Apple Silicon just got a significant upgrade. The new `mlx-lm` version adds support for the Qwen3.5 model, improves KV caching, and enables file sharing via `mlx-distributed`. The update makes it easier for developers to run powerful language models efficiently on their MacBooks for side projects and local experimentation.

Apple's MLX, an open-source machine learning framework, was first released in December 2023 by Apple's machine learning research team. It is designed for efficient model training and deployment on Apple Silicon, leveraging the unified memory architecture of M-series chips. This design avoids the need to copy data between CPU and GPU memory, a significant advantage for running large models. The framework's API is intentionally similar to NumPy and PyTorch, making it familiar to developers already in the Python data science ecosystem.

MLX supports a range of machine learning tasks, including large-scale text generation, image generation with models like Stable Diffusion, and speech recognition with OpenAI's Whisper. This familiarity and broad support have led to a growing community on Hugging Face, where thousands of models have been converted to the MLX format.

Running large language models locally provides significant benefits in terms of privacy, cost, and latency. Since data is processed on-device, there's no need to send potentially sensitive information to external servers. This also eliminates recurring API fees and network-dependent delays, offering developers more control and faster iteration cycles for their projects.

The recent addition of Qwen2, a series of powerful language models from Alibaba Cloud, further expands the capabilities of MLX. These models, ranging from 0.5 to 72 billion parameters, have demonstrated strong performance across benchmarks for language understanding, coding, and reasoning. Native support in `mlx-lm` means developers can easily run these advanced models on their local machines.

Improvements to key-value (KV) caching are critical for the performance of large language models, especially during long interactions. The KV cache stores the key and value tensors that the model's attention layers have already computed for earlier tokens, so they do not have to be recomputed each time a new token is generated. More efficient caching reduces memory usage and processing time, making it faster to work with long documents or maintain extended conversations with a model.

The introduction of `mlx-distributed` facilitates the sharing of model files and the distribution of computation across multiple machines. This feature leverages high-speed interconnects like Thunderbolt to enable parallel processing for training and inference. For developers with multiple Apple Silicon devices, this can dramatically reduce the time required for computationally intensive tasks.
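
To make the KV caching idea concrete, here is a minimal, framework-free sketch of single-head attention during token-by-token generation. It is not how `mlx-lm` implements its cache; the names `KVCache`, `project`, `attend`, `decode_step_cached`, and `decode_step_uncached` are hypothetical and exist only for this illustration. The cached path projects each token to a key and value once, while the uncached path redoes that work for the whole prefix at every step.

```python
import math


def project(x, w):
    """Compute x @ w for a vector x and a matrix w given as a list of rows."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


class KVCache:
    """Hypothetical toy cache: keeps the keys and values of earlier tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def decode_step_cached(x, wq, wk, wv, cache):
    """Handle one new token: project it once, then attend over the cache."""
    cache.append(project(x, wk), project(x, wv))
    return attend(project(x, wq), cache.keys, cache.values)


def decode_step_uncached(prefix, wq, wk, wv):
    """Handle one new token by recomputing keys and values for the prefix."""
    keys = [project(x, wk) for x in prefix]
    values = [project(x, wv) for x in prefix]
    return attend(project(prefix[-1], wq), keys, values)


# Tiny made-up weights and token embeddings, for illustration only.
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.1], [0.2, 0.3]]
wv = [[1.0, 2.0], [3.0, 4.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
for step, token in enumerate(tokens):
    cached = decode_step_cached(token, wq, wk, wv, cache)
    uncached = decode_step_uncached(tokens[: step + 1], wq, wk, wv)
    # Both paths should agree; only the amount of repeated work differs.
    print(step, cached, uncached)
```

In this sketch the cached path performs one key projection and one value projection per step, while the uncached path performs one of each per token in the prefix, so its per-step cost grows with the length of the conversation. The trade-off is memory: the cache itself grows with every token, which is why more efficient cache handling matters for long contexts.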
