MLX‑LM‑LoRA v2.1.0 released

- MLX‑LM‑LoRA v2.1.0 has been published, adding Quantization Aware Training (QAT) for supervised and preference tuning.
- The release also enables vision‑language model (VLM) merging and running Olmo 3 7B Instruct QAT 4‑bit on Apple silicon.
- These features are designed to make efficient fine‑tuning and model merging more practical for small teams and edge deployments. (x.com)

MLX‑LM‑LoRA 2.1.0 adds a way to train models while “practicing” for lower precision, so the final 4‑bit or 8‑bit version holds up better on Macs. (pypi.org) The package was released on PyPI on April 23, 2026, and its new Quantization Aware Training feature works with supervised fine‑tuning, Direct Preference Optimization, and Odds Ratio Preference Optimization. (pypi.org)

Quantization is the step that shrinks a model by storing weights in fewer bits, like compressing a photo to fit on a phone. MLX‑LM itself is Apple’s Python package for running and fine‑tuning large language models on Apple silicon, with support for quantized models and Hugging Face downloads. (github.com)

In this release, MLX‑LM‑LoRA says it can apply quantization projection during training, with settings for 4‑ to 16‑bit formats, group or per‑tensor schemes, and configurable start and interval steps. The project describes that as a way to simulate quantization effects before export instead of after the model is already trained. (pypi.org)

That changes the tradeoff for small teams using Apple laptops and desktops, because a model that survives 4‑bit conversion can run in less memory than a full‑precision version. MLX‑LM’s default example model is already a 4‑bit Llama 3.2 3B Instruct variant, which shows how central low‑precision inference has become in the MLX ecosystem. (github.com)

The release also points to a new MLX Community model, `Olmo-3-7B-Instruct-mxfp4-QAT`, and its model card says it was fine‑tuned with MLX‑LM‑LoRA 2.1.0. The repository lists a 3.88 GB safetensors file and tags the model as 4‑bit precision. (huggingface.co 1) (huggingface.co 2)

Olmo 3 is Ai2’s open model family, and Ai2 says the 7B Instruct version is built for multi‑turn chat and tool use. A 7‑billion‑parameter chat model compressed to 4‑bit is the kind of target that can fit local workflows on Apple hardware more easily than its larger full‑precision counterpart. (allenai.org) (huggingface.co)

PyPI’s project page also says MLX‑LM‑LoRA supports synthetic dataset creation for prompts, supervised fine‑tuning data, and preference data, plus training a custom preference model for online preference tuning. Those additions sit alongside older LoRA, DoRA, QLoRA, and full‑precision training modes already listed by the package. (pypi.org)

The result is not a new base model but a wider set of tools for adapting and shrinking open ones on Apple silicon. The main claim of version 2.1.0, released on April 23, 2026, is simple: train for the smaller model you plan to ship, not just the larger one you start with. (pypi.org)
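
To make the idea of quantization projection with start and interval steps more concrete, here is a minimal sketch in plain Python. It is not MLX‑LM‑LoRA's code or API: the function names, default values, and toy weights are all hypothetical, and it uses a simple symmetric integer grid per group instead of the mxfp4 format of the released model. The last helper does the rough memory arithmetic behind the 4‑bit claim and ignores per‑group scales, unquantized layers, and file metadata, which is one reason a real file such as the 3.88 GB one listed above comes out larger than the estimate.

```python
def fake_quantize(weights, bits=4, group_size=32):
    """Snap each group of weights to a symmetric integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4-bit
    projected = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale per group
        projected.extend(round(w / scale) * scale for w in group)
    return projected


def should_project(step, start_step=100, interval=10):
    """Hypothetical schedule: project at start_step, then every `interval` steps."""
    return step >= start_step and (step - start_step) % interval == 0


def approx_weight_gb(num_params, bits_per_weight):
    """Rough weight storage in GB (10^9 bytes), ignoring scales and metadata."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Toy stand-in for a weight tensor; a real trainer would update it between steps.
    weights = [0.31, -0.07, 0.92, -0.55, 0.004, 0.18, -0.77, 0.4]
    for step in (90, 100, 105, 110):
        if should_project(step):
            weights = fake_quantize(weights, bits=4, group_size=4)
        print(step, [round(w, 3) for w in weights])
    # 7 billion parameters at 16 bits versus 4 bits per weight.
    print(approx_weight_gb(7e9, 16), approx_weight_gb(7e9, 4))
```

In a real training loop the projection would sit between optimizer updates, so the model keeps learning with weights that already look like their low‑bit versions. A smaller `interval` applies that pressure more often, and a later `start_step` lets the model train at full precision first.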
