Hugging Face ships TRL v1.0

Hugging Face released TRL v1.0, a unified post‑training stack that bundles SFT, reward modeling, DPO, and GRPO workflows to standardize how LLMs are fine‑tuned for production. The update is pitched as a one‑stop toolchain for preference‑based tuning and evaluation. (marktechpost.com)

Hugging Face published the TRL v1.0 announcement on March 31, 2026. The official blog post credits Quentin Gallouédec, with co-authors Steven Liu, Pedro Cuenca, and Sergio Paniego. (huggingface.co) The project's GitHub repository shows a formal v1.0.0 release (commit f3e9ac1) with a release entry summarizing the milestone changes. (github.com) TRL's commit history goes back more than six years, and the announcement states that the codebase now implements over 75 post‑training methods. (huggingface.co)

Version 1.0 introduces Asynchronous GRPO, which offloads generation rollouts to an external vLLM server. Generation then runs in parallel with training, which reduces idle GPU time. (github.com) The release also adds several experimental trainers and algorithms, including VESPO (Variational Sequence‑Level Soft Policy Optimization), DPPO (Divergence Proximal Policy Optimization), and SDPO (Self‑Distillation Policy Optimization). VESPO is designed specifically to smooth sequence‑level importance weights. (github.com) Example snippets in the release demonstrate AsyncGRPOTrainer with models such as Qwen/Qwen2.5-0.5B-Instruct and the trl-lib/DeepMath-103K dataset, highlighting workflows aimed at verifier-style math and code tasks. (github.com)

The trl repository lists roughly 17.9k stars and 2.6k forks on GitHub, and third‑party coverage has reported that the library sees on the order of 3 million downloads per month. (github.com)
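
The idea behind Asynchronous GRPO, generation running in parallel with training, can be illustrated with a plain producer/consumer sketch. This is not TRL's API or implementation: it uses only the Python standard library, and every name in it (`rollout_worker`, `train_loop`, `run_async`, the uppercase "generation" placeholder) is hypothetical. It assumes a single generation worker feeding a bounded queue, standing in for the external vLLM server mentioned in the release.

```python
import queue
import threading


def rollout_worker(prompts, rollout_queue, generate):
    """Producer: generate a completion per prompt and hand it to the trainer."""
    for prompt in prompts:
        rollout_queue.put((prompt, generate(prompt)))
    rollout_queue.put(None)  # sentinel: no more rollouts will arrive


def train_loop(rollout_queue, batch_size, train_step):
    """Consumer: run a training step as soon as a full batch is available."""
    batch, steps = [], 0
    while True:
        item = rollout_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            train_step(batch)
            steps += 1
            batch = []
    if batch:  # flush a final partial batch
        train_step(batch)
        steps += 1
    return steps


def run_async(prompts, batch_size=2, max_pending=4):
    """Run generation in a background thread while the main thread trains."""
    seen_batches = []
    # Bounded queue: generation may run ahead of training by at most
    # `max_pending` rollouts, which limits how stale the samples can get.
    rollout_queue = queue.Queue(maxsize=max_pending)

    def generate(prompt):
        # Placeholder for a call to a real generation server (assumption).
        return prompt.upper()

    producer = threading.Thread(
        target=rollout_worker, args=(prompts, rollout_queue, generate)
    )
    producer.start()
    # Placeholder "training step": just record each batch it receives.
    steps = train_loop(rollout_queue, batch_size, seen_batches.append)
    producer.join()
    return steps, seen_batches


if __name__ == "__main__":
    steps, batches = run_async(["a", "b", "c", "d", "e"])
    print(steps, batches)
```

The bounded queue is the main design choice in this sketch: it lets the generator stay busy while the trainer works, but caps how far rollouts can drift from the policy currently being trained.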

