Free RLHF course released
Nathan Lambert, author of The RLHF Book, launched a free companion RLHF course with lectures on reward models, rejection sampling, and implementation details aimed at explaining post‑training pipelines. The course includes a YouTube playlist and landing page for hands‑on learning about preference‑model workflows. (x.com/i/status/2044096504655425698)
Reinforcement learning from human feedback (RLHF) is the step that turns a base language model into a chatbot, and Nathan Lambert just released a free course that teaches how that step works. (rlhfbook.com) The course page on the RLHF Book site lists four lectures: an overview; a session on instruction fine-tuning, reward models, and rejection sampling; a lecture on policy-gradient math; and a lecture on implementation details such as loss aggregation and asynchronous training. (rlhfbook.com)

Lambert’s site says the course launched in March 2026, and the homepage changelog says the book site received “lecture videos” that month. A YouTube intro video for the course was posted about a week before April 14, 2026. (rlhfbook.com) (youtube.com 1) (youtube.com 2)

The subject is technical but central to current artificial intelligence systems. Lambert’s book describes the usual post-training pipeline as instruction tuning first, then a reward model that scores outputs, then methods such as rejection sampling, reinforcement learning, or direct alignment to push the model toward preferred answers. (rlhfbook.com) (arxiv.org) A reward model is a learned grader: people compare two answers, the system learns which one they tend to prefer, and that learned score is later used to choose or train better responses. Rejection sampling is a simpler use of the same score: a model generates several answers and keeps the one that scores highest. (rlhfbook.com) (manning.com)

That machinery moved from research topic to standard product practice after ChatGPT popularized reinforcement learning from human feedback in consumer chatbots. Manning’s description of Lambert’s book says the technique is now part of mainstream post-training for models such as Llama-Instruct, Zephyr, OLMo, and Tülu. (manning.com)

Lambert has been building the material in public for more than a year. The web version of the book first appeared on arXiv on April 16, 2025, and the current arXiv entry shows a revised version uploaded on April 4, 2026. (arxiv.org) The course also sits next to other open teaching material on the site, including slide decks, source links, and a model-completions library for comparing intermediate post-training stages. That gives readers a way to move from a 220-page text into short lectures and code-oriented examples without paying for the Manning edition. (rlhfbook.com 1) (rlhfbook.com 2) (manning.com)

For people trying to understand why chatbots answer the way they do, the new course focuses less on pretraining and more on the last-mile tuning that shapes tone, helpfulness, and refusal behavior. That is the part of the stack Lambert’s book calls “post-training,” and it is now available in a free lecture series. (rlhfbook.com)
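
The reward-model and rejection-sampling ideas described in this article can be sketched in a few lines of Python. This is an illustrative toy, not code from Lambert's course or book. The `toy_reward` function is a hypothetical stand-in for a learned reward model, the candidate strings stand in for sampled model outputs, and `pairwise_loss` assumes the Bradley-Terry style objective commonly used for preference training, which the article itself does not specify.

```python
import math


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry style loss: -log(sigmoid(chosen - rejected)).

    The loss is small when the human-preferred answer gets the higher score.
    """
    margin = score_chosen - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def toy_reward(prompt: str, answer: str) -> float:
    """Hypothetical stand-in for a learned reward model.

    Rewards answers that reuse words from the prompt and lightly
    penalizes length. A real reward model is a trained neural network.
    """
    prompt_words = set(prompt.lower().split())
    answer_words = answer.lower().split()
    overlap = sum(1 for word in answer_words if word in prompt_words)
    return overlap - 0.1 * len(answer_words)


def best_of_n(prompt: str, candidates: list[str], reward_fn) -> str:
    """Rejection sampling as described above: score each candidate, keep the top one."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    return max(candidates, key=lambda answer: reward_fn(prompt, answer))


prompt = "what is a reward model"
candidates = [
    "I cannot say.",
    "A reward model is a learned grader that scores model answers.",
    "Models are interesting and there are many kinds of them in the world today.",
]

# Keep the highest-scoring candidate under the toy reward.
print(best_of_n(prompt, candidates, toy_reward))

# The loss is lower when the preferred answer already outscores the other one.
print(pairwise_loss(2.0, 0.0) < pairwise_loss(0.0, 2.0))
```

In a real pipeline the candidates would be sampled from the language model, and the scoring function would be a network trained on human comparisons by minimizing a loss of this kind over many answer pairs.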