New RL paper: distributional outputs
A new paper, "Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models," shows that reinforcement learning can train LLMs to emit several diverse, plausible answers in a single pass. This is useful for ambiguous tasks such as coding or QA, where it avoids repeated sampling. The technique could reduce latency and cost where multiple samples were previously required. (x.com)
arXiv lists "Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models" (arXiv:2603.24844) as submitted on March 25, 2026, with authors Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, and Yoon Kim (correspondence: ishapuri@mit.edu). (arxiv.org) The authors published code in the ishapuri/multi_answer_rl repository on GitHub and maintain a project site at multi-answer-rl.github.io under the multi-answer-rl GitHub organization. (github.com)

The repo includes a medical dataset loader referencing mehuldamani/medDataset_25k, described as 25,000 patient cases drawn from DDXPlus and used for differential-diagnosis experiments. (github.com) The training artifacts include configs referencing Qwen3-8B, a deepspeed.yaml tuned for four-GPU ZeRO-2 runs, and an explicit instruction to install a pinned TRL commit before training; the README shows accelerate/deepspeed launch commands. (github.com)

The paper's abstract reports measurable improvements in diversity, coverage, and set-level calibration across QA, medical-diagnostic, and coding benchmarks. It also states that the multi-answer RL models use fewer tokens to produce multiple answers and show substantial accuracy gains on coding tasks. (arxiv.org) The repository's reproducibility materials expose named training modes (RLCR and RLVR), reward functions that include a Brier component, evaluation scripts, and example eval configs, so practitioners can run the same RL setups end to end. (github.com)
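For readers unfamiliar with the Brier component mentioned above, here is a minimal sketch of how a Brier-style calibration term can be computed over a set of answers that each carry a stated confidence. The function names `brier_score` and `brier_reward`, the input format, and the `1 - Brier` shaping are illustrative assumptions; this is not the repository's actual reward function.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and 0/1 correctness.

    Hypothetical helper for illustration; 0.0 is perfectly calibrated
    and confident, 1.0 is confidently wrong on every answer.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one answer is required")
    for p in confidences:
        if not 0.0 <= p <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
    squared_gaps = [(p - float(y)) ** 2 for p, y in zip(confidences, correct)]
    return sum(squared_gaps) / len(squared_gaps)


def brier_reward(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Assumed reward shaping: higher is better, so use 1 - Brier score."""
    return 1.0 - brier_score(confidences, correct)


if __name__ == "__main__":
    # Two candidate answers: the first is right and stated at 0.8,
    # the second is wrong and stated at 0.3.
    print(brier_reward([0.8, 0.3], [True, False]))
```

A term like this rewards the model for attaching high confidence only to answers that turn out to be correct, which is the usual motivation for adding a calibration component alongside a correctness reward.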