Recursive Language Models
- MIT researchers unveiled Recursive Language Models (RLMs), which let a model write code to search documents recursively instead of relying on standard retrieval-augmented generation (RAG).
- The approach reportedly handles 10 million-plus tokens and scored 58 on the OOLONG-Pairs long-context benchmark, versus 0.04 for standard models.
- RLMs aim to solve very-long-document tasks by delegating searches to sub-queries, changing how long-context systems are architected (x.com).
Language models read text through a fixed-size window, and their accuracy often drops as that window fills up. MIT researchers say a new setup called Recursive Language Models lets the model search huge documents with code instead of trying to read everything at once. (arxiv.org)

The paper was posted on arXiv on Dec. 31, 2025, and revised on Jan. 28, 2026. It lists Alex L. Zhang, Tim Kraska, and Omar Khattab of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) as authors. MIT CSAIL also hosted a talk on the work on Feb. 25, 2026. (arxiv.org) (csail.mit.edu)

The basic idea is to treat a long prompt like an external file, not like one giant message stuffed into the model. The model gets a Python-style environment where it can split text, search it, and call smaller sub-queries on selected pieces before writing an answer. (arxiv.org) (infoq.com)

That differs from retrieval-augmented generation, the common method where a system fetches a few relevant chunks and feeds them back to the model. Recursive Language Models instead let the model write the retrieval steps itself while it works through the document. (arxiv.org 1) (arxiv.org 2)

MIT says the method handled inputs up to two orders of magnitude beyond the base model’s context window. In the paper’s main figure, the authors compare GPT-5 and a GPT-5-based recursive system on tasks that scale from 8,192 to 262,144 tokens, with the recursive setup continuing beyond GPT-5’s 272,000-token window. (arxiv.org 1) (arxiv.org 2)

On the OOLONG-Pairs benchmark, the paper reports an F1 score of 58.0% for the GPT-5-based recursive system, while GPT-5 and Qwen3-Coder stayed below 0.1%. Alex Zhang’s project page also says the system did not show performance degradation at 10 million-plus tokens in the team’s tests. (arxiv.org) (alexzhang13.github.io)

The paper says Recursive Language Models beat “vanilla frontier LLMs” and other long-context scaffolds across four tasks at comparable cost, and that a post-trained small model called RLM-Qwen3-8B beat base Qwen3-8B by 28.3% on average. The authors frame the approach as an inference-time method, meaning the gain comes from how the model is used at run time, not from expanding the transformer’s native context window. (arxiv.org)

Other researchers are already testing the idea’s limits. An Apple paper posted on arXiv on March 7, 2026, said a related method called Self-Reflective Program Search outperformed Recursive Language Models by as much as 22% under the same time budget, and argued that recursion alone may not be the main reason for the gains. (arxiv.org)

MIT’s paper does not present Recursive Language Models as a new foundation model architecture. It presents them as a control layer that lets existing models inspect long text more like analysts paging through files than readers trying to memorize a whole book at once. (arxiv.org) (infoq.com)

The immediate question is whether long-context systems will keep racing to bigger windows or shift toward tools that search, slice, and recurse over external text. MIT’s results put that tradeoff into the open with concrete numbers instead of bigger token counts alone. (arxiv.org)
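
For readers who want a concrete picture of the split, search, and delegate pattern described above, here is a minimal Python sketch. It is an illustration only, not the authors' code: in the paper's setup the model writes steps like these itself, while here they are hard-coded. The names `split_into_chunks`, `select_chunks`, `answer_over_document`, and `toy_sub_query` are hypothetical, and `toy_sub_query` is a simple stand-in for a real model call.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, lines_per_chunk: int) -> List[str]:
    """Split the external document into chunks of whole lines."""
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def select_chunks(chunks: List[str], pattern: str) -> List[str]:
    """Keep only chunks that match a regex: a cheap, code-level search."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [chunk for chunk in chunks if regex.search(chunk)]


def answer_over_document(
    document: str,
    question: str,
    pattern: str,
    sub_query: Callable[[str, str], str],
    lines_per_chunk: int = 100,
) -> str:
    """Answer a question without ever passing the whole document to a model."""
    chunks = split_into_chunks(document, lines_per_chunk)
    selected = select_chunks(chunks, pattern)
    # Delegate: each selected chunk goes to its own sub-query.
    partials = [sub_query(chunk, question) for chunk in selected]
    # Aggregate: one more call combines the partial answers.
    return sub_query("\n".join(partials), question)


def toy_sub_query(context: str, question: str) -> str:
    """Hypothetical stand-in for a model call (assumption, not a real API).

    It ignores the question and returns the lines that mention 'access code'.
    """
    hits = [ln for ln in context.splitlines() if "access code" in ln.lower()]
    return "\n".join(hits)


if __name__ == "__main__":
    filler = "Nothing of interest here.\n" * 2500
    document = filler + "The access code is 7421.\n" + filler
    print(answer_over_document(
        document, "What is the access code?", r"access code", toy_sub_query
    ))
```

In this toy run, only the one chunk that matches the search pattern is passed to a sub-query, followed by a single aggregation call, so the stand-in "model" never sees most of the 5,001-line document.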