ContextCurator filters noise
Researchers released ContextCurator, a 7B RL-trained model that filters structural noise from agent observations (e.g., DOM trees and retrieval results) and keeps high-signal context for downstream reasoning. The paper reports that it removes ~90% of structural noise and raises WebArena success from 36.4% to 41.2% and DeepSearch from 53.9% to 57.1% by maintaining a compressed, high-signal memory (x.com).
Large language model agents often fail for a simple reason: they keep too much junk in memory. A new paper says a separate 7-billion-parameter model called ContextCurator can strip that clutter before the main model reasons. (arxiv.org)
The paper, “Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning,” was posted in April 2026 by researchers from Tongji University, Stanford University, and CurrentsAI Research, along with independent coauthor Yang Li. It pairs a frozen “task executor” model with a smaller policy model trained to keep only task-relevant context. (arxiv.org)
The underlying problem is long-horizon work: web browsing agents and retrieval agents collect observations over many steps, then lose track of what matters. The authors write that raw web page structure, including document object model (DOM) trees, can contain more than 90 percent structural noise such as styling code, ads, and repeated navigation elements. (arxiv.org)
WebArena is one of the benchmarks used to test that kind of agent. It is a simulated web environment introduced in 2023 with functional sites for shopping, forums, software development, and content management, built to measure whether agents can complete realistic browser tasks. (arxiv.org)
In the new paper, the authors report that adding ContextCurator raised Gemini-3.0-flash’s WebArena success rate to 41.2 percent from 36.4 percent. They also report that token use fell 8.8 percent, to 43.3 thousand from 47.4 thousand. (arxiv.org)
The same setup improved performance on a deep-search retrieval task to 57.1 percent from 53.9 percent, according to the paper. The authors say that result came with an eightfold reduction in tokens, achieved by storing a compressed, high-signal working memory instead of the full observation stream. (arxiv.org)
The training method matters here. Instead of asking one model to both solve the task and decide what to remember, the researchers used reinforcement learning to train the curator as a specialist that prunes noise while preserving what the paper calls “reasoning anchors,” or small facts needed for later steps. (arxiv.org)
The paper also says the 7-billion-parameter curator matched the context-management performance of OpenAI’s GPT-4o in its setup. That claim points to a broader design shift in agent research: use a cheaper model to manage memory and reserve the larger model for actual decision-making. (arxiv.org)
Other recent agent papers have pushed on the same bottleneck from different angles. FocusAgent, a 2025 paper, reported trimming noisy context by more than 50 percent on WebArena-style tasks while maintaining baseline performance, suggesting context selection is becoming its own subproblem in agent design. (arxiv.org)
The immediate test for ContextCurator will be whether its gains hold outside the paper’s benchmarks and model pairings. For now, the result is narrower and more concrete: one smaller model handled the remembering, and the larger agent got better at the task. (arxiv.org)
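For readers who want a concrete picture of the split between curator and executor, the division of labor described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the names `toy_curator`, `CuratedAgent`, and `NOISE_MARKERS` are hypothetical, and a simple keyword rule stands in for the RL-trained 7B curator, whose actual behavior is learned rather than hand-written.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical markers of structural noise (styling, scripts, ads, navigation).
# Assumption for illustration: the paper's curator is a learned policy model,
# not a keyword filter like this one.
NOISE_MARKERS = ("<style", "<script", 'class="ad', "<nav")


def toy_curator(observation: str, task: str) -> str:
    """Keep lines that mention a task word and do not look like structural noise."""
    task_words = {w.lower() for w in task.split() if len(w) > 3}
    kept = []
    for line in observation.splitlines():
        lowered = line.lower()
        if any(marker in lowered for marker in NOISE_MARKERS):
            continue  # drop styling, scripts, ads, and navigation
        if any(word in lowered for word in task_words):
            kept.append(line.strip())
    return "\n".join(kept)


@dataclass
class CuratedAgent:
    """Pairs a curator with a frozen executor that only sees curated memory."""
    task: str
    curator: Callable[[str, str], str]   # (observation, task) -> curated text
    executor: Callable[[str, str], str]  # (task, memory) -> next action
    memory: List[str] = field(default_factory=list)

    def step(self, observation: str) -> str:
        curated = self.curator(observation, self.task)
        if curated:  # store only the high-signal remainder
            self.memory.append(curated)
        return self.executor(self.task, "\n".join(self.memory))


if __name__ == "__main__":
    page = "\n".join([
        "<style>.a{color:red}</style>",
        "<nav>Home | Cart | price list</nav>",
        '<div class="ad">cheap price deals</div>',
        "<span>Laptop price: $899</span>",
    ])
    agent = CuratedAgent(
        task="Find the laptop price",
        curator=toy_curator,
        # Placeholder executor: a real system would call the frozen LLM here.
        executor=lambda task, memory: f"reason over: {memory}",
    )
    print(agent.step(page))
```

The design point the sketch tries to capture is that the executor never sees the raw observation stream, only what the curator chose to keep, so the two roles can be filled by different models and swapped independently.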