Ríos‑García: base models drive variance
- Martiño Ríos‑García and coauthors reported on April 20 that, across more than 25,000 runs, base model choice shaped scientific-agent performance far more than scaffolding did.
- Across eight domains, the paper says base models accounted for 41.4% of explained variance versus 1.5% for scaffolds, while agents ignored evidence in 68% of traces.
- The result cuts against prompt-wrapper optimism around “AI scientists” and shifts attention toward model capability itself. (arxiv.org)
Large language model agents can run scientific workflows, but a new paper says the base model underneath them still does most of the work. (arxiv.org)

Martiño Ríos‑García and seven coauthors posted the paper, “AI scientists produce results without reasoning scientifically,” to arXiv on April 20, 2026. It evaluates scientific agents across eight domains and more than 25,000 agent runs. (arxiv.org)

The authors split the problem in two: how often agents get the task right, and how they reason while doing it. They then decomposed performance into the contribution from the base model and the contribution from the scaffold, the wrapper that handles prompts, tools, and workflow steps. (arxiv.org)

Their headline number is stark: base model choice accounted for 41.4% of explained variance, while the scaffold accounted for 1.5%. In the paper’s framing, swapping the underlying model changed outcomes much more than changing the orchestration around it. (arxiv.org)

The behavioral results are similarly blunt. Across configurations, agents ignored evidence in 68% of traces, showed refutation-driven belief revision in 26%, and rarely assembled converging evidence from multiple tests. (arxiv.org)

Scientific reasoning here means more than producing a plausible answer. The paper argues that a system should update beliefs when data contradicts a hypothesis and should combine multiple pieces of evidence instead of skating past them. (arxiv.org)

The authors say those failures appeared in both kinds of tasks they studied: straightforward workflow execution and open-ended hypothesis-driven inquiry. They also say the pattern persisted even when agents were given near-complete successful reasoning trajectories as context. (arxiv.org)

That leaves a narrower role for scaffold engineering than many agent builders advertise. The paper concludes that outcome-based benchmarks can miss process failures, and that scaffold tweaks alone cannot repair weak reasoning if the underlying model has not learned it. (arxiv.org)

The closing claim is practical as much as philosophical: until reasoning itself becomes a training target, the paper says, scientific knowledge produced by these agents cannot be justified by the process that generated it. (arxiv.org)
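For readers unfamiliar with variance decomposition, the idea behind the model-versus-scaffold split can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method, which is not described here: it assumes a balanced grid with one score per model and scaffold pair and uses a plain sum-of-squares split. The function name `variance_shares`, the labels `model_a` and `scaffold_x`, and every number in `toy_scores` are hypothetical, invented purely for illustration.

```python
from statistics import mean


def variance_shares(scores):
    """Split total score variance into model, scaffold, and residual shares.

    `scores[model][scaffold]` holds one score per cell. A balanced grid is
    assumed: every model is paired with every scaffold exactly once.
    """
    models = list(scores)
    scaffolds = list(next(iter(scores.values())))
    cells = [scores[m][s] for m in models for s in scaffolds]
    grand = mean(cells)

    # Total sum of squared deviations from the grand mean.
    ss_total = sum((x - grand) ** 2 for x in cells)

    # Main effect of the model: spread of per-model means around the grand mean.
    ss_model = len(scaffolds) * sum(
        (mean(scores[m][s] for s in scaffolds) - grand) ** 2 for m in models
    )

    # Main effect of the scaffold: spread of per-scaffold means.
    ss_scaffold = len(models) * sum(
        (mean(scores[m][s] for m in models) - grand) ** 2 for s in scaffolds
    )

    model_share = ss_model / ss_total
    scaffold_share = ss_scaffold / ss_total
    return {
        "model": model_share,
        "scaffold": scaffold_share,
        "residual": 1.0 - model_share - scaffold_share,
    }


# Hypothetical scores, NOT taken from the paper.
toy_scores = {
    "model_a": {"scaffold_x": 0.30, "scaffold_y": 0.34},
    "model_b": {"scaffold_x": 0.55, "scaffold_y": 0.53},
    "model_c": {"scaffold_x": 0.72, "scaffold_y": 0.76},
}

shares = variance_shares(toy_scores)
for factor, share in shares.items():
    print(f"{factor}: {share:.1%}")
```

In this toy grid, scores move a lot from one model to the next and barely at all from one scaffold to the other, so nearly all of the variance lands on the model factor. A study with many runs per cell, unbalanced cells, or several domains would likely need a fuller statistical model than this sketch.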