An AI that can run research loops
A new system called ASI-Evolve reportedly ran end-to-end research cycles (designing experiments, analyzing results, and improving its own approach) and produced large gains in model design and benchmark scores. The paper claims ASI-Evolve discovered 105 state-of-the-art linear attention architectures, raised MMLU scores by more than 18 points through data pipeline changes, and produced RL algorithms that beat GRPO by up to 12.5 points on math benchmarks, raising questions about how much of AI R&D can be automated. (x.com)

Most artificial intelligence systems are good at one move, not the whole game. They can answer a question, write code, or summarize a paper, but they usually stop before the slow part begins: deciding what to test next, running the test, reading the result, and changing course. A new paper argues that this boundary is starting to move. The system, called ASI-Evolve, is presented as an artificial intelligence researcher that can run repeated research cycles on its own across model architecture, training data, and reinforcement learning algorithms. (arxiv.org)

To see why that claim matters, it helps to start with how modern artificial intelligence research usually works. A team begins with a hunch, turns that hunch into code, launches experiments that can take hours or days, compares the numbers, writes down what failed, and tries again. Progress often comes less from one brilliant idea than from surviving dozens or hundreds of these loops without losing track of what each result means. That loop is expensive because the feedback is slow and messy. (arxiv.org)

This is different from the kind of task where an artificial intelligence agent gets an answer in seconds and can immediately tell whether it was right. In research, the reward signal is weak. A change might help one benchmark and hurt another. A promising result might disappear when the model is scaled up. An experiment might fail for a boring reason like bad data filtering rather than a deep scientific insight. The paper’s central claim is that ASI-Evolve can handle this longer, noisier style of work. (arxiv.org)

The framework described in the paper runs a four-step loop: learn, design, experiment, analyze. In plain English, that means it first reads from a stored body of prior knowledge, then proposes a new candidate method, then runs an actual evaluation, then turns the outcome into a lesson that can shape the next round.
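
As a rough illustration, the four-step loop can be sketched in a few lines of Python. This is a toy under stated assumptions, not the paper's implementation: the `Lesson` record, the function names, and the stand-in objective (a score that peaks at 3.0) are all hypothetical, and the real system evaluates architectures, data pipelines, and training rules rather than a single number.

```python
import random
from dataclasses import dataclass


@dataclass
class Lesson:
    """One structured takeaway from a finished experiment (hypothetical format)."""
    candidate: float
    score: float
    improved: bool


def learn(knowledge):
    """Learn: read the stored lessons and return the best result so far."""
    return max(knowledge, key=lambda lesson: lesson.score)


def design(best, rng, step=0.5):
    """Design: propose a new candidate by perturbing the best known one."""
    return best.candidate + rng.uniform(-step, step)


def experiment(candidate):
    """Experiment: toy stand-in for a training run; the score peaks at 3.0."""
    return -(candidate - 3.0) ** 2


def analyze(candidate, score, best):
    """Analyze: turn the raw score into a lesson that later rounds can read."""
    return Lesson(candidate, score, improved=score > best.score)


def run_loop(rounds=200, seed=0):
    """Run the learn, design, experiment, analyze cycle for a fixed budget."""
    rng = random.Random(seed)
    # Seed the knowledge store with a single baseline result.
    knowledge = [Lesson(0.0, experiment(0.0), improved=False)]
    for _ in range(rounds):
        best = learn(knowledge)
        candidate = design(best, rng)
        score = experiment(candidate)
        knowledge.append(analyze(candidate, score, best))
    return knowledge


if __name__ == "__main__":
    history = run_loop()
    print(learn(history))
```

Even in this toy, failed candidates are recorded alongside successful ones, so the store keeps the conditions of every run and not only the winners.
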
The authors say two pieces are especially important: a cognition base, which stores reusable human and machine priors, and an analyzer, which converts raw experiment results into structured takeaways. (arxiv.org)

That setup is trying to solve a very old lab problem. If you only generate ideas, you drown in bad ones. If you only run experiments, you drown in numbers. If you only summarize results, you risk forgetting the exact conditions that produced them. A useful research loop has to connect all three. ASI-Evolve is pitched as a system that keeps those pieces tied together tightly enough to improve itself over multiple rounds instead of acting like a one-shot assistant. (arxiv.org)

The paper says the system was tested on three core parts of artificial intelligence development. The first was neural architecture design, meaning the shape of the model itself. The second was pretraining data curation, meaning which text gets included or filtered before a model learns from it. The third was reinforcement learning algorithm design, meaning the rules used to improve a model from feedback after pretraining. The authors frame this as a broader test than most agent papers, which usually focus on a single narrow domain. (arxiv.org)

The architecture result is the most concrete. The paper says ASI-Evolve discovered 105 state-of-the-art linear attention architectures. Its best model beat DeltaNet by 0.97 points, which the authors describe as nearly three times the gain from recent human-designed improvements in that line of work. DeltaNet itself is a linear-time alternative to standard transformer attention that was designed to improve efficiency while staying competitive on language modeling and retrieval-heavy tasks. ASI-Evolve’s claim is not that it invented the whole field from scratch, but that it found better designs inside an already active research area. (arxiv.org)

The data result may be easier to grasp.
Before a large language model becomes useful, it is trained on huge piles of text, and the exact recipe for selecting and cleaning that text can change the final model a lot. ASI-Evolve reportedly improved average benchmark performance by 3.96 points through evolved data curation pipelines, with gains of more than 18 points on Massive Multitask Language Understanding, usually called MMLU. MMLU is a 57-subject test that covers areas like mathematics, history, law, and computer science, so a jump that large suggests the system may have found a much better way to choose what the model studies before the exam. (arxiv.org)

The reinforcement learning result is the flashiest. The paper says ASI-Evolve discovered algorithms that beat Group Relative Policy Optimization, or GRPO, by up to 12.5 points on American Mathematics Competitions 32, 11.67 points on American Invitational Mathematics Examination 2024, and 5.04 points on OlympiadBench. GRPO became widely known through DeepSeek’s math work as a way to train reasoning models while avoiding some of the cost of older reinforcement learning setups. Beating it on hard math benchmarks is the kind of claim that will get immediate attention from labs building reasoning systems. (arxiv.org)

There is also a subtle shift in what counts as “automation” here. Earlier generations of artificial intelligence coding agents mostly sped up the hands-on part of research: writing scripts, fixing bugs, launching jobs. ASI-Evolve is aimed one layer higher. It is trying to automate the choice of what to test, how to interpret the result, and how to turn that interpretation into the next idea. That is much closer to the work that senior researchers and research leads actually spend time on. (arxiv.org)

That does not mean the paper proves artificial intelligence can now replace an artificial intelligence lab.
The results come from the authors' own arXiv preprint, posted on March 31, 2026; they are not yet a consensus finding reproduced across the field. The benchmarks are impressive, but they still live inside experimental settings chosen by the paper’s authors. The strongest near-term reading is narrower: parts of artificial intelligence research that look like repeated search over code, data recipes, and training rules may be more automatable than many people assumed a year ago. (arxiv.org)

If that reading holds up, the consequences are straightforward. A system that can run many research loops without getting tired changes the bottleneck from “Who has the next idea?” to “Who has the compute, the evaluation harness, and the taste to decide where these loops should run?” In that world, the fastest labs may not just build better models. They may build better systems for discovering better models. ASI-Evolve is an early argument that this shift has already started. (arxiv.org)