Agent ran 1,773 experiments
An AI agent called ASI-Evolve ran 1,773 rounds of neural-architecture experiments on its own and produced 105 designs that beat the human-designed DeltaNet baseline. The top design improved the reported score by +0.97 points, versus about +0.34 for recent human-designed gains. (x.com/sukh_saroy/status/2042005315689050157)
Neural architecture search is the idea that instead of a human hand-drawing every layer in a model, you let a system try many blueprints and keep the ones that score better on tests, like breeding faster race cars from thousands of prototypes. In language models, that search is expensive because every promising blueprint has to be trained and measured before you know if it works. (arxiv.org)

The target here was linear attention, a family of model designs built to handle long sequences without the usual memory bill of standard softmax attention. Linear attention works more like a running summary than a giant all-to-all comparison, so it can be cheaper at long context lengths but often lags behind the best transformer models in quality. (arxiv.org)

One strong human-made design in that area is DeltaNet, which updates memory by replacing old key-value information in a more targeted way than earlier linear-attention systems. A 1.3 billion parameter DeltaNet model trained on 100 billion tokens beat other linear-time baselines such as Mamba and Gated Linear Attention in the NeurIPS 2024 paper that made it a serious benchmark. (arxiv.org)

The new paper says an agent called ASI-Evolve did not just suggest tweaks around the edges. It ran a full loop of reading prior work, proposing designs, writing code, launching experiments, and analyzing results across repeated rounds, which the authors call a learn-design-experiment-analyze cycle. (arxiv.org)

In the architecture part of the project, the system generated 1,350 candidate models over 1,773 exploration rounds and found 105 linear-attention architectures that beat the human-designed DeltaNet baseline. The best one improved the reported score by +0.97 points, while the paper says recent human-designed gains over the same line of work were about +0.34. (arxiv.org, openreview.net)

The way it searched is closer to evolution than to a single brainstorm. The framework keeps a “cognition base” of reusable ideas from earlier rounds and an “analyzer” that turns messy experiment logs into lessons the next round can use, so the agent is not starting from zero each time. (arxiv.org)

This was not limited to architecture search. The same paper reports a data-curation pipeline that raised average benchmark performance by +3.96 points, with gains above 18 points on Massive Multitask Language Understanding, and reinforcement-learning algorithms that beat Group Relative Policy Optimization by as much as +12.5 points on AMC32 and +11.67 on AIME24. (arxiv.org)

There is also a caution flag attached to the headline numbers. An earlier version of the architecture work was rejected at the International Conference on Learning Representations 2026, and the area chair summary said reviewers found the gains modest or inconsistent on some benchmarks, wanted stronger ablation tests for the key modules, and said comparisons to other automated-discovery methods were not rigorous enough. (openreview.net)

So the cleanest reading is not “human researchers are obsolete.” It is that one agentic system was able to spend thousands of experimental shots on a narrow, compute-heavy design problem and come back with a large pile of better candidates, which is exactly the kind of work that usually burns months of researcher time. (arxiv.org, openreview.net)

If the result survives follow-up scrutiny, the shift is simple to picture: human researchers set the goal, the budget, and the guardrails, and the agent runs the grind of proposing, coding, testing, and learning from failure at machine speed. The paper’s authors say they also saw early transfer into mathematics and biomedicine, which suggests the bigger bet is not one better attention block but a reusable research loop. (arxiv.org, github.com)
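
For readers who want the mechanics behind the "running summary" picture of linear attention and DeltaNet's more targeted replacement of old key-value information, here is a minimal toy sketch in plain Python. It is an illustration under stated assumptions, not code from the paper: the function names, the 2x2 memory, and the fixed `beta` are all made up for the example, and it shows only the baseline update rules as they are commonly written, not any of the architectures ASI-Evolve found.

```python
# Toy comparison of two fixed-size memory updates (standard library only).
# All names and numbers below are illustrative assumptions.

def outer(v, k):
    """Outer product v k^T, returned as a list of rows."""
    return [[vi * kj for kj in k] for vi in v]

def read(S, q):
    """Read from the memory matrix: o = S q."""
    return [sum(sij * qj for sij, qj in zip(row, q)) for row in S]

def additive_update(S, k, v):
    """Plain linear-attention style update: S <- S + v k^T.
    New information is added on top; nothing old is removed."""
    vk = outer(v, k)
    return [[s + d for s, d in zip(rs, rd)] for rs, rd in zip(S, vk)]

def delta_update(S, k, v, beta=1.0):
    """Delta-rule style update: S <- S - beta * (S k - v) k^T.
    S k is what the memory currently returns for key k; the update
    moves that stored value toward the new value v."""
    err = [old - new for old, new in zip(read(S, k), v)]
    corr = outer(err, k)
    return [[s - beta * c for s, c in zip(rs, rc)] for rs, rc in zip(S, corr)]

def zeros(n, m):
    return [[0.0] * m for _ in range(n)]

key = [1.0, 0.0]                      # a unit-norm key, written twice
first, second = [1.0, 2.0], [5.0, 5.0]

S_add = additive_update(additive_update(zeros(2, 2), key, first), key, second)
S_delta = delta_update(delta_update(zeros(2, 2), key, first), key, second)

print(read(S_add, key))    # [6.0, 7.0]  old and new values pile up
print(read(S_delta, key))  # [5.0, 5.0]  old value replaced by the new one
```

Both memories stay the same size no matter how many tokens are written, which is where the long-context savings come from. The difference is what happens when a key is written twice: the additive version blends the two values, while the delta-rule version (with `beta=1.0` and a unit-norm key, as assumed here) overwrites the first with the second.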