AI researcher outperforms humans
An autonomous AI system ran thousands of neural-architecture experiments and produced far more winning designs than human teams, a sign that AI can now iterate on model structure itself, not just weights. The project ran 1,773 architecture rounds and found 105 linear attention designs that beat human baselines. The top automated design improved on the DeltaNet baseline by 0.97 points on the reported benchmark, versus about 0.34 points for recent human-designed improvements, and the evolved data-curation pipeline added 3.96 points on average. The team also reported reinforcement-learning gains of up to 12.5 points on AMC32 and a 6.94-point lift in drug-prediction AUROC, suggesting this automation could shorten model R&D cycles across domains. (x.com)
Most artificial intelligence systems learn only by changing their weights, the billions of tiny numeric knobs adjusted during training. The blueprint those knobs live inside, called the model architecture, is still usually drawn by hand. (arxiv.org) That blueprint decides where information flows, how memory is stored, and which pieces talk to each other, like choosing the floor plan of a factory before turning on the machines. A bad floor plan can waste months of training even if the weights are tuned perfectly. (arxiv.org)
Researchers have tried to automate that step for years with neural architecture search, software that tests many blueprints instead of asking a team to guess the best one. A 2021 benchmark called NAS-Bench-360 found that many search methods still struggled to beat simple baselines or human-designed models on most tasks. (openreview.net)
The new system, called ASI-Evolve, tries to automate the whole research loop instead of just sampling random designs. Its paper says the agent runs a repeated cycle of learn, design, experiment, and analyze, while storing reusable lessons in a “cognition base” for the next round. (arxiv.org)
In the architecture test, ASI-Evolve ran 1,773 rounds and produced 105 state-of-the-art linear attention designs. Linear attention is a way of building long-context models that aims to use less memory than standard attention when sequences get large. (x.com) (arxiv.org) The top machine-found design beat DeltaNet by 0.97 points on the reported benchmark, while the paper says recent human-designed improvements were about 0.34 points. That means the automated gain was roughly three times the size of the cited human gain on that setup. (arxiv.org)
The same framework was also pointed at training data, which is the pile of text or examples a model learns from before it ever answers a question.
The paper reports an average gain of 3.96 points from the evolved data-curation pipeline, with more than 18 points on Massive Multitask Language Understanding, a broad exam-style benchmark usually shortened to MMLU. (arxiv.org)
It also searched for better reinforcement learning rules, which are the trial-and-error recipes used to reward good behavior and punish bad behavior during training. On the AMC32 math benchmark, the discovered algorithm beat Group Relative Policy Optimization, or GRPO, by up to 12.5 points, with another 11.67-point gain on AIME24 and 5.04 points on OlympiadBench. (arxiv.org)
The paper says the method carried over into biomedicine too, where it improved drug-prediction area under the receiver operating characteristic curve, or AUROC, by 6.94 points. That score measures how well a model separates likely positives from likely negatives across all decision thresholds, so a higher number means cleaner ranking, not just one lucky cutoff. (x.com) (arxiv.org)
This is still an arXiv paper posted on March 31, 2026, not a peer-reviewed journal result, and the strongest claims come from the authors’ own benchmarks and comparisons. But if systems can now redesign architectures, training rules, and data pipelines inside one loop, the slowest part of artificial intelligence research may stop being the training run and start being how fast anyone can afford to let the agent keep experimenting. (arxiv.org)
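For readers who want to see what the area-under-the-curve score from the drug-prediction result actually measures, here is a minimal Python sketch. It uses the standard pairwise definition: the share of positive/negative pairs in which the positive example gets the higher score, with ties counted as half. The function name `auroc` and the toy labels and scores are invented for illustration; none of this comes from the ASI-Evolve paper or its data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie is worth half a win
    return wins / (len(positives) * len(negatives))


# Toy data, made up for this example (not from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(auroc(toy_labels, toy_scores))
```

In the toy data, eight of the nine positive/negative pairs are ordered correctly, so the score is 8/9. No cutoff appears anywhere in the calculation, which is why the metric rewards clean ranking instead of one well-chosen threshold. The double loop is fine for a handful of examples; on real datasets a rank-based routine such as scikit-learn's `roc_auc_score` would be the usual choice.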