Automated model evolution claims
Two social posts described research advances: one said the METR benchmark shows frontier models now handling 15+ hours of expert work at roughly 50% reliability, and the other said an ASI-Evolve framework produced 105 architectures that outperform human designs, along with sizable data and algorithm gains. Both posts suggest that automated architecture search and evaluation techniques are accelerating model-level improvement. (x.com)
Artificial intelligence researchers are trying to measure two things at once: how long a model can work on its own, and whether models can now help design better models. (metr.org)
Model Evaluation and Threat Research, or METR, said in a March 19, 2025 paper that it tested frontier systems on 170 software tasks and estimated a “50% task-completion time horizon” by comparing model success rates with how long human experts took on the same work. The group reported that recent frontier agents could handle tasks that take humans about 50 minutes at a 50% success rate, and that this horizon had doubled about every seven months since 2019. (arxiv.org)
METR’s public tracker, last updated March 3, 2026, says the benchmark now shows much longer horizons for the newest systems and defines the metric as the task duration where an agent succeeds half the time. METR also said a 50% horizon does not mean those tasks are ready to be delegated in production, because many real jobs need far higher reliability. (metr.org)
A separate paper posted to arXiv on March 30, 2026 described ASI-Evolve, an “agentic” search system that iterates through ideas, runs experiments, and keeps records of what worked. The authors said it improved three parts of model building: data curation, learning algorithms, and neural architecture design, which is the process of choosing how a model is wired internally. (arxiv.org)
In that architecture search, the ASI-Evolve authors said the system found 105 state-of-the-art linear attention designs, and that the best one beat DeltaNet by 0.97 points. The paper said that gain was nearly three times larger than recent human-designed improvements in the same line of work. (arxiv.org)
The same paper said the system also found new data mixtures that raised training efficiency and reinforcement learning updates that improved performance in coding and math settings. The authors framed the project as a unified framework for using artificial intelligence to improve the ingredients that go into later models. (arxiv.org)
These results are claims from research papers and project pages, not audited industry standards or product guarantees. METR’s own limitations note says time horizon is an estimate with wide confidence intervals and is not the same thing as “how long AIs can work independently.” (metr.org)
The two lines of work meet at the same pressure point: one tries to measure how much expert labor current systems can replace on software tasks, while the other tries to automate the search for better systems. If both trends hold up under replication, the pace of model improvement would depend less on hand-tuned human trial and error. (metr.org; arxiv.org)
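As a rough illustration of the time-horizon metric described above, the Python sketch below fits a logistic curve of task success against the logarithm of human completion time and reads off the duration at which predicted success is 50%. The task times and outcomes are made-up toy values, and the function names (fit_logistic, horizon_50, projected_horizon) are hypothetical. This is not METR's code or data, and the fitting method here is plain gradient ascent, chosen only to keep the example free of dependencies. The last helper shows the arithmetic implied by a fixed doubling time, using the 50-minute and seven-month figures quoted above; it is not a forecast.

```python
import math

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood.

    xs: log2 of human task time in minutes; ys: 1 if the agent succeeded, else 0.
    Inputs are centered internally for numerical stability.
    """
    mean_x = sum(xs) / len(xs)
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            xc = x - mean_x
            p = 1.0 / (1.0 + math.exp(-(a + b * xc)))
            grad_a += y - p
            grad_b += (y - p) * xc
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    # Convert the intercept back to the uncentered scale.
    return a - b * mean_x, b

def horizon_50(a, b):
    """Task length in minutes where the fitted success probability is 0.5.

    Solves a + b * log2(t) = 0 for t.
    """
    return 2.0 ** (-a / b)

def projected_horizon(start_minutes, months_elapsed, doubling_months=7.0):
    """Horizon after months_elapsed if it doubles every doubling_months."""
    return start_minutes * 2.0 ** (months_elapsed / doubling_months)

# Toy data (assumed, not from any benchmark): tasks of 1, 2, 4, ... 128 minutes.
log2_minutes = [0, 1, 2, 3, 4, 5, 6, 7]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

a, b = fit_logistic(log2_minutes, outcomes)
print(round(horizon_50(a, b), 1))   # about 11.3 minutes for this toy data
print(projected_horizon(50, 14))    # two doublings of 50 minutes: 200.0
```

A fitted slope b below zero means success becomes less likely as tasks get longer. Because the estimate comes from a fitted curve over a finite task set, small changes in the data can move the horizon noticeably, which is consistent with the wide confidence intervals METR describes.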