ASI‑Arch discovers 106 attention designs

- Researchers behind ASI-Arch said their autonomous system discovered 106 state-of-the-art linear attention architectures after running 1,773 self-directed experiments over 20,000 GPU hours.
- The paper landed on arXiv on July 24, 2025, and claims a compute-to-discovery scaling law: more experiments produced more architectural breakthroughs.
- That matters because architecture design has mostly been human-bottlenecked; ASI-Arch argues model-building itself can now become a scalable search process.

Transformer architecture is the domain here, specifically attention, the part of a model that decides what information to keep in view. The stakes are simple: better attention can make models cheaper, faster, and better at long contexts. The gap is that architecture design still depends heavily on humans trying ideas one by one. What changed is that a research team put out ASI-Arch, an autonomous system that does that search itself and says it found 106 new state-of-the-art linear attention designs after 1,773 experiments.

### What is ASI-Arch actually doing?

ASI-Arch is not just a parameter tuner. The paper frames it as an end-to-end research loop: propose a new attention idea, write the code, train it, evaluate it, store the result, then use those results to guide the next round. That is the important jump. Older neural architecture search usually explored a human-defined menu. ASI-Arch is pitched as automated innovation, not just automated selection. (arxiv.org)

### Why focus on linear attention?

Standard transformer attention gets expensive fast as context length grows. Linear attention is the family of tricks meant to cut that cost, but the tradeoff is usually quality or recall. That makes it a perfect search target: there are lots of possible design choices, and humans still do not have a settled best answer. ASI-Arch went after that messy design space rather than a solved one.

### What did the system claim to find?

The headline number is 106 architectures that the authors describe as state of the art. (arxiv.org) The run used 20,000-plus GPU hours and 1,773 autonomous experiments. The repository also says the team open-sourced all 106 designs, which matters because this is not just a benchmark claim in a PDF: other model builders can inspect and reuse the outputs.

### Why call this an “AlphaGo moment”?

The comparison is about surprise. AlphaGo’s Move 37 mattered because it showed a machine could find a strong move humans did not naturally reach. (arxiv.org) The paper makes the same argument for model design: the discovered architectures exposed design patterns that beat human baselines and looked non-obvious enough to count as genuine invention, not brute-force recombination. That is the core claim, anyway. (arxiv.org)

### Is the bigger story the 106 designs?

Not really. The bigger story is the scaling claim. The authors say they found an empirical scaling law for scientific discovery itself: basically, more compute led to more architecture breakthroughs in a fairly regular way. If that holds up, then architecture research starts to look less like artisanal craft and more like an industrial search process.

### What is the catch?

A state-of-the-art claim inside a narrow benchmark slice is not the same thing as rewriting all model design. (arxiv.org) This work is about linear attention, not every architecture problem. And arXiv plus GitHub is not the same as broad adoption. The real test is whether outside labs use these designs, reproduce the gains, and extend the method beyond this one family of modules.

### Why should model builders care?

Because attention blocks are load-bearing parts of modern models. (arxiv.org) If an autonomous system can reliably invent better ones, then future labs may spend less time hand-designing layers and more time setting objectives, budgets, and evaluation rules. The human role shifts upward, from drawing the circuit to steering the search.

### Bottom line?

ASI-Arch is interesting less because one paper found 106 good ideas and more because it treats architecture research itself as something compute can scale. (arxiv.org) If that generalizes, “designing the model” stops being the bottleneck and starts becoming another loop you can automate.
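
For readers who want to see why linear attention is cheaper as context grows, here is a minimal sketch of the generic kernelized form. It is not one of the 106 ASI-Arch designs, and nothing in it comes from the paper or its repository. The feature map `elu(x) + 1`, the function names, and the toy sizes are assumptions chosen for illustration, and the example is non-causal. The point is only that reordering the matrix products avoids building an n-by-n score matrix, while swapping softmax for a kernel is where the usual quality or recall tradeoff comes from.

```python
import numpy as np


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def attention_quadratic(q, k, v):
    """Kernelized attention computed the expensive way: builds an n x n matrix."""
    scores = feature_map(q) @ feature_map(k).T            # shape (n, n)
    weights = scores / scores.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ v


def attention_linear(q, k, v):
    """Same quantity, reordered so cost grows linearly with sequence length n."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # shape (d, d_v): one summary of all keys and values
    z = phi_k.sum(axis=0)      # shape (d,): normalizer shared by every query
    return (phi_q @ kv) / (phi_q @ z)[:, None]


# Toy sizes (hypothetical): sequence length n, head dimension d.
rng = np.random.default_rng(0)
n, d = 512, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = attention_quadratic(q, k, v)
out_linear = attention_linear(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

Both functions compute the same output; only the order of multiplication differs. The quadratic version does work proportional to n² · d, the linear version to n · d², which is the gap that matters once n is much larger than d.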
