ASI‑Arch finds 106 attention designs
- ASI‑Arch, a multi‑agent research system from Shanghai Jiao Tong collaborators, was shown to autonomously design and test linear‑attention architectures, yielding 106 new state‑of‑the‑art variants.
- The run spanned 1,773 experiments and more than 20,000 GPU hours, with the team claiming the search uncovered design patterns humans had not mapped.
- The bigger point is workflow: architecture discovery starts to look compute‑scalable, not purely bottlenecked by human trial and intuition.

Transformer design has a nasty bottleneck. Everybody knows attention is the engine, but once you try to make it cheaper for long contexts, the design space explodes. There are too many knobs, too many tradeoffs, and too many ways to build something fast that quietly gets worse. What changed here is that the team behind a system called ASI‑Arch says it let an AI research loop do that exploration itself, and it came back with 106 new linear‑attention architectures that beat prior baselines.

### What is linear attention, exactly?

Regular transformer attention gets expensive as sequences get longer because the work scales quadratically with sequence length. That is why long context is such a pain: double the length and the attention cost roughly quadruples. Linear attention is one family of fixes. It tries to keep the cost growing roughly linearly instead, which makes long inputs much more practical. But the catch is quality: many efficient variants save compute by giving up some of the expressive behavior that made standard attention so good. (arxiv.org)

### Why is this design problem so hard?

Because “linear attention” is not one trick. It is a huge menu of choices: feature maps, gating, decay, state updates, normalization, mixing rules, memory structure. Small changes can matter a lot, and the combinations multiply quickly. Human researchers usually explore only a thin slice of that space, partly because every serious idea has to be implemented, trained, debugged, and compared. That makes architecture work feel less like inspiration and more like expensive lab work. (arxiv.org)

### So what did ASI‑Arch actually do?

Basically, it ran an autonomous research loop. The paper describes a system that proposes new architectural ideas, turns them into executable code, trains them, evaluates them, stores the results, and uses that history to guide the next round.
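
The propose, train, evaluate, store cycle described above can be sketched in a few lines of Python. This is a toy illustration, not ASI‑Arch's code: the `decay` and `gate` knobs, the `propose` and `train_and_evaluate` helpers, and the scoring function are all hypothetical stand-ins, and the "training" step is replaced by a made-up objective so the loop runs instantly.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Experiment:
    config: Dict[str, float]  # hypothetical design knobs, each in [0, 1]
    score: float              # toy metric, higher is better


def propose(archive: List[Experiment], rng: random.Random) -> Dict[str, float]:
    """Propose a config: random at first, then a perturbation of the best so far."""
    if not archive:
        return {"decay": rng.random(), "gate": rng.random()}
    best = max(archive, key=lambda e: e.score)
    # Nudge each knob with small Gaussian noise and clamp it back into [0, 1].
    return {
        name: min(1.0, max(0.0, value + rng.gauss(0.0, 0.1)))
        for name, value in best.config.items()
    }


def train_and_evaluate(config: Dict[str, float]) -> float:
    """Stand-in for real training: an invented objective peaking at (0.7, 0.3)."""
    return -((config["decay"] - 0.7) ** 2 + (config["gate"] - 0.3) ** 2)


def research_loop(n_rounds: int, seed: int = 0) -> List[Experiment]:
    """Run propose -> evaluate -> store for n_rounds and return the full history."""
    rng = random.Random(seed)
    archive: List[Experiment] = []
    for _ in range(n_rounds):
        config = propose(archive, rng)       # uses stored history
        score = train_and_evaluate(config)   # toy replacement for a training run
        archive.append(Experiment(config, score))  # store the result
    return archive


if __name__ == "__main__":
    history = research_loop(50)
    best = max(history, key=lambda e: e.score)
    print(f"{len(history)} experiments, best config {best.config}, score {best.score:.4f}")
```

The point of the sketch is the shape of the loop: each round reads the stored history before proposing, so results accumulate instead of being independent draws. A real system would replace the perturbation step with generated architecture code and the toy objective with actual training runs.
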
The GitHub repo frames this as a multi‑agent pipeline with an architecture database and a “cognition base” for accumulated research knowledge. In other words, not just brute‑force search over a fixed menu, but a system that keeps generating new candidates as it learns what works. (arxiv.org)

### What were the headline results?

The team says ASI‑Arch ran 1,773 autonomous experiments over more than 20,000 GPU hours and discovered 106 novel state‑of‑the‑art linear‑attention architectures. They also say performance kept improving over the course of the run rather than stalling immediately, which matters because it suggests the loop was not just getting lucky once. The repo open‑sources those 106 architectures, though the strongest headline claims still live in the paper and project materials rather than in an independent benchmark roundup. (arxiv.org)

### Why call this an “AlphaGo moment”?

The analogy is pretty direct. AlphaGo’s famous move mattered because it showed search could surface strategies strong humans had not considered. Here the authors are making the same argument for model architecture: that an AI system can discover useful design patterns outside the usual human playbook. That does not mean the system “invented intelligence.” It means architecture research may be shifting from handcrafting a few clever ideas to running a disciplined, compute‑heavy discovery process. (arxiv.org)

### Is the scaling-law claim the real story?

Maybe, yes. The paper’s boldest claim is not just the 106 architectures. It is that scientific discovery itself may show an empirical scaling law: more compute, more experiments, more breakthroughs. That is still an early claim, and it needs replication in other domains. But if it holds, the implication is huge: model design stops being only a talent bottleneck and starts looking more like something labs can scale with infrastructure. (arxiv.org)

### What should readers be skeptical about?

First, this is an arXiv paper, not a settled consensus result. Second, “state of the art” in efficient attention can depend a lot on benchmark choice, model size, and training setup. Third, the search happened in linear attention, a very important niche but still one niche. So the result is best read as proof that autonomous architecture discovery can work in a hard, valuable subproblem, not proof that AI can now redesign all of deep learning on command. (arxiv.org)

### Bottom line?

The interesting thing here is not one magic attention block. It is the shift in how those blocks get found. If ASI‑Arch’s results hold up, the frontier moves from “who has the cleverest idea” to “who can run the best autonomous research loop.” (arxiv.org)