ASI-Arch finds 106 attention designs
- Shanghai Jiao Tong University researchers posted ASI-Arch, an autonomous multi-agent system that designed, coded, trained, and tested 106 new linear attention architectures.
- In one reported run, ASI-Arch executed 1,773 experiments over more than 20,000 GPU hours, and the paper claims those designs reached SOTA results.
- The bigger shift is methodological: model architecture search is starting to look compute-scalable, not bottlenecked by human researchers alone.

Attention is one of the most expensive bottlenecks in modern AI. Transformers work well, but standard self-attention gets painfully costly as context windows grow because compute and memory rise roughly with the square of sequence length. That is why labs keep chasing “linear attention”: versions that try to keep the useful parts of attention while making long contexts cheaper. Now a team led by researchers at Shanghai Jiao Tong University says an autonomous system called ASI-Arch found 106 state-of-the-art linear attention designs by running its own research loop.

### What did ASI-Arch actually do?

It did more than sweep hyperparameters. The paper describes ASI-Arch as a multi-agent system that proposes architecture ideas, turns them into executable code, trains them, evaluates them, stores the results, and uses past experiments plus paper-derived “cognitions” to plan the next round. The GitHub repo says the full pipeline, database, and cognitive library are open-sourced, along with all 106 discovered architectures. (arxiv.org)

### Why focus on linear attention?

Because vanilla attention is great but expensive. Standard softmax attention compares every token with every other token, which creates quadratic time and memory costs as sequences get longer. Linear attention methods try to rewrite that computation so the cost grows linearly instead, often by compressing past keys and values into a running summary. The tradeoff is that many linear methods lose accuracy or recall compared with full attention. (arxiv.org)

### So what is the actual claim?

The headline number is 1,773 autonomous experiments over 20,000 GPU hours, producing 106 “innovative, state-of-the-art” linear attention architectures. The paper frames this as a move from automated optimization to automated innovation, meaning the system was not just searching inside a fixed human-written menu, but generating and testing new design concepts on its own.
That is the bold part of the claim. (arxiv.org)

### Why call this an “AlphaGo moment”?

The comparison is not about beating humans in a game. It is about discovering weird but effective moves humans were not naturally finding. The paper says some of the AI-designed architectures exposed emergent design principles that systematically beat human baselines, in the same spirit as AlphaGo’s famous unexpected moves. That framing is a little dramatic, but the core idea is clear: use AI to search parts of model design that are too large and too unintuitive for researchers to explore manually. (arxiv.org)

### Is this just old-school neural architecture search?

Not quite. Traditional NAS usually searches inside a space humans define in advance. ASI-Arch is pitched as a research agent that also writes hypotheses, code, and experiment plans, then updates its own direction from results. So the novelty is less “AI searched architectures”, which is old, and more “AI ran a bigger chunk of the scientific loop itself.” (arxiv.org)

### What is the catch?

The catch is that these are paper claims, not yet a field-wide consensus. “State of the art” depends on the benchmarks, scales, and baselines you choose. And linear attention is a crowded area: lots of methods look promising in narrow settings before hitting quality tradeoffs at larger scales or harder tasks. So the important thing is not that 106 designs now replace transformers everywhere. It is that autonomous systems are getting good enough to generate serious candidates for humans to verify.

### Why does this matter beyond attention?

Because if architecture discovery itself scales with compute, then AI research starts changing shape. The paper explicitly claims a first empirical scaling law for scientific discovery in this setting: more compute, more breakthroughs.
If that pattern holds beyond this project, labs may increasingly aim AI at the guts of AI: attention blocks, optimizers, training recipes, memory systems, maybe even whole model families. (arxiv.org)

### Bottom line

The immediate news is 106 new linear attention designs. The deeper news is that the machine may be inching from “tool for researchers” toward “researcher for tools.” If that keeps working, the pace limit on AI progress stops being only human imagination and starts looking a lot more like available compute. (arxiv.org)
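
### What does the “running summary” look like in code?

For readers who want the mechanics behind the linear attention idea described earlier, here is a minimal NumPy sketch. It is not one of the ASI-Arch architectures and is not taken from the paper. It assumes a generic kernelized form of causal attention with an `elu(x) + 1` feature map, which is one common choice in the linear attention literature. The function names (`feature_map`, `quadratic_causal_attention`, `linear_causal_attention`) are hypothetical and exist only for this illustration. The first function builds the full token-by-token score matrix. The second gets the same output by folding past keys and values into a fixed-size state.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1 keeps every feature positive. This choice is an assumption
    # of the sketch, not something specified by the article or ASI-Arch.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def quadratic_causal_attention(q, k, v):
    """Reference version: materializes the full (n, n) score matrix."""
    fq, fk = feature_map(q), feature_map(k)
    scores = np.tril(fq @ fk.T)  # causal mask: token t sees tokens 0..t
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def linear_causal_attention(q, k, v):
    """Same output, computed from a running summary of past keys/values."""
    n, d = q.shape
    fq, fk = feature_map(q), feature_map(k)
    summary = np.zeros((d, v.shape[1]))  # running sum of outer(phi(k_t), v_t)
    norm = np.zeros(d)                   # running sum of phi(k_t)
    out = np.empty((n, v.shape[1]))
    for t in range(n):
        summary += np.outer(fk[t], v[t])
        norm += fk[t]
        # Per-token work depends on d, not on how many tokens came before.
        out[t] = (fq[t] @ summary) / (fq[t] @ norm)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 4))
    k = rng.standard_normal((16, 4))
    v = rng.standard_normal((16, 8))
    same = np.allclose(linear_causal_attention(q, k, v),
                       quadratic_causal_attention(q, k, v))
    print("recurrent form matches quadratic form:", same)
```

The state carried between tokens (`summary` and `norm`) has a fixed size no matter how long the sequence gets, which is where the linear cost comes from. It is also where the tradeoff comes from: everything the model remembers about earlier tokens has to fit into that fixed-size summary.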