Researchers debate Transformers and post‑Transformer designs

- On May 20, researchers including Lukasz Kaiser publicly debated whether Transformers can keep scaling or whether post-Transformer designs must replace them. (theneuron.ai)
- The debate centered on attention’s O(n²) cost, with Llion Jones, Mathias Lechner and Adrian Kosowski arguing memory and efficiency bottlenecks now matter most. (theneuron.ai)
- The full Pathway debate video is available on YouTube, with Kaiser, Jones, Lechner and Kosowski as participants. (youtube.com)

On May 20, a public debate among AI researchers including Transformer co-author Lukasz Kaiser put a basic question back on the table: is the Transformer architecture still the right foundation for frontier models, or are newer designs needed for the next jump? Pathway hosted the event as a live “Transformers vs. Post-Transformers” showdown featuring Kaiser, Llion Jones, Mathias Lechner and Adrian Kosowski. (theneuron.ai)

The argument was not over whether Transformers work. It was over whether their current strengths can extend to longer context, durable memory and lower-cost reasoning without a more fundamental architectural change. The discussion drew attention because the participants came from both the original Transformer lineage and newer architecture efforts. (youtube.com)

Kaiser and Jones were co-authors of the 2017 “Attention Is All You Need” paper, while Lechner has argued for alternatives through Liquid AI and Kosowski has promoted Pathway’s BDH architecture. A May 20 post by probnstat on X cited the exchange as a public debate over scaling laws, memory and post-Transformer directions.

### Why are researchers arguing about Transformers now?

The Transformer remains the dominant model architecture behind large language models because attention lets the model relate tokens across a sequence in a flexible way. (theneuron.ai) But that same mechanism becomes expensive as context grows, because the compute and memory cost of standard self-attention grows roughly with the square of sequence length. In the Pathway debate, that cost was presented as one reason researchers are looking again at alternatives for long-horizon tasks.

Llion Jones argued at the event that the field risks architectural complacency if it assumes more scale alone will solve current limitations, according to The Neuron’s account of the debate. (theneuron.ai) That framing connected the architecture question to a broader research concern: whether current scaling trends can keep delivering gains at acceptable compute cost.

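To make the quadratic scaling concrete, here is a minimal sketch that counts the query-key scores a single attention head computes for a sequence of n tokens. The function names and the sequence lengths are illustrative assumptions, not figures from the debate, and the count ignores implementation-level optimizations.

```python
# Illustrative only: standard self-attention compares every token with
# every other token, so one head produces an n x n matrix of scores.

def attention_score_entries(seq_len: int) -> int:
    """Number of query-key scores one head computes for seq_len tokens."""
    return seq_len * seq_len


def growth_factor(short_len: int, long_len: int) -> float:
    """How many times more scores the longer sequence needs."""
    return attention_score_entries(long_len) / attention_score_entries(short_len)


if __name__ == "__main__":
    # Hypothetical context lengths chosen only to show the trend.
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens -> {attention_score_entries(n):,} scores per head")
    print("10x longer context ->", growth_factor(1_000, 10_000), "x more scores")
```

Under these assumptions, a context ten times longer needs a hundred times as many scores, which is the cost curve the post-Transformer side pointed to.
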
### What is the “post-Transformer” case actually about?

Mathias Lechner and Adrian Kosowski used the debate to press the case that memory, efficiency and continual adaptation are not side issues but central design constraints. The Neuron’s summary said the post-Transformer side focused on persistent memory, continual learning, reasoning over long horizons and hardware fit rather than raw benchmark inheritance from today’s models. (theneuron.ai)

Persistent memory is a key part of that case. Standard Transformers can condition on long context windows, but they do not natively maintain an explicit, addressable long-term memory in the way many researchers want for agents or systems that learn over time. (theneuron.ai) Recent papers on learned memory in Transformers have tried to add separate memory banks and routing mechanisms, reflecting the same pressure point raised in the debate.

### What did the pro-Transformer side say in response?

Lukasz Kaiser’s side of the argument was not that Transformers are perfect. It was that the architecture still has room to improve through better engineering, scaling and extensions before the field declares it obsolete, according to the event summary. That position reflects a pattern already familiar in AI: bottlenecks that first look architectural sometimes yield to better training, system design or hardware. (theneuron.ai)

A separate line of current research also supports the idea that attention can be understood as a form of associative memory rather than only a brute-force pattern matcher. An ICLR 2026 workshop talk based on a 2024 paper framed Transformer performance through associative-memory theory and argued empirical scaling laws alone do not explain all behavior. (arxiv.org)

### Why do memory and continual learning keep coming up?

Continual learning matters because many researchers want models that can update over time without catastrophic forgetting.
Long-horizon reasoning matters because coding agents, research agents and other systems increasingly need to carry goals and facts across many steps, not just one prompt window. Those requirements put pressure on architectures that rely on ever-larger context windows and expensive KV-cache handling. (theneuron.ai)

Hardware also sits inside the argument. The Neuron’s account said the debate repeatedly returned to hardware co-evolution, meaning the winning architecture may be the one that best matches how chips, memory bandwidth and inference systems are actually built. (arxiv.org)

### Where does this leave the field next?

The May 20 debate did not produce a winner in the technical sense, even if the live audience favored the Transformer side, according to The Neuron’s account. What it did produce was a clearer checklist for whatever comes next: lower-than-O(n²) sequence handling, more explicit memory, better continual learning and a design that fits future hardware. (theneuron.ai)

The next public reference point is the full Pathway video on YouTube, which documents the exchange among Kaiser, Jones, Lechner and Kosowski. Researchers tracking the issue are also likely to keep watching new memory-augmented and alternative-sequence-model papers, where those claims can be tested against training cost and long-context performance. (theneuron.ai) (youtube.com)
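
As a rough illustration of the KV-cache pressure mentioned earlier, the sketch below estimates how much memory a decoder needs to store keys and values for one sequence. The model shape (32 layers, 8 key-value heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for the arithmetic, not a model discussed in the debate.

```python
def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size in bytes for a single sequence.

    The leading factor of 2 covers one key vector and one value vector
    per token, per layer, per key-value head.
    """
    return 2 * seq_len * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, used only to show how the cache grows.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for tokens in (8_192, 131_072):
        gib = kv_cache_bytes(tokens, **shape) / 2**30
        print(f"{tokens:>7} tokens -> {gib:.1f} GiB of KV cache")
```

The cache grows linearly with context length, so under these assumed settings a single long sequence can occupy a large share of accelerator memory before any batching is considered.
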
