TTT-E2E long-input trick

Researchers unveiled TTT-E2E, a method that updates a language model’s weights during inference to keep accuracy stable on very long inputs. (x.com) The approach is being discussed as a way to avoid the usual quality drop when models handle extended documents or long conversations. (x.com)

Language models usually read long inputs by keeping a growing scratchpad of earlier tokens, and that scratchpad gets slower and less reliable as documents and chats stretch out. A December 2025 paper reports a different approach: update the model’s own weights while it is reading, so accuracy stays close to that of full-attention transformers even on very long contexts. (arxiv.org)

The method is called End-to-End Test-Time Training, or TTT-E2E. The paper was posted to arXiv on December 29, 2025, with a revised version on December 31. The author list includes Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, and Yu Sun. (arxiv.org)

The basic idea is to treat long-context reading as continual learning instead of a memory-architecture problem. The model uses a standard transformer with sliding-window attention, then keeps training itself on the current sequence with next-token prediction, folding what it has read into its parameters. (arxiv.org)

Most large language models are trained first and then frozen at deployment, so they can only “remember” long inputs by carrying more tokens forward in attention or cache. TTT-E2E changes that by letting the model adapt during inference, sequence by sequence, rather than relying only on a fixed context window. (developer.nvidia.com)

The paper says its 3 billion parameter models were trained on 164 billion tokens, and that TTT-E2E matched the way full-attention transformers scale with context length while alternatives such as Mamba 2 and Gated DeltaNet did not. The same paper says inference latency stays constant with context length, like a recurrent model, instead of rising with every added token. (arxiv.org)

On speed, the paper reports TTT-E2E was 2.7 times faster than full attention at 128,000 tokens of context. Nvidia’s technical write-up says the gap reached 35 times at 2 million tokens on an Nvidia H100, while keeping the same constant-latency pattern. (arxiv.org, developer.nvidia.com)

The tradeoff is that cost moves from serving to training. Nvidia’s write-up says the meta-learning stage used to prepare the model for these test-time updates is currently 3.4 times slower than standard pretraining at short contexts, in part because common fast-attention kernels do not yet support the needed higher-order gradients. (developer.nvidia.com)

The researchers released code on GitHub in a JAX implementation and published model checkpoints and experiment configs there. The repository says the setup was tested on GPUs with CUDA 12.8.1, cuDNN 9.8.0, and NCCL 2.26.2. (github.com)

The pitch is straightforward: long reports, codebases, logs, and extended chat histories often break today’s models because attention cost rises with every token. TTT-E2E tries to turn that growing prompt into temporary learning instead of ever-larger memory, so the model keeps reading without dragging the whole past behind it. (arxiv.org, developer.nvidia.com)
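The general idea of learning from the input while reading it can be sketched with a toy that is far simpler than the paper's setup. The example below is not the released JAX code and leaves out the transformer, the sliding-window attention, and the meta-learning stage. It assumes a hypothetical bigram logit table as the "model" and invented names (`read_stream`, `sgd_step`, `chunk_loss`, `chunk_size`, `lr`). It only shows that taking a next-token gradient step after each chunk lets a fixed-size set of weights absorb a long stream, while a frozen copy of the same weights does not improve.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def chunk_loss(weights, chunk):
    """Mean next-token cross-entropy of the bigram table over one chunk."""
    pairs = list(zip(chunk[:-1], chunk[1:]))
    return sum(-math.log(softmax(weights[p])[n]) for p, n in pairs) / len(pairs)

def sgd_step(weights, chunk, lr):
    """One in-place gradient step on the chunk's mean next-token loss."""
    pairs = list(zip(chunk[:-1], chunk[1:]))
    vocab = len(weights)
    grad = [[0.0] * vocab for _ in range(vocab)]
    for prev, nxt in pairs:
        probs = softmax(weights[prev])
        for j in range(vocab):
            # Gradient of cross-entropy w.r.t. logits: probs - one_hot(target).
            target = 1.0 if j == nxt else 0.0
            grad[prev][j] += (probs[j] - target) / len(pairs)
    for i in range(vocab):
        for j in range(vocab):
            weights[i][j] -= lr * grad[i][j]

def read_stream(tokens, vocab, chunk_size, lr, adapt):
    """Score each chunk, then (optionally) train on it before moving on.

    The only state carried forward is the vocab x vocab weight table,
    so memory does not grow with the length of the stream.
    """
    weights = [[0.0] * vocab for _ in range(vocab)]
    losses = []
    for start in range(0, len(tokens) - 1, chunk_size):
        chunk = tokens[start:start + chunk_size + 1]
        losses.append(chunk_loss(weights, chunk))  # loss before seeing the chunk
        if adapt:
            sgd_step(weights, chunk, lr)
    return losses

# A toy "long document": a repeating pattern the model has never seen.
tokens = [0, 1, 2, 3] * 64
frozen = read_stream(tokens, vocab=4, chunk_size=16, lr=1.0, adapt=False)
adapted = read_stream(tokens, vocab=4, chunk_size=16, lr=1.0, adapt=True)

print("frozen  first/last chunk loss:", round(frozen[0], 3), round(frozen[-1], 3))
print("adapted first/last chunk loss:", round(adapted[0], 3), round(adapted[-1], 3))
```

In this toy, both runs start from the same loss on the first chunk, and only the adapting run gets better at predicting later chunks. In the method as the article describes it, the updates happen inside a transformer that was prepared for them by a meta-learning stage, which this sketch does not attempt to reproduce.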
