NVIDIA, Sakana unveil TwELL speedup
- NVIDIA and Sakana AI disclosed TwELL, a new sparse-matrix format plus CUDA kernels that make sparse transformer layers run efficiently on GPUs.
- The key claim is practical, not theoretical: over 99% sparsity in feedforward layers with 20%+ inference speedups, plus lower memory and energy use.
- That matters because sparse LLMs have looked promising for years, but irregular data layouts kept GPUs from turning fewer operations into real savings.

Large language models waste a lot of work. Huge chunks of their feedforward layers barely matter for any given token, but GPUs still like to process those layers as if every weight is equally active. That mismatch has been the annoying part of sparse AI for years: the math says you should save compute, but the hardware often refuses to cash the check. NVIDIA and Sakana AI are arguing they finally closed enough of that gap to make sparse LLMs meaningfully faster in practice.

### What is TwELL, exactly?

TwELL is a new way to pack sparse weights in memory so NVIDIA GPUs can process them more cleanly. The paper pairs that format with custom CUDA kernels tuned for modern GPU execution paths, instead of treating sparsity like an awkward add-on. Basically, the trick is not just “remove weights.” The trick is “remove weights in a layout the GPU can still chew through efficiently.” (arxiv.org)

### Why was sparsity the hard part?

Because GPUs love dense, regular blocks of numbers. Sparse models produce irregular patterns: lots of zeros, but scattered in messy ways. In theory, skipping zeros should save work. In practice, the bookkeeping can become its own tax: extra indexing, bad memory access, and kernels that stall instead of flying. TwELL is meant to reduce that tax by reshaping sparse computation around the hardware rather than asking the hardware to tolerate chaos.

### Where in the model does this help?

This work targets the feedforward layers of transformer LLMs, not the attention mechanism. That matters because those feedforward blocks hold most of the parameters and a big share of execution FLOPs in autoregressive models. So if you can make those blocks sparse without wrecking quality, you hit one of the fattest parts of the inference bill. (arxiv.org)

### How big are the claimed gains?

The headline numbers are pretty aggressive.
The authors say simple L1 regularization can push feedforward layers past 99% sparsity with negligible downstream quality loss, and that pairing those sparse models with TwELL kernels yields 20%+ throughput gains, along with better energy efficiency and lower memory use. The important nuance is that these are system-level gains from the combination of sparse training choices plus hardware-aware kernels, not just a magic file format dropped onto any model.

### Why does NVIDIA matter here?

Because this is not just an algorithm paper. It is a kernel paper. NVIDIA’s side of the collaboration shows up in the CUDA implementation and in tuning for H100-class hardware. The public codebase says the custom kernels are designed for H100 GPUs and uses CUDA 12.8+ tooling, which tells you the target is real datacenter deployment, not a toy benchmark on consumer cards. (arxiv.org)

### Is this already an ICML 2026 result?

Not exactly in the way the initial chatter suggests. The paper has been on arXiv since March 24, 2026, and the code is already public on GitHub. ICML 2026 itself is scheduled for July 6 to 11 in Seoul, so this looks more like a current research release that may circulate around the conference season, not something “presented at ICML today.” (github.com)

### What is the real takeaway?

The interesting part is not that sparse LLMs exist. We already knew that. The interesting part is that NVIDIA and Sakana are trying to make sparsity operational, something you can actually deploy for faster inference instead of admire in a paper. If the 20%+ speedup holds up outside their setup, that is real money for anyone serving large models at scale. (arxiv.org)

### Bottom line

TwELL looks like a serious attempt to turn “most weights don’t matter” into an actual systems win. The catch is that it depends on the whole stack lining up: training recipe, sparse layout, kernels, and GPU target. But that is also why this one matters.
It is less a new theory of AI, and more a blueprint for making sparse transformers finally pay rent. (arxiv.org)
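
### What does the bookkeeping tax look like in code?

To make the "extra indexing" point above concrete, here is a minimal sketch in plain Python. It uses CSR (compressed sparse row), a standard textbook sparse format, as a stand-in. It is not TwELL, whose actual layout this article does not spell out, and the function names (`to_csr`, `csr_matvec`, `work_counts`) and the toy matrix are hypothetical, chosen only for illustration.

```python
def to_csr(dense):
    """Pack a dense matrix (list of rows) into CSR: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = W @ x, touching only the stored nonzeros."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # Indirect load: the column index decides which x entry to read.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def work_counts(dense):
    """Return (dense multiply-adds, sparse multiply-adds, index entries stored)."""
    values, col_idx, row_ptr = to_csr(dense)
    dense_macs = sum(len(row) for row in dense)
    return dense_macs, len(values), len(col_idx) + len(row_ptr)


# Toy 3x4 weight matrix with 9 of 12 entries zero (illustrative values only).
W = [
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = to_csr(W)
y = csr_matvec(values, col_idx, row_ptr, x)
dense_macs, sparse_macs, index_entries = work_counts(W)
print(y, dense_macs, sparse_macs, index_entries)
```

In this toy case the multiply-adds drop from 12 to 3, but the format stores 7 index entries to locate those 3 values, and every multiply reads `x` through an indirect index. On a GPU, that kind of scattered, data-dependent access is the sort of overhead that can eat the savings from skipping zeros, which is the gap a hardware-aware layout is meant to close.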