Sakana AI speeds H100 inference

- Sakana AI and NVIDIA published TwELL, a sparse LLM kernel system for H100 GPUs, and said it turns feedforward-layer sparsity into real speed.
- The headline result is about 20.5% faster inference and 21.9% faster training, with over 99% sparsity induced in feedforward activations.
- If those gains hold in production, sparse models stop being a research curiosity and start looking like real infrastructure leverage.

Large language model inference is mostly a GPU plumbing problem now. The math is huge, the memory traffic is brutal, and the annoying part is that even when models naturally do less work, the hardware often doesn’t get faster. That is the gap Sakana AI and NVIDIA are trying to close with TwELL, a new sparse data format plus custom CUDA kernels for NVIDIA H100s, published in May 2026. The pitch is simple: don’t ask the GPU to tolerate messy sparsity. Reshape the sparsity so the GPU can still run the way it wants to.

### What is the bottleneck here?

In transformer LLMs, the feedforward layers are the expensive middle of the sandwich. They hold most of the parameters and eat most of the FLOPs, so they are the obvious place to hunt for savings. Sakana and NVIDIA focus there, not on attention, because that is where the models already show a lot of activation sparsity: many neurons stay silent for a given token. (sakana.ai)

### Why doesn’t sparsity already make GPUs faster?

Because GPUs love regularity. Dense matrix multiplies line up neatly with tensor cores, tiled kernels, and predictable memory access. Unstructured sparsity breaks that rhythm: you save arithmetic on paper, then lose the savings to scatter-gather overhead, branching, and ugly memory movement. Basically, the model does less math, but the chip has to work harder to find the math that remains. (arxiv.org)

### So what is TwELL actually doing?

TwELL stands for Tile-wise ELLPACK. The key idea is to pack sparse values in a way that still fits the tiled matrix-multiply structure modern NVIDIA GPUs expect. Sakana describes it as a hybrid path: roughly 99% of highly sparse tokens go through a fast sparse route, while the rare “heavy” tokens fall back to a dense backup path. That keeps the common case fast without letting outliers wreck utilization. (sakana.ai)

### What changed this week?

The research moved from a paper idea to a clearer implementation story. Sakana AI published the TwELL blog on May 9, 2026, tied to an ICML 2026 paper and open-source code release, and framed it as a direct collaboration with NVIDIA. The paper says the team built custom CUDA kernels that fuse sparse matrix multiplies and compress activations into a hybrid representation to cut throughput losses and memory overhead. (sakana.ai)

### How big are the gains?

The headline is a little over 20% for both inference and training. Sakana says the kernels delivered more than 20% speedups in inference and training at billion-parameter scale, and the paper ties those gains to very high sparsity (over 99% in feedforward activations) induced with simple L1 regularization, while keeping downstream quality largely intact. The same setup also reduced peak memory use and improved energy efficiency. (sakana.ai)

### Why does H100 matter so much?

Because H100 is still one of the reference GPUs for serious model serving and training. If a method works with Hopper-style tiled execution instead of fighting it, that matters more than a clever sparsity paper that only wins in theory. TwELL is interesting precisely because it is architecture-aware: it tries to fit sparse workloads into the execution model H100 already optimizes well. (sakana.ai)

### What’s the catch?

This is still sparse-model infrastructure, not a drop-in speed button for every dense model in production. You need a model and training recipe that actually produce the right sparsity pattern, and you need the kernel path integrated into real serving stacks. The paper and blog make the case that the accuracy hit can be small, but production teams will still care about tail latency, kernel maturity, tooling, and whether the gains survive outside benchmark conditions. (developer.nvidia.com)

### Bottom line

The important part is not just “20% faster.” It is that Sakana AI and NVIDIA are arguing sparse LLMs can finally run in a way GPUs reward instead of punish. If that holds up, H100-era inference economics get a little less brutal, and a lot more interesting. (sakana.ai) (arxiv.org)
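
For readers who want a concrete picture of the hybrid path described under “So what is TwELL actually doing?”, here is a minimal pure-Python sketch of the general idea: pack each token's nonzero activations tile by tile into fixed-width ELLPACK-style slots, and send any token that overflows a tile to a dense fallback. This is an illustration under stated assumptions, not the TwELL format or the CUDA kernels from the paper. The tile size, the slot width, and every function name here (`pack_row`, `sparse_matvec`, `dense_matvec`, `hybrid_project`) are hypothetical.

```python
# Illustrative sketch only: NOT the TwELL layout or kernels from the paper.
# TILE, WIDTH, and all function names are hypothetical choices for this example.

TILE = 4   # columns covered by one tile (assumed)
WIDTH = 2  # nonzero slots stored per row per tile (assumed)


def pack_row(row):
    """Pack one activation row tile by tile; return None if any tile overflows."""
    packed = []
    for start in range(0, len(row), TILE):
        tile = row[start:start + TILE]
        entries = [(start + j, v) for j, v in enumerate(tile) if v != 0.0]
        if len(entries) > WIDTH:
            return None  # "heavy" row: the caller uses the dense path instead
        # Pad with (-1, 0.0) so every tile has the same regular shape.
        entries += [(-1, 0.0)] * (WIDTH - len(entries))
        packed.append(entries)
    return packed


def sparse_matvec(packed, weights, out_dim):
    """Multiply a packed row by the weight matrix, touching only stored nonzeros."""
    out = [0.0] * out_dim
    for tile in packed:
        for col, val in tile:
            if col < 0:  # padding slot
                continue
            for k in range(out_dim):
                out[k] += val * weights[col][k]
    return out


def dense_matvec(row, weights, out_dim):
    """Plain dense row-times-matrix product, used as the fallback."""
    out = [0.0] * out_dim
    for col, val in enumerate(row):
        for k in range(out_dim):
            out[k] += val * weights[col][k]
    return out


def hybrid_project(activations, weights):
    """Route each row to the sparse or dense path and record which one ran."""
    out_dim = len(weights[0])
    outputs, routes = [], []
    for row in activations:
        packed = pack_row(row)
        if packed is None:
            outputs.append(dense_matvec(row, weights, out_dim))
            routes.append("dense")
        else:
            outputs.append(sparse_matvec(packed, weights, out_dim))
            routes.append("sparse")
    return outputs, routes


# Two toy "tokens": the first is sparse, the second has a crowded first tile.
acts = [
    [0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
weights = [[float(i), 1.0] for i in range(8)]  # 8 inputs, 2 outputs
outputs, routes = hybrid_project(acts, weights)
print(routes)  # ['sparse', 'dense']
```

The fixed slot width is what keeps the sparse path regular: every tile has the same shape, so a real kernel could map it onto tiled execution without per-row branching, at the cost of some padding. Both paths produce the same result as a dense multiply; only the amount of work differs.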
