SwiGLU kernel cuts memory traffic
What happened
- Subhadip Mitra’s TritonMoE work described a fused SwiGLU gate-and-up GEMM that computes both projections together and cuts global memory traffic by 35%.
- The paper reported 89% to 131% of Megablocks throughput on an NVIDIA A100 at inference batch sizes up to 512 tokens.
- Code is available in the bassrehab/triton-kernels repository, with portability tests reported on NVIDIA A100 and AMD MI300X.
Why it matters
A new optimization making the rounds in inference circles is not a new model, but a new way to execute one of the most common feed-forward blocks inside transformer systems. In a recent TritonMoE paper and accompanying code release, Subhadip Mitra described a fused gate+up GEMM for SwiGLU that computes both projections from the same input tile, applies SiLU in registers, and avoids writing intermediate results back out to high-bandwidth memory (HBM). The paper says that change eliminates 35% of global memory traffic in that stage of execution and helps the kernel stay closer to the bandwidth limits that dominate large-model inference.

### Why does SwiGLU create so much memory traffic in the first place?

SwiGLU shows up in many modern transformer feed-forward networks because it replaces a plain two-layer MLP with a gated path: one projection produces the “up” values, another produces the “gate,” and the two are multiplied elementwise after SiLU is applied to the gate. In a straightforward implementation, that means reading the same input activations for separate matrix multiplies, materializing intermediates, and then reading them again for the elementwise activation-and-multiply step. (arxiv.org) A prior Bitdefender AI Research write-up on fused gated MLP kernels described this pattern as memory-inefficient and argued that fusing the gating path can sharply reduce memory use.

### What exactly is being fused here?

The TritonMoE paper says the kernel fuses the gate and up GEMMs so both SwiGLU projections are computed from shared L2-cached input tiles rather than from separate reloads of the same data. It also keeps the SiLU activation in registers, which removes another trip through global memory before the gated product is formed. In plain terms, the kernel is trying to keep more of the work on-chip, in cache and registers, instead of bouncing partial results out to HBM and back.
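The fusion idea can be illustrated with a minimal pure-Python sketch. This is not the Triton kernel from the repository: the function names `swiglu_reference` and `swiglu_fused_sketch` are hypothetical, the matrices are plain nested lists, and nothing here models tiling, L2 caching, or the 35% figure. It only shows the structural difference: the reference path materializes both projections before the elementwise step, while the fused path loads each input value once, feeds two accumulators, and writes only the gated product.

```python
import math


def silu(v):
    # SiLU(v) = v * sigmoid(v); fine for the small values used in this sketch.
    return v / (1.0 + math.exp(-v))


def swiglu_reference(x, w_gate, w_up):
    """Unfused path: two separate matmuls, then an elementwise pass."""
    rows, inner, cols = len(x), len(w_gate), len(w_gate[0])
    # Both intermediates are fully materialized (on a GPU: written to HBM).
    gate = [[sum(x[i][k] * w_gate[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]
    up = [[sum(x[i][k] * w_up[k][j] for k in range(inner))
           for j in range(cols)] for i in range(rows)]
    # A third pass re-reads both intermediates to form the gated product.
    return [[silu(gate[i][j]) * up[i][j] for j in range(cols)]
            for i in range(rows)]


def swiglu_fused_sketch(x, w_gate, w_up):
    """Fused path: one pass over x feeds both projections."""
    rows, inner, cols = len(x), len(w_gate), len(w_gate[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            gate_acc = 0.0  # stands in for a register accumulator
            up_acc = 0.0
            for k in range(inner):
                xv = x[i][k]  # single load shared by both projections
                gate_acc += xv * w_gate[k][j]
                up_acc += xv * w_up[k][j]
            # SiLU and the multiply happen before anything is stored.
            out[i][j] = silu(gate_acc) * up_acc
    return out
```

Both functions should agree to within floating-point rounding on the same inputs; the difference lies only in how many intermediate values are stored and re-read along the way.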
(arxiv.org)

The GitHub repository for `bassrehab/triton-kernels` presents the same design goal more generally: large-language-model inference is often memory-bandwidth bound, so custom kernels try to reduce memory round-trips and improve bandwidth utilization by fusing operations. The repository lists a `swiglu_fused` kernel and a fused MoE forward path among its optimized components. (arxiv.org)

### Why would a 35% traffic cut matter more than a small math tweak?

The repository’s README says a 7B-parameter FP16 model requires loading about 14 GB of weights for each forward pass, and on an A100 that memory movement can dominate runtime while the arithmetic itself is comparatively brief. That is why kernel engineers focus so heavily on bandwidth: if a block is memory-bound, reducing bytes moved can matter more than reducing floating-point operations. (github.com)

The TritonMoE paper ties that directly to MoE serving. It says naive MoE inference launches separate gate, up and down projection kernels for each expert, producing many small launches and repeated memory movement. For Mixtral with eight experts, the paper says that is 24 kernel launches per layer; for DeepSeek-V3 with 256 experts, it says that rises to 768. (github.com)

### Did the authors show an end-to-end gain, not just a kernel trick?

On NVIDIA A100 tests, the paper says TritonMoE reached 89% to 131% of the throughput of CUDA-optimized Megablocks at inference batch sizes of 512 tokens or less across Mixtral-8x7B, DeepSeek-V3 and Qwen2-MoE configurations. The authors also reported that 162 correctness tests passed on both NVIDIA A100 and AMD MI300X with zero code changes, framing the work as a portability play as well as a speed optimization. (arxiv.org)

That does not mean every serving stack gets a 35% end-to-end speedup. The paper’s claim is narrower: the fused gate+up GEMM removes 35% of global memory traffic for that part of the pipeline.
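To see why a 35% traffic cut in one stage is not a 35% end-to-end speedup, an Amdahl-style estimate helps. The sketch below rests on two loudly stated assumptions that do not come from the paper: the affected stage is fully memory-bound, so its time scales with bytes moved, and the share of total runtime spent in that stage (`stage_fraction`) is a made-up input for illustration. The function name `end_to_end_speedup` is likewise hypothetical.

```python
def end_to_end_speedup(stage_fraction, traffic_cut):
    """Estimate overall speedup when one memory-bound stage moves fewer bytes.

    stage_fraction: share of total runtime spent in the affected stage (0..1).
                    Hypothetical input, not a figure from the paper.
    traffic_cut:    fraction of that stage's memory traffic removed (0..1).
    Assumes the stage's time is proportional to the bytes it moves.
    """
    if not 0.0 <= stage_fraction <= 1.0:
        raise ValueError("stage_fraction must be between 0 and 1")
    if not 0.0 <= traffic_cut < 1.0:
        raise ValueError("traffic_cut must be in [0, 1)")
    # Time outside the stage is unchanged; time inside shrinks with traffic.
    new_time = (1.0 - stage_fraction) + stage_fraction * (1.0 - traffic_cut)
    return 1.0 / new_time


# Illustrative fractions only: how the same 35% cut plays out end to end.
for fraction in (0.1, 0.3, 0.6):
    speedup = end_to_end_speedup(fraction, 0.35)
    print(f"{fraction:.0%} of runtime in stage -> {speedup:.3f}x overall")
```

Under these assumptions, a stage that takes 10% of runtime yields roughly a 1.04x overall gain, while one that takes 60% yields roughly 1.27x. Real kernels are rarely perfectly memory-bound, so this should be read as a rough ceiling, not a prediction.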
The system-level gain depends on how much of total inference time is spent in SwiGLU and MoE expert execution, plus how well the serving framework can integrate the kernel. That is an inference from the paper’s benchmark framing and repository description, not a separate claim by the authors. (arxiv.org)

### Where would this show up next?

The code is already public in the `bassrehab/triton-kernels` repository, and the paper positions it as part of a portable Triton-based MoE inference stack rather than a one-off benchmark artifact. If serving frameworks pick it up, the likely target is the transformer feed-forward path in SwiGLU-based and MoE-heavy models where bandwidth pressure is already the bottleneck. (arxiv.org)

For operators, the practical watchpoint is not a model announcement but kernel adoption: whether Triton-based inference stacks, custom MoE runtimes, or broader serving systems start using fused gate+up execution in production builds. The paper and repository are the places to track that next step. (arxiv.org)