Alibaba Qwen ships FlashQLA kernel
- Alibaba’s Qwen team open-sourced FlashQLA on April 29, a TileLang kernel library for Gated DeltaNet linear attention, aimed at faster Hopper inference and training. - The headline claim is 2–3× faster forward passes and roughly 2× faster backward passes versus FLA’s Triton kernels on NVIDIA Hopper GPUs. - It matters because linear attention only wins if kernels are fast too — and Qwen is trying to close that systems gap.
GPU kernels are the unglamorous part of AI that decides whether a model feels fast or sluggish. That matters even more for newer attention designs, where the math can look cheaper on paper but still lose in practice if the low-level code is weak. That is the gap Alibaba’s Qwen team is trying to close with FlashQLA — a new open-source kernel library released on April 29 for Gated DeltaNet linear attention on NVIDIA Hopper GPUs. (github.com) ### What did Qwen actually ship? FlashQLA is a kernel library built on TileLang, the DSL and compiler stack for writing high-performance accelerator code. Qwen published it as an open-source GitHub repo under MIT license, and the project is specifically framed around speeding up the forward and backward passes of GDN Chunked Prefill — the linear-attention path used in Qwen’s newer model work. (github.com) ### What problem is it solving? The short version is that “linear attention” is not automatically fast. Standard softmax attention gets expensive as sequence length grows, so model builders keep looking for alternatives with better scaling. But if the replacement attention mechanism depends on immature kernels, the theoretical gain leaks away. FlashQLA is Qwen saying: the algorithm is not en(github.com) exploit the GPU properly. (github.com) ### Why Gated DeltaNet? Because that is the specific attention family Qwen is optimizing around. The repo says FlashQLA targets Gated DeltaNet, or GDN, and more narrowly the “Chunked Prefill” path. That makes this less of a general-purpose attention drop-in and more of a sharp tool for one architecture choice. Basically, Qwen is optimizing the part of the stack it already cares about, rather(github.com)kload. (github.com) ### How big are the speedups? Qwen’s headline numbers are straightforward: 2–3× forward speedup and about 2× backward speedup over the FLA Triton kernel across multiple Hopper scenarios. The repo also says the gains show up most strongly in pretraining and edge-side agentic inference. Those are meaningful numbers if they hold in real deployments, because attention kernels sit directly on the latency path. (github.com) ### Where is the speed coming from? Qwen points to two main tricks. One is gate-driven automatic intra-card context parallelism. The other is warp-specialized pipeline design, which is about keeping different parts of the GPU busy instead of stalling them. Think of it like reorganizing a kitchen so prep, cooking, and plating happen in parallel instead of everyone waiting on the same counter. (github.com)r gains” become real throughput. (github.com) ### Why does Hopper matter here? Because FlashQLA is not being pitched as universally portable. The repo benchmarks and claims are centered on NVIDIA Hopper, and outside writeups describe the current target as Hopper-specific. That means the news is strongest for H100 and H200-class deployments, not for every inference fleet. The catch is that kernel wins can be architecture-sensitive — grea(github.com). (github.com) ### Does this matter beyond Qwen? Yes — but in a narrow, systems-level way. If more model families adopt linear-attention variants, they will need the same kind of kernel work to make those designs commercially useful. FlashQLA is one more sign that AI competition is shifting downward into infrastructure details: compilers, kernels, memory movement, and scheduling. The model architecture sti(github.com)singly where the practical advantage shows up. (github.com) ### Bottom line? FlashQLA is not a flashy new model. It is the plumbing underneath one. But that plumbing is exactly what decides whether a clever attention mechanism becomes a real product advantage. Qwen’s release matters because it turns a niche architectural bet — Gated DeltaNet on Hopper — into something faster, more open, and easier for other builders to test. (github.com)