Local kernel trick speeds Qwen inference
A developer fused every layer of Qwen 3.5-0.8B into a single CUDA kernel and reported 411 tokens/sec on a 2020 RTX 3090, about 1.55× as fast as llama.cpp on the same card and ahead of an Apple M5 Max at 229 tokens/sec. (x.com) The result highlights how careful, hardware-specific software engineering can materially close performance gaps on older GPUs. (x.com)
Running a language model is often a traffic problem as much as a math problem, and one developer says they cut that traffic enough to push Qwen 3.5-0.8B to about 411 tokens a second on an Nvidia GeForce RTX 3090 from 2020. (github.com)

The project, published this week as Luce Megakernel, says it fused all 24 layers of Qwen 3.5-0.8B into one persistent CUDA dispatch instead of launching many small graphics-processor jobs for each token. The repository reports 413 tokens a second at stock settings and 411 tokens a second with the card power-limited to 220 watts. (github.com)

A token is a small chunk of text, and inference is the step where a trained model predicts the next chunk over and over. In plain terms, faster inference means a chatbot can stream words sooner and with less wasted hardware time. (github.com)

The bottleneck here is not just raw chip speed. The Luce write-up says a normal setup can spend each token bouncing between the central processor and the graphics processor across roughly 100 kernel launches, with each layer boundary forcing another dispatch, memory fetch, and synchronization. (github.com)

That matters now because Qwen 3.5 is new. Alibaba’s Qwen team released the Qwen 3.5 family on February 16, 2026, then added the 0.8 billion, 2 billion, 4 billion, and 9 billion parameter models on March 2, 2026. (qwen.ai, github.com)

It also matters because Qwen 3.5 does not use a plain transformer stack. Qwen says the family uses a hybrid design that mixes linear attention through Gated Delta Networks with standard attention, and the Luce developer argues no fused kernel had been built for that exact pattern before. (qwen.ai, github.com)

The headline comparison is against generic local inference software. The Luce repository lists llama.cpp at 267 tokens a second on the same RTX 3090 for decode, which would make the fused kernel about 1.55 times as fast on that test. (github.com, github.com)

The project also compares itself with Apple’s newest high-end laptop chip. Luce lists an Apple M5 Max result of 229 tokens a second, while Apple introduced MacBook Pro models with M5 Max on March 3, 2026, and pitched them for on-device artificial intelligence workloads. (github.com, apple.com)

The more provocative claim is about efficiency, not just speed. Luce says the RTX 3090 reached 1.87 tokens per joule at a 220-watt limit, versus 0.76 tokens per joule for llama.cpp on the same card and 1.76 tokens per joule for the M5 Max result it cites. (github.com)

That does not prove every Nvidia card suddenly beats every Apple laptop, because these are single-project benchmarks on one small model under specific settings. But it does show how much performance can hinge on software that is tuned for one architecture instead of a general-purpose runtime. (github.com, github.com)

There is already a next step in that line of work. A February 9, 2026, post by Alpin Dale said a descendant of Elliot Arledge’s MegaQwen work had pushed a related Qwen 3-0.6B kernel to about 1,000 tokens a second on a GeForce RTX 5090, again by shaving microseconds from launch overhead and memory movement. (blog.alpindale.net)

The bigger takeaway is less about one benchmark than about an old card getting new life. If a 2020 RTX 3090 can close much of the gap through a single carefully fused kernel, local model performance may depend as much on who writes the runtime as on who fabbed the chip. (github.com)
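The speed and efficiency ratios quoted above can be rechecked with a few lines of arithmetic. The Python sketch below uses only the figures cited in this article. The function names are illustrative, and the tokens-per-joule calculation assumes the card drew its full 220-watt limit for the whole run, which the repository figures do not confirm. The implied power draws for llama.cpp and the M5 Max are back-calculated from rounded numbers, so treat them as rough estimates rather than reported measurements.

```python
# Figures quoted in the article (decode throughput, tokens per second).
LUCE_STOCK_TPS = 413.0      # Luce Megakernel, RTX 3090 at stock settings
LUCE_220W_TPS = 411.0       # Luce Megakernel, RTX 3090 limited to 220 W
LLAMA_CPP_TPS = 267.0       # llama.cpp on the same RTX 3090
M5_MAX_TPS = 229.0          # Apple M5 Max result cited by Luce

POWER_LIMIT_W = 220.0       # assumption: the card draws the full limit
LLAMA_CPP_TOK_PER_J = 0.76  # efficiency figures as cited (already rounded)
M5_MAX_TOK_PER_J = 1.76


def speedup(new_tps: float, base_tps: float) -> float:
    """Throughput ratio: how many times as fast the new result is."""
    return new_tps / base_tps


def tokens_per_joule(tps: float, watts: float) -> float:
    """(tokens / second) / (joules / second) = tokens / joule."""
    return tps / watts


def implied_watts(tps: float, tok_per_joule: float) -> float:
    """Average power implied by a throughput and an efficiency figure."""
    return tps / tok_per_joule


if __name__ == "__main__":
    print(f"stock vs llama.cpp: {speedup(LUCE_STOCK_TPS, LLAMA_CPP_TPS):.2f}x")
    print(f"220 W vs llama.cpp: {speedup(LUCE_220W_TPS, LLAMA_CPP_TPS):.2f}x")
    print(f"220 W vs M5 Max:    {speedup(LUCE_220W_TPS, M5_MAX_TPS):.2f}x")
    print(f"tokens per joule at 220 W: "
          f"{tokens_per_joule(LUCE_220W_TPS, POWER_LIMIT_W):.2f}")
    print(f"implied llama.cpp draw: "
          f"{implied_watts(LLAMA_CPP_TPS, LLAMA_CPP_TOK_PER_J):.0f} W")
    print(f"implied M5 Max draw:    "
          f"{implied_watts(M5_MAX_TPS, M5_MAX_TOK_PER_J):.0f} W")
```

On these numbers, the "about 1.55 times" figure matches the 413 tokens-a-second stock result most closely (the 220-watt run works out to roughly 1.54), and 411 tokens a second at 220 watts reproduces the quoted 1.87 tokens per joule.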