Moonshot open-sources FlashKDA

- Moonshot AI open-sourced FlashKDA, a kernel that speeds up prefill, the stage in which a model processes its input prompt, on GPUs.
- Its benchmarks show 1.72×–2.22× faster prefill on H20 GPU configurations.
- The code and performance claims were published on April 21, 2026, pitching faster prefill as an inference efficiency gain. (x.com)

Large language models spend one phase reading the prompt before they answer, and Moonshot AI has now open-sourced code it says makes that reading step faster on Nvidia Hopper-class GPUs. (github.com)

Moonshot published FlashKDA on GitHub on April 21, 2026, describing it as “high-performance Kimi Delta Attention” built on CUTLASS, Nvidia’s CUDA template library for matrix math. The repository lists support for SM90 or newer GPUs, CUDA 12.9 or later, and PyTorch 2.4 or later. (github.com)

The company’s benchmark file says FlashKDA cut forward-pass time on an H20 setup from 4.5052 milliseconds to 2.6219 milliseconds in one 8,192-token, 96-head test, a 1.72× speedup. In two variable-length tests with the same token length and 96 heads, Moonshot reported 1.95× and 2.22× gains over the flash-linear-attention baseline. (github.com)

In plain terms, prefill is the stage where a model turns your whole prompt into internal memory before it starts generating the first token. Moonshot’s own recent paper describes prefill as compute-intensive, while the later decode stage is constrained more by memory bandwidth. (arxiv.org)

That split has become important because model serving stacks increasingly separate prefill from decode and place them on different hardware. Moonshot’s April 2026 “Prefill-as-a-Service” paper says that architecture is now standard in large-scale serving and argues that cheaper prefill can improve throughput in heterogeneous deployments. (arxiv.org)

FlashKDA is not a general speedup for every transformer attention path. The code is a kernel for Kimi Delta Attention, Moonshot’s hybrid attention mechanism, and the README says it can be used as a backend for the flash-linear-attention project rather than as a drop-in replacement for all models. (github.com)

Moonshot’s engineering note says the kernel gets part of its speed from splitting the work into two GPU kernels instead of one. The company wrote that separating the token-parallel and recurrence-heavy parts yielded at least a 15% end-to-end speedup in its internal testing. (github.com)

The same note says FlashKDA uses chunks of 16 tokens instead of 64, partly to keep values inside bfloat16 range and partly because a 16-by-16 matrix inversion is cheaper than a 64-by-64 one. Moonshot also says storing recurrent state in bfloat16 cut shared-memory use roughly in half without measurable accuracy loss in its inference benchmarks. (github.com)

The release is early-stage code rather than a packaged product. As of April 21, the repository showed two commits, no listed releases, and benchmark results generated from Moonshot’s own script on H20 hardware. (github.com)

For model operators already using Moonshot’s Kimi Delta Attention stack, the pitch is simple: spend less GPU time digesting long prompts before generation starts. The next test is whether outside developers reproduce the H20 numbers on their own workloads. (github.com)
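
For readers who want to check the arithmetic behind the reported figures, here is a minimal Python sketch. The function names are illustrative and do not come from Moonshot's repository, and the cubic cost model for matrix inversion is a textbook assumption, not a measurement of FlashKDA's kernels.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline time to optimized time (higher means faster)."""
    return baseline_ms / optimized_ms


def inversion_cost_ratio(span: int, chunk: int) -> float:
    """Cost of one span-by-span inversion relative to span/chunk separate
    chunk-by-chunk inversions, ASSUMING cost grows as n**3."""
    if span % chunk != 0:
        raise ValueError("span must be a multiple of chunk")
    one_large = span ** 3
    many_small = (span // chunk) * chunk ** 3
    return one_large / many_small


if __name__ == "__main__":
    # Timings quoted from the fixed-length H20 benchmark in the article.
    s = speedup(4.5052, 2.6219)
    print(f"fixed-length speedup: {s:.2f}x")  # prints "fixed-length speedup: 1.72x"
    print(f"share of forward-pass time saved: {1 - 1 / s:.1%}")
    # Chunk sizes quoted from the engineering note; the cost model is assumed.
    print(f"64 vs 16 inversion cost ratio: {inversion_cost_ratio(64, 16):.0f}x")
```

Under that assumed cubic model, covering 64 tokens with four 16-token chunks costs one-sixteenth of a single 64-by-64 inversion. Real kernel cost also depends on memory traffic and on the structure of the matrices involved, so the ratio is a rough guide only.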
