LightSeek open-sources TokenSpeed, a TensorRT-LLM rival
- LightSeek Foundation released TokenSpeed on GitHub under MIT, pitching a new open inference engine that targets TensorRT-LLM speed without giving up vLLM-style APIs.
- The repo says TokenSpeed is a preview built for agentic workloads, with a static compiler, KV-cache safety checks, and pluggable kernels.
- That matters because fast inference has been drifting toward vendor-tuned stacks; TokenSpeed tries to make that performance layer portable and hackable again.

Inference engines are the layer that turns a model checkpoint into something you can actually serve. They decide latency, throughput, memory use, and usually how painful deployment feels. For the last year, the tradeoff has been annoying: easy stacks like vLLM on one side, heavily optimized stacks like TensorRT-LLM on the other. LightSeek’s new TokenSpeed project is trying to collapse that split by open-sourcing a runtime that explicitly aims for TensorRT-LLM-class performance with vLLM-like usability. (github.com)

### What actually shipped?

LightSeek Foundation put TokenSpeed on GitHub as an MIT-licensed open-source project. The repo describes it as an inference engine “designed for agentic workloads,” and the codebase is already split into major subsystems: a modeling layer, scheduler, kernel layer, MLA components, docs, and Python bindings. But the catch is right in the README: this is a preview release, not something LightSeek wants you dropping into production this week. (github.com)

### Why compare it to TensorRT-LLM?

Because TensorRT-LLM is the obvious benchmark if you care about squeezing maximum speed out of NVIDIA hardware. NVIDIA’s own DGX Spark materials frame TensorRT-LLM as the optimized path for lower latency and higher throughput through custom kernels, memory layouts, quantization, and parallelism strategies. So when TokenSpeed says “TensorRT-LLM-level performance,” it is not making [...] stack in this category. (build.nvidia.com)

### What does “vLLM-level usability” mean here?

Basically, don’t make users hand-wire the ugly parts. TokenSpeed’s README says its modeling layer uses a local-SPMD design plus a static compiler that generates collective communication from placement annotations, instead of forcing users to write parallelism logic themselves.
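
To make the placement-annotation idea concrete, here is a minimal sketch in plain Python. It is not TokenSpeed's API: `Placement`, `collective_for`, and `plan_collectives` are hypothetical names invented for illustration, and the rules are the generic ones tensor-parallel systems tend to use (a partial sum that must become a full copy needs an all-reduce, a sharded tensor that must become a full copy needs an all-gather, and so on). The point is only that once each tensor carries a layout annotation, the communication step can be derived mechanically instead of being hand-written.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Placement:
    """How one tensor is laid out across devices (hypothetical annotation)."""
    kind: str                  # "replicate", "shard", or "partial"
    dim: Optional[int] = None  # sharded tensor dimension; only used by "shard"


REPLICATE = Placement("replicate")  # every device holds the full tensor
PARTIAL = Placement("partial")      # every device holds an unreduced partial sum


def shard(dim: int) -> Placement:
    """Tensor is split along `dim`, one slice per device."""
    return Placement("shard", dim)


# (source kind, destination kind) -> collective name.
# None means a local slice is enough and no communication is needed.
_RULES = {
    ("partial", "replicate"): "all_reduce",
    ("partial", "shard"): "reduce_scatter",
    ("shard", "replicate"): "all_gather",
    ("shard", "shard"): "all_to_all",  # resharding along a different dim
    ("replicate", "shard"): None,
}


def collective_for(src: Placement, dst: Placement) -> Optional[str]:
    """Return the collective that converts layout `src` into layout `dst`."""
    if src == dst:
        return None  # layouts already agree
    key = (src.kind, dst.kind)
    if key not in _RULES:
        raise ValueError(f"unsupported layout change: {src} -> {dst}")
    return _RULES[key]


Step = Tuple[str, Placement, Placement]  # (op name, produced, required by consumer)


def plan_collectives(steps: List[Step]) -> List[Tuple[str, str]]:
    """Walk the annotated ops and emit only the communication that is needed."""
    plan = []
    for name, produced, required in steps:
        op = collective_for(produced, required)
        if op is not None:
            plan.append((name, op))
    return plan


# A tensor-parallel MLP, annotated rather than hand-wired:
# the up projection produces column-sharded activations that the down
# projection consumes as-is; the down projection produces partial sums
# that the next layer needs as a full replicated tensor.
mlp = [
    ("up_proj", shard(1), shard(1)),
    ("down_proj", PARTIAL, REPLICATE),
]

print(plan_collectives(mlp))
```

In this toy version the user never names a collective; the single all-reduce after `down_proj` falls out of the annotations. A real static compiler would also have to handle device meshes, overlap, and fusion, none of which is modeled here.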

It also keeps a Python execution plane, which matters because a lot of teams want to customize serv[...] the vLLM part of the pitch: OpenAI-style serving ergonomics and a friendlier developer surface, not just raw kernel heroics. (github.com)

### Why is “agentic workload” the focus?

Because agent systems are a nastier inference problem than plain chatbot demos. They keep long contexts alive, revisit KV cache over many turns, branch into tool calls, and create lots of small but latency-sensitive requests. TokenSpeed is built around that shape. The repo highlights finite-state-machine scheduling, compile-time checks around KV cache ownership and reuse, and [...] to stop the runtime from tripping over its own memory while many semi-independent tasks are flying around. (github.com)

### Is there proof beyond the pitch?

Some, but not the kind you should over-read yet. The public repo includes a performance-comparison section and says the current preview is meant to reproduce TokenSpeed blog results for Kimi K2.5 on B200 and TokenSpeed MLA on B200. At the same time, the maintainers also say major pull requests are still missing, model coverage is incomplete, and features like KV store, VLM support [...] So this is a serious technical release, but still an early one. (github.com)

### Where does vLLM fit into this story?

vLLM is still the reference point for open, practical serving. Its current docs already show support for new model families like Gemma 4, including the 26B-A4B MoE model, through an OpenAI-compatible API across NVIDIA, AMD, and TPU setups. That matters because TokenSpeed is not replacing vLLM’s role in the ecosystem so much as attacking the performance ceiling that often pushes teams toward more specialized runtimes. (docs.vllm.ai)

### So what changed for developers?

The menu got better.
Before, the implicit choice was often “easy and open” versus “fast and specialized.” TokenSpeed is an attempt to make “open and fast” a real option, especially for teams serving agent loops, coding copilots, and long-context workflows on modern accelerators. If LightSeek can turn this preview into a stable runtime, it will matter less which single vendor stack you buy into first. (github.com)

### Bottom line?

TokenSpeed is not important because it exists. Open-source inference repos appear every week. It is important because it is aiming at the hardest part of the stack, the performance layer, while keeping developer ergonomics in view. If that holds up, the center of gravity in LLM serving shifts a bit away from closed or hardware-tied runtimes and back toward open infrastructure people can actually modify. (github.com)