TensorRT‑LLM shatters inference numbers

Reported TensorRT‑LLM results show over 10,000 output tokens/sec on NVIDIA H100 using FP8, time‑to‑first‑token under 100 ms, and end-to-end throughput roughly 4x that of native PyTorch in production cases. The improvements are attributed to aggressive kernel fusion and memory-layout optimizations that lower latency and cost for high‑volume serving. (introl.com)

NVIDIA published a v1.3.0 release candidate for TensorRT‑LLM with detailed feature and API changes on the project’s GitHub releases page three days ago. (github.com) The official GitHub repository shows roughly 13.2k stars and more than 5,600 commits, with active commits recorded as recently as yesterday. (github.com)

NVIDIA’s H100 vs A100 benchmarks for TensorRT‑LLM report up to 4.6× max throughput on H100 FP8 versus A100 and note the H200 reaches nearly 12,000 output tokens/sec on Llama‑2‑13B in their measurements. (nvidia.github.io)

The v1.3.0rc release notes add explicit model/back-end support items including Nemotron 3 Super, GLM‑4 tooling, VisualGen attention backends, and new precision formats such as 2FP4/Arcquant. (github.com) Operational and serving improvements called out in the release include gRPC keepalive ping tolerance, context.abort support for the server, deprecation of certain trtllm‑serve CLI options, and a FlashInfer API for TRTLLM‑Gen fused MoE workloads. (github.com)

NVIDIA’s documentation and examples describe runtime optimizations beyond single‑kernel improvements: paged attention, chunked context (chunked prefill), multiple request schedulers, in‑flight batching, and a KVCacheManagerV2 for cache-level memory management. (nvidia.github.io) TensorRT‑LLM is available on PyPI as a packaged distribution, and recent releases add deployment targets such as Triton backend support and a beta DGX Spark integration for larger cluster workflows. (pypi.org)
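
For readers unfamiliar with paged attention and cache-level memory management, the general idea can be sketched in a few lines of plain Python. This is a toy illustration only, not TensorRT‑LLM's KVCacheManagerV2 or any part of its API: the class name `PagedKVCache`, its methods, and the block size used below are all hypothetical. It shows how a sequence's KV cache can be stored in fixed-size blocks taken from a shared pool instead of one contiguous buffer per request.

```python
from dataclasses import dataclass, field


@dataclass
class PagedKVCache:
    """Toy block allocator (hypothetical, not the TensorRT-LLM API)."""

    num_blocks: int   # total blocks in the shared pool
    block_size: int   # tokens per block (illustrative value, an assumption)
    free_blocks: list = field(init=False)
    block_tables: dict = field(default_factory=dict)  # seq_id -> block ids
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        self.free_blocks = list(range(self.num_blocks))

    def append_tokens(self, seq_id, n_tokens):
        """Reserve cache space for n_tokens more tokens of one sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + n_tokens
        blocks_needed = -(-new_len // self.block_size)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("not enough free KV cache blocks")
        for _ in range(extra):
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = new_len

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

Because blocks are only claimed as a sequence grows and are returned when it finishes, a scheduler can admit new requests while others are still decoding, which is the property in‑flight batching relies on. A real implementation also has to handle GPU memory, block reuse, and eviction, none of which this sketch attempts.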

