vLLM Update Delivers Cross-Platform GPU Performance

The vLLM project's new Triton attention backend now delivers cross-platform performance for NVIDIA, AMD, and Intel GPUs. It achieves state-of-the-art speeds on the H100 and a 5.8x speedup on AMD's MI300, a key development for building scalable inference engines that aren't locked into a single hardware vendor.

The vLLM project, which originated in UC Berkeley's Sky Computing Lab, is an open-source library designed to make large language model (LLM) inference and serving faster and more memory-efficient. It addresses a key challenge in LLM serving, where traditional methods can waste 60-80% of GPU memory, with an attention algorithm called PagedAttention. This approach has demonstrated up to 24x higher throughput than standard HuggingFace Transformers.

PagedAttention, the core innovation in vLLM, is inspired by virtual memory and paging in operating systems. It partitions the key-value (KV) cache into blocks that can be stored non-contiguously in memory. This reduces memory waste to under 4%, enabling more efficient batching of requests and significantly lowering the memory overhead of complex sampling algorithms.

The new Triton attention backend targets cross-platform performance by running the same source code on NVIDIA, AMD, and Intel GPUs. Initially developed by IBM Research and Red Hat AI, the backend is now community-maintained and serves as the default on AMD GPUs. The effort was driven by the increasing diversity of AI hardware and the high cost of maintaining specialized kernels for each platform. This hardware-agnostic approach directly challenges vendor-specific solutions such as NVIDIA's TensorRT-LLM.

vLLM has evolved into a community-driven effort with significant contributions from industry players including Red Hat, IBM, Google, and Meta, and is now governed by the Linux Foundation. This broad support underscores the industry's move towards more open and flexible AI infrastructure.

On AMD's MI300X, the architectural advantages are significant: larger memory (192GB vs. 80GB on the H100) and higher memory bandwidth (5.3 TB/s vs. 3.3 TB/s). Benchmarks have shown that the MI300X with vLLM can outperform the H100, in some cases nearly doubling request throughput at lower latency. For certain workloads, this has translated to a performance uplift of over 2x compared to an H100 running a standard vLLM suite.

vLLM V1, a major redesign of the internal architecture, was released in January 2025 to simplify the codebase and enable all performance optimizations by default. Initially, V1 supported only NVIDIA GPUs because it relied on the CUDA version of FlashAttention. The Triton-based attention backend, developed by teams from AMD, IBM Research, and Red Hat, was crucial to enabling support for AMD GPUs, and it achieved 10% higher throughput on the MI300X compared to the previous version.
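
To make the paging idea behind PagedAttention more concrete, here is a minimal Python sketch of a block table that maps each request's KV cache onto physical blocks drawn from a shared pool. It is an illustration only, not vLLM's actual code: the names BlockAllocator and Sequence, the block size of 16 tokens, and the pool of 8 blocks are all assumptions made for this example.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default)


class BlockAllocator:
    """Hypothetical pool that hands out fixed-size physical KV blocks."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted")
        return self.free_blocks.pop()

    def release(self, block_ids: list) -> None:
        self.free_blocks.extend(block_ids)


@dataclass
class Sequence:
    """Hypothetical per-request state: a logical-to-physical block table."""

    num_tokens: int = 0
    block_table: list = field(default_factory=list)

    def append_token(self, allocator: BlockAllocator) -> None:
        # A new physical block is requested only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def wasted_slots(self) -> int:
        # Only the tail of the final block can be left unused.
        return len(self.block_table) * BLOCK_SIZE - self.num_tokens


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(), Sequence()

# Interleave two requests so their physical blocks end up non-contiguous.
for _ in range(20):
    seq_a.append_token(allocator)
    seq_b.append_token(allocator)

print(seq_a.block_table, seq_b.block_table, seq_a.wasted_slots())
```

In this sketch, each request's blocks are scattered through the pool, and the only unused space is the tail of its last block. That is the intuition behind the low memory waste reported for PagedAttention, although a real implementation also has to manage GPU tensors, block sharing, and eviction.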
