Developers Run vLLM on AMD Hardware
A community of developers is exploring alternatives to Nvidia's hardware by running the vLLM inference and serving library on AMD devices. A detailed walkthrough shows how to build and run vLLM from source on an AMD Strix Halo device. Others are actively testing optimizations like Prefix Cache and FP8 quantization to improve performance on the platform.
- The effort to run vLLM on AMD hardware is part of a broader industry push to create a viable alternative to Nvidia's dominance in the AI chip market, where Nvidia holds an estimated 80-95% market share.
- AMD's software platform, ROCm (ROCm™ Open Software Platform), is the key to enabling GPU-accelerated machine learning on their hardware, analogous to Nvidia's CUDA ecosystem. Recent versions have expanded support for consumer-grade Radeon GPUs, making AI development more accessible.
- vLLM is an open-source library designed for high-throughput and memory-efficient LLM inference and serving. Its support for AMD GPUs via ROCm allows developers to leverage features like PagedAttention and continuous batching on non-Nvidia hardware.
- FP8 quantization, one of the optimizations being tested, reduces the memory footprint of the model's key-value (KV) cache by using 8-bit floating-point numbers. This allows more data to be stored in the cache, improving throughput with minimal impact on accuracy on supported hardware like the AMD MI300 series.
- Prefix caching is a technique that stores the initial part of a prompt's computation (the "prefix") to reuse it for subsequent requests that share the same beginning, which can significantly speed up inference for common workloads.
- The "Strix Halo" device is a high-performance Accelerated Processing Unit (APU) from AMD, officially named Ryzen AI Max. It combines a powerful Zen 5 CPU with a large RDNA 3.5 integrated GPU, providing substantial memory bandwidth and compute power for AI tasks in a mobile form factor.
- Benchmarks comparing AMD and Nvidia for LLM inference show a competitive landscape. While Nvidia's top-tier GPUs often lead, AMD's Instinct MI300X has shown superior performance in specific scenarios, particularly those leveraging its larger memory capacity and bandwidth. However, some tests indicate that models run on ROCm can sometimes yield less accurate results compared to the same models on CUDA.
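
The two optimizations mentioned above, an FP8 KV cache and prefix caching, can be illustrated with a small, library-free sketch. This is not vLLM's implementation and it calls no vLLM or ROCm APIs. The names `ModelShape`, `kv_bytes_per_token`, `tokens_that_fit`, and `ToyPrefixCache` are hypothetical, and the block size and model dimensions are assumptions chosen only for illustration. The first half shows the arithmetic behind "8-bit values let more tokens fit in the cache"; the second half shows the general idea of reusing work for requests that share the same leading tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical model dimensions, used only for the arithmetic below."""
    num_layers: int
    num_kv_heads: int
    head_dim: int


def kv_bytes_per_token(shape: ModelShape, bytes_per_element: int) -> int:
    # Every layer stores one key vector and one value vector per KV head,
    # hence the leading factor of 2.
    elements = 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim
    return elements * bytes_per_element


def tokens_that_fit(budget_bytes: int, shape: ModelShape, bytes_per_element: int) -> int:
    # Number of whole tokens whose keys and values fit in the memory budget.
    return budget_bytes // kv_bytes_per_token(shape, bytes_per_element)


class ToyPrefixCache:
    """Toy prefix cache: remembers which leading token blocks were already seen.

    A real engine would store the computed keys and values for each block;
    this sketch only tracks which blocks exist so the reuse logic is visible.
    """

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        # Each entry is the full token prefix ending at a block boundary,
        # so a block only matches when everything before it matches too.
        self.seen_prefixes: set[tuple[int, ...]] = set()

    def lookup_and_insert(self, tokens: list[int]) -> int:
        """Return how many leading tokens could be reused, then record this prompt."""
        reused = 0
        still_matching = True
        # Only full blocks are cached; a trailing partial block is ignored.
        last_full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, last_full + 1, self.block_size):
            prefix = tuple(tokens[:end])
            if still_matching and prefix in self.seen_prefixes:
                reused = end
            else:
                still_matching = False
                self.seen_prefixes.add(prefix)
        return reused
```

With the assumed shape of 32 layers, 8 KV heads, and a head dimension of 128, one token needs 131,072 bytes at 2 bytes per element and 65,536 bytes at 1 byte per element, so a 1 GiB budget holds 8,192 tokens versus 16,384. In the toy cache, a second prompt that repeats the first eight tokens of an earlier one reports eight reusable tokens, while a prompt that diverges after the first block reports four.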