AdaLLM project enables NVFP4 inference on Ada Lovelace GPUs

A developer has released AdaLLM, an open-source project designed to make NVFP4 weights usable on Ada Lovelace GPUs like the RTX 4090. NVFP4 is a 4-bit floating-point format that reduces model size and can speed up inference. The project uses techniques such as an FP8 KV cache and custom FP8 decode kernels to optimize performance, and it builds on the vLLM library for LLM inference.
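For a rough sense of why an FP8 KV cache matters on a memory-limited card, the back-of-the-envelope estimate below compares per-token cache size at 16-bit and 8-bit precision. The model shape (32 layers, 8 KV heads, head dimension 128) and the 8 GiB cache budget are hypothetical values chosen for illustration, not figures from AdaLLM, and the estimate ignores any per-tensor scale metadata.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache needed for one token across all layers."""
    # Both keys and values are cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(budget_bytes, bytes_per_token):
    """How many tokens fit in a fixed KV-cache memory budget."""
    return budget_bytes // bytes_per_token


# Hypothetical model shape and cache budget (assumptions for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET_BYTES = 8 * 1024**3  # 8 GiB set aside for the KV cache

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)
fp8_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)

print("FP16 bytes/token:", fp16_per_token)
print("FP8 bytes/token: ", fp8_per_token)
print("FP16 tokens in budget:", max_cached_tokens(BUDGET_BYTES, fp16_per_token))
print("FP8 tokens in budget: ", max_cached_tokens(BUDGET_BYTES, fp8_per_token))
```

Under these assumptions the cache costs 128 KiB per token at FP16 and 64 KiB at FP8, so the same budget holds twice as many tokens.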

- NVIDIA's NVFP4 is a 4-bit floating-point format officially introduced with the Blackwell GPU architecture. Enabling it on the older Ada Lovelace architecture is a software achievement, because these GPUs lack native hardware support for it.
- The NVFP4 format uses a technique called block microscaling, in which groups of 4-bit values share a higher-precision scaling factor (an 8-bit float). This helps maintain model accuracy despite the aggressive quantization.
- An FP8 KV cache can reduce the memory footprint of the key-value cache by 50% compared to FP16, allowing larger batch sizes or longer context windows during inference.
- The project builds on vLLM, a high-throughput inference library known for efficient memory management through PagedAttention and for continuous batching of requests.
- Ada Lovelace GPUs feature fourth-generation Tensor Cores with support for FP8, while the newer Blackwell architecture includes fifth-generation Tensor Cores with dedicated hardware acceleration for FP4 operations.
- Quantizing the key-value cache from a standard 16-bit format to 8-bit roughly doubles the amount of context that can be stored in the same amount of memory; going to 4-bit would roughly quadruple it.
- vLLM integrates with high-performance kernels from libraries like FlashInfer to accelerate various parts of the inference process, including attention.
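The block microscaling idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not AdaLLM's or NVIDIA's implementation: it assumes 16-element blocks and the eight E2M1 magnitudes used by 4-bit floats, and it keeps each block's scale as an ordinary Python float instead of rounding it to an 8-bit float as the real format does. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (a sign bit is stored separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of values sharing one scale


def quantize_block(values):
    """Return (scale, codes): one shared scale plus one 4-bit-representable code per value."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block onto the largest FP4 magnitude (6.0).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for v in values:
        target = abs(v) / scale
        # Round to the nearest representable magnitude.
        mag = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        codes.append(-mag if v < 0 else mag)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a block's scale and codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Split a flat list into blocks and quantize each block independently."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


# Example: one block of 16 small weights.
weights = [0.01 * i for i in range(-8, 8)]
scale, codes = quantize(weights)[0]
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("scale:", scale)
print("codes:", codes)
print("max abs error:", max_error)
```

Because each block gets its own scale, one outlier only degrades precision for the values that share its block. With a 16-value block and an 8-bit scale, the storage cost works out to 4 + 8/16 = 4.5 bits per weight.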
