vLLM Unlocks Multi-LoRA Serving for MoE Models

The vLLM project has announced Multi-LoRA serving for Mixture-of-Experts (MoE) models, allowing a single base model to serve multiple LoRA adapters on one GPU. The feature, developed with Amazon Science, reportedly delivers 454% higher output throughput in tests on a GPT-OSS 20B model.

The core challenge in serving MoE models fine-tuned with multiple LoRA adapters is compound sparsity. MoE models route tokens to specialized "expert" networks, and multi-LoRA setups select different adapters for different requests, so the work in a batch is split along two dimensions at once. Standard vLLM kernels could not handle this combination. Amazon and vLLM developers created a new `fused_moe_lora` CUDA kernel that integrates LoRA's low-rank matrix operations directly into MoE expert routing and computation.

The update addresses a significant bottleneck in multi-tenant scenarios, where deploying each custom model separately leaves GPU capacity underutilized or idle. Letting a single base MoE model serve many LoRA adapters reduces infrastructure complexity and lowers costs by consolidating workloads that would otherwise require separate model deployments. This is particularly useful for enterprise applications that must support multiple customers, domains, or tasks from one shared model backend.

The reported gains are substantial: tests on a GPT-OSS 20B model showed a 454% improvement in output tokens per second and an 87% reduction in time-to-first-token compared with previous vLLM versions. The optimizations, which include Split-K for load balancing and CTA swizzling for better cache reuse, also improve performance for dense (non-MoE) models. They are available in vLLM version 0.15.0 and later.

Under the hood, efficient multi-LoRA serving relies on systems that dynamically load adapters from main memory to the GPU as needed for the current batch of requests, which avoids storing thousands of potentially large adapters in limited GPU VRAM. Techniques such as Unified Paging, inspired by operating-system memory management, use a single memory pool for both the KV cache and adapter weights to reduce fragmentation and I/O overhead.
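The shared-pool idea can be made concrete with a toy model. The sketch below is not vLLM's implementation. The class name `UnifiedPagePool`, the page counts, and the least-recently-used eviction policy are all assumptions chosen for illustration. It shows only the bookkeeping: KV-cache blocks and adapter weights draw pages from one budget, and idle adapters are evicted (to be reloaded from host memory later) when either side needs room.

```python
from collections import OrderedDict


class UnifiedPagePool:
    """Toy page pool shared by KV-cache blocks and LoRA adapter weights."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.kv_pages = {}             # request_id -> pages held by its KV cache
        self.adapters = OrderedDict()  # adapter_id -> pages held, oldest use first

    def _make_room(self, needed, pinned):
        """Evict least-recently-used adapters not in `pinned` until `needed` pages are free."""
        for adapter_id in list(self.adapters):
            if self.free_pages >= needed:
                break
            if adapter_id not in pinned:
                self.free_pages += self.adapters.pop(adapter_id)
        if self.free_pages < needed:
            raise MemoryError("page pool exhausted")

    def load_adapter(self, adapter_id, pages, pinned=()):
        """Make an adapter resident. Returns True if it had to be loaded."""
        if adapter_id in self.adapters:
            self.adapters.move_to_end(adapter_id)  # mark as recently used
            return False
        self._make_room(pages, set(pinned))
        self.free_pages -= pages
        self.adapters[adapter_id] = pages
        return True

    def grow_kv(self, request_id, pages, pinned=()):
        """Reserve KV-cache pages for a request from the same pool."""
        self._make_room(pages, set(pinned))
        self.free_pages -= pages
        self.kv_pages[request_id] = self.kv_pages.get(request_id, 0) + pages

    def finish_request(self, request_id):
        """Return a finished request's KV-cache pages to the pool."""
        self.free_pages += self.kv_pages.pop(request_id, 0)


pool = UnifiedPagePool(total_pages=10)
pool.load_adapter("tenant-a", pages=4)
pool.load_adapter("tenant-b", pages=4)
pool.grow_kv("req-1", pages=2)
pool.load_adapter("tenant-c", pages=4)  # no free pages left, so tenant-a is evicted
print(list(pool.adapters), pool.free_pages)
```

Because both kinds of allocation share one budget, no memory is permanently reserved for adapters that are not in use. A real system would also have to account for the cost of copying evicted adapters back from host memory.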
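The compound sparsity problem can also be illustrated in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the `fused_moe_lora` kernel: it assumes top-1 routing, assumes each adapter carries its own low-rank pair of matrices per expert, and uses hypothetical names (`group_tokens`, `moe_lora_forward`) and made-up weights. Each token's output is the expert's base projection plus the low-rank correction of the adapter its request selected, so tokens must be bucketed by (expert, adapter) pair.

```python
from collections import defaultdict


def matvec(matrix, vector):
    """Multiply a matrix stored as a list of rows by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def group_tokens(expert_ids, adapter_ids):
    """Bucket token indices by (expert, adapter) pair.

    The number of possible buckets is experts x adapters, which is why
    routing and adapter selection compound each other.
    """
    groups = defaultdict(list)
    for token, key in enumerate(zip(expert_ids, adapter_ids)):
        groups[key].append(token)
    return dict(groups)


def moe_lora_forward(tokens, expert_ids, adapter_ids, experts, lora_a, lora_b):
    """Compute W_e x + B (A x) per token, with A and B chosen by (expert, adapter)."""
    outputs = [None] * len(tokens)
    for (expert, adapter), members in group_tokens(expert_ids, adapter_ids).items():
        for t in members:
            x = tokens[t]
            base = matvec(experts[expert], x)                    # expert projection
            low_rank = matvec(lora_a[(expert, adapter)], x)      # down to rank r
            delta = matvec(lora_b[(expert, adapter)], low_rank)  # back up to hidden size
            outputs[t] = [b + d for b, d in zip(base, delta)]
    return outputs


# Two experts, two adapters, hidden size 2, LoRA rank 1 (all values made up).
experts = {0: [[1.0, 0.0], [0.0, 1.0]], 1: [[2.0, 0.0], [0.0, 2.0]]}
lora_a = {(e, a): [[1.0, 1.0]] for e in (0, 1) for a in (0, 1)}
lora_b = {(e, a): [[float(a)], [0.0]] for e in (0, 1) for a in (0, 1)}

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
expert_ids = [0, 1, 0, 1]   # router decision per token
adapter_ids = [0, 0, 1, 1]  # adapter chosen by each token's request

groups = group_tokens(expert_ids, adapter_ids)
outputs = moe_lora_forward(tokens, expert_ids, adapter_ids, experts, lora_a, lora_b)
print(len(groups), outputs)
```

With four tokens spread over two experts and two adapters, every token lands in its own bucket. Running the base projection and the two low-rank projections as separate small steps per bucket is the kind of fragmentation that fusing them into one kernel is meant to avoid.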
