Databricks Paper Details LLM Memory Breakthrough

A new research paper from Databricks outlines a method to cut the memory required for LLM training by 50%. The technique works by optimizing how parameters, gradients, and optimizer states are stored and updated, potentially making it much cheaper and more accessible to train powerful foundation models.

The Databricks research introduces "FlashOptim," a suite of techniques that directly compresses the memory footprint associated with deep learning optimizers. It targets the significant memory overhead of storing parameters, their gradients, and the state kept by optimizers like Adam, which together can demand roughly 16 bytes per model parameter. FlashOptim's efficiency comes from using improved float splitting and companded quantization for the optimizer state, then executing the entire optimizer step within a single "fused" kernel on the GPU. This approach avoids the common bottleneck of moving large amounts of data between different levels of memory, which often slows down alternative methods like CPU offloading.

In a supervised fine-tuning test on Llama-3.1-8B, the technique reduced peak GPU memory from 175GB to 113GB, about 35% lower. This was achieved by shrinking optimizer state memory by 61% and parameter memory by 50%, all while slightly reducing the optimizer step time from 12.5 to 11.5 milliseconds.

This method differs from other popular memory-saving strategies. Unlike parameter-efficient fine-tuning (PEFT) methods such as LoRA, FlashOptim updates all model weights, avoiding approximations that can limit performance on complex tasks. It also provides an alternative to sharding frameworks like FSDP, which depend on multi-GPU clusters that are often out of reach for smaller teams.

A 50% memory reduction fundamentally alters the hardware calculus for training. A 7-billion-parameter model, which at 16 bytes per parameter would typically require at least 112GB of accelerator memory and thus multiple GPUs, could be brought within reach of a single high-end GPU such as NVIDIA's 80GB H100.

For developers, FlashOptim is designed to be a drop-in replacement for standard PyTorch optimizers. This ease of implementation, requiring no changes to existing training loops or tuning strategies, is critical for rapid adoption by research labs and AI-focused startups.

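To see where the 112GB figure comes from, here is a back-of-the-envelope sketch in Python. The per-parameter breakdown (4 bytes each for weights, gradients, and Adam's two moment estimates) is an assumption, one common way to arrive at 16 bytes; the article does not spell out the split. The function and variable names are illustrative only, and the estimate ignores activations and framework overhead.

```python
# Rough estimate of training-state memory (weights, gradients, optimizer state).
# Assumption: a common 32-bit Adam layout; activations are NOT included.
BYTES_PER_PARAM_BASELINE = {
    "weights_fp32": 4,
    "gradients_fp32": 4,
    "adam_first_moment_fp32": 4,
    "adam_second_moment_fp32": 4,
}


def training_state_gb(num_params: int, bytes_per_param: float) -> float:
    """Return training-state memory in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


baseline_bytes = sum(BYTES_PER_PARAM_BASELINE.values())  # 16 bytes per parameter
num_params = 7_000_000_000  # the 7-billion-parameter model from the article
h100_gb = 80  # memory of the 80GB H100 mentioned in the article

baseline_gb = training_state_gb(num_params, baseline_bytes)
# Apply the headline 50% reduction to the per-parameter footprint.
reduced_gb = training_state_gb(num_params, baseline_bytes * 0.5)

print(f"Baseline training state: {baseline_gb:.0f} GB")
print(f"After a 50% reduction:   {reduced_gb:.0f} GB")
print(f"Fits on one {h100_gb}GB GPU before/after: "
      f"{baseline_gb <= h100_gb} / {reduced_gb <= h100_gb}")
```

Under these assumptions the baseline exceeds a single 80GB card while the halved footprint leaves headroom, though real runs also need memory for activations, so the actual margin depends on batch size and sequence length.
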
Lowering the hardware barrier to entry democratizes the training of powerful foundation models. This could accelerate the development of specialized, vertical-specific AI by startups in sectors like healthtech and fintech, where smaller teams can outmaneuver larger players by focusing on proprietary data and niche applications.
