Expert questions utility of LLM scaling law plots
AI researcher Stella Rose Biderman has cautioned that widely cited "scaling law" plots from OpenAI have little bearing on the actual dynamics of large-scale model training. Biderman argues that modern training loss curves exhibit more complex and less predictable behavior. Her argument implies that simplistic extrapolations from these plots are unreliable for capacity planning or model evaluation.
- The original 2020 OpenAI paper by Jared Kaplan and colleagues introduced the concept of power-law relationships between training loss and three key factors: model size (parameters), dataset size, and compute. This research suggested that performance improvements were predictable and that larger models were significantly more sample-efficient.
- Stella Rose Biderman is affiliated with EleutherAI and has co-authored papers on the Pythia suite of models and The Pile, an 800GB dataset for language modeling. Her work often focuses on the practical aspects and limitations of training large-scale models.
- Research from DeepMind in 2022, often called the "Chinchilla paper," challenged OpenAI's initial findings. It argued that for a given compute budget, many existing models like GPT-3 were too large and trained on insufficient data, suggesting that smaller models trained on more data could achieve better performance.
- Modern training loss curves often don't follow a smooth, predictable decrease. Instead, they can exhibit instability, with sudden jumps or dips, which can be caused by factors like a high learning rate or inconsistencies in the data pipeline.
- While scaling laws may predict pre-training loss in stable conditions, they are often unreliable for predicting performance on specific downstream tasks. A model's real-world capabilities can emerge in sudden jumps rather than smooth, predictable gains.
- The log-scale axes used in many scaling law plots can be misleading. They obscure the exponential growth in compute, data, and cost required for each linear improvement in model performance, making progress appear steadier than it is.
- There is a growing research interest in "downscaling" and finding more efficient training methods as an alternative to brute-force scaling. This includes strategies like better data pruning, architectural innovations, and developing smaller, more specialized models.
- Factors beyond just scale, such as data composition (e.g., the percentage of code in the training set) and architectural choices like using rotary versus learned embeddings, have a significant impact on a model's downstream performance.
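
The power-law and log-axis points above can be made concrete with a small sketch. It fits a curve of the form loss = a * compute^(-alpha) as a straight line in log-log space, then asks how much more compute a 10% loss reduction would need. The data is synthetic and noise-free, and the exponent (0.05), the prefactor (10.0), and the function names are illustrative assumptions, not figures from Biderman or from the papers mentioned. Real training curves, as noted above, are often far less tidy.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-alpha), done as a line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, alpha)

def compute_multiplier(alpha, loss_ratio):
    """Factor by which compute must grow for loss to fall to loss_ratio of its value."""
    return loss_ratio ** (-1.0 / alpha)

# Hypothetical, noise-free data: assumed values chosen only for illustration.
ALPHA_TRUE = 0.05
A_TRUE = 10.0
compute = [10.0 ** k for k in range(3, 9)]
loss = [A_TRUE * c ** (-ALPHA_TRUE) for c in compute]

a_fit, alpha_fit = fit_power_law(compute, loss)
multiplier = compute_multiplier(alpha_fit, 0.9)

print(f"fitted a = {a_fit:.3f}, alpha = {alpha_fit:.3f}")
print(f"compute needed for a 10% lower loss: about {multiplier:.1f}x")
```

Under these assumed values, a 10% reduction in loss costs roughly eight times the compute. On a log-log plot that multiplication appears as a short step along a straight line, which is why such plots can make progress look steadier than the underlying cost suggests. The fit also recovers the exponent exactly only because the data contains no noise, spikes, or pipeline inconsistencies.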