Learning mechanics framework predicts scaling

- Jamie Simon, Daniel Kunin and 12 co-authors posted an April 23 paper arguing deep learning is approaching a predictive science they call “learning mechanics.”
- The paper groups five research programs (toy models, tractable limits, scaling laws, hyperparameter theory and universality) as the basis for forecasting training outcomes.
- The pitch lands as researchers also circulate FP4, context parallelism and reasoning-control papers as practical design signals. (arxiv.org)

Deep learning still works mostly by expensive trial and error, and a new April 23 paper argues that could change with a predictive framework called “learning mechanics.” (arxiv.org)

The paper, “There Will Be a Scientific Theory of Deep Learning,” is by Jamie Simon, Daniel Kunin and 12 co-authors from UC Berkeley, Harvard, New York University, Stanford, the Flatiron Institute, the University of Pennsylvania and the Astera Institute. It says a scientific theory is emerging that can describe training dynamics, hidden representations, final weights and model performance. (arxiv.org)

The basic claim is simple: treat training less like alchemy and more like mechanics. Instead of only measuring whether a model got better, the authors want equations that predict how learning changes when width, depth, data, learning rate or initialization change. (arxiv.org) (learningmechanics.pub)

The authors organize that case around five lines of work. They point to solvable toy models, mathematically clean limits such as infinite width or depth, empirical scaling laws, theories that make hyperparameters transferable, and universal behaviors that recur across systems. (arxiv.org)

One example is maximal update parameterization, or µP, which rescales networks so hyperparameters found on a small model can transfer to a larger one. The paper presents that as evidence that at least some training choices obey stable rules instead of one-off recipes. (arxiv.org) (imbue.com)

Another example is scaling laws, the power-law patterns that relate performance to model size, data and compute. The authors argue those regularities look less like isolated benchmarks and more like the kind of coarse laws that usually appear before a field gets a fuller theory. (arxiv.org) (openreview.net)

That framing arrived alongside a burst of papers practitioners read as immediate engineering guidance. One April paper on 4-bit attention said stable 4-bit floating-point (FP4) training needs low-precision recomputation in the backward pass and fixes for precision assumptions inside FlashAttention-style gradients. (arxiv.org)

A separate context parallelism paper tackled long prompts by splitting attention work across many GPUs. The authors reported 1 million-token prefill for Llama 3 405B in 77 seconds on 128 Nvidia H100 GPUs, with 93% parallelization efficiency. (arxiv.org)

Another April paper asked whether models carry a shared internal “logical subspace” that aligns natural-language reasoning with symbolic reasoning. Its authors said steering activations in that shared space improved multi-step logic performance without attaching an external solver. (arxiv.org)

The new learning-mechanics paper does not claim those results are already unified under one finished theory. It argues the opposite: the field now has enough recurring laws, limits and transferable rules to start building one. (arxiv.org)

If that program works, the promise is narrower than the slogan “understand all intelligence,” and more concrete. It would mean predicting, before a full training run, which architectures, hyperparameters and system tricks are likely to scale. (arxiv.org)
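
To make the forecasting idea concrete, here is a minimal sketch of how a power-law scaling fit might be extrapolated from small runs to a larger one. It is an illustration only, not the paper's method: the data are synthetic, the constants `true_a` and `true_alpha` are assumed values, and the helper names `fit_power_law` and `predict_loss` are hypothetical. It also assumes the simplest form, loss ≈ a · size^(−alpha), with no irreducible-loss term, which published scaling-law fits often include.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-alpha) by least squares in log-log space."""
    # A pure power law is a straight line in log-log coordinates:
    # log(loss) = log(a) - alpha * log(size)
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return np.exp(intercept), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Evaluate the fitted power law at a new model size."""
    return a * size ** (-alpha)

# Synthetic "small run" results; these values are assumptions for illustration.
true_a, true_alpha = 10.0, 0.3
small_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
small_losses = true_a * small_sizes ** (-true_alpha)

# Fit on the small runs, then forecast a model 10x larger than any of them.
a, alpha = fit_power_law(small_sizes, small_losses)
forecast = predict_loss(a, alpha, 1e9)
print(f"fitted alpha = {alpha:.3f}, forecast loss at 1e9 parameters = {forecast:.4f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the assumed exponent. Real training runs are noisy and may bend away from a single power law, so an extrapolation like this is a rough forecast and not a guarantee.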
