Shorya Mishra flags deep learning paper
- arXiv got a new manifesto on April 23: Jamie Simon, Daniel Kunin, and 12 coauthors argued deep learning is becoming a predictive science.
- The paper names that emerging program “learning mechanics” and says five research strands already explain training dynamics, scaling behavior, and hyperparameter transfer.
- If that holds up, model building gets less like alchemy and more like engineering.

Deep learning is still weirdly primitive for something so economically important. We can train giant models that write code and reason through problems, but a lot of the process still runs on folklore, scaling heuristics, and expensive trial and error. That is the gap this new paper is trying to close. On April 23, Jamie Simon, Daniel Kunin, and 12 coauthors posted an arXiv paper arguing that a real scientific theory of deep learning is starting to emerge, and they want to give that program a name: “learning mechanics.” (arxiv.org)

### What is the actual claim?

The paper is not claiming the theory already exists in finished form. It is making a stronger-than-usual coordination move: the authors say several separate lines of theory are converging into one discipline that should explain training dynamics, hidden representations, final weights, and performance with quantitative, falsifiable predictions. [...] should become something engineers can forecast, not just tune by feel. (arxiv.org)

### Why does that matter?

Because today’s workflow is expensive. Teams often discover basic facts about a model (whether a learning rate is stable, whether a width change breaks transfer, whether extra compute will pay off) by burning compute and seeing what happens. The paper’s premise is that this is not a permanent condition. If theory can predict those macroscopic [...] training runs with far less waste. That is the practical stake hiding inside a very academic title. (imbue.com)

### What do they mean by “learning mechanics”?

They mean a physics-like view of training. Not a theory of every neuron in full detail, but a theory of the coarse, repeatable statistics that matter at scale. Think less “explain every molecule in a hurricane” and more “derive the weather patterns that keep showing up.” [...] sharpness, representation change, and other training-level regularities. (arxiv.org)

### What evidence do they think already exists?

They group the evidence into five buckets: solvable toy settings, tractable mathematical limits, simple laws for macroscopic observables, theories that isolate hyperparameters from the rest of training, and universal behaviors that recur across architectures and tasks. That list matters because it shifts the pitch from [...] “pieces of the explanation are already on the table.” (arxiv.org)

### So is this about scaling laws?

Partly, yes, but not only that. Scaling laws are one of the cleanest examples because they show stable power-law relationships between compute, data, model size, and loss. But the paper’s ambition is broader. It also points to things like edge-of-stability behavior and parameterization schemes such as µP, which aim to make hyperparameters [...] larger ones. That is the difference between a catchy observation and an engineering framework. (imbue.com)

### Are they rejecting older theory?

Not exactly. They do argue that some famous limits, especially infinite-width, near-linearized views like the classic NTK regime, miss too much of what makes deep learning interesting, especially feature learning and strongly nonconvex behavior. But the paper does not throw those tools away. It treats them as partial regimes that reveal structure, not as the final story. (arxiv.org)

### What’s the catch?

The catch is that this is still a manifesto paper. It synthesizes, names, and organizes a field more than it proves one grand theorem. The open problems are still large, including where scaling exponents come from, how nonlinear training stays stable, and how far these regularities generalize across model classes. So the paper is best read as a [...] research program. (arxiv.org)

### Why are people paying attention now?

Because the timing makes sense. Frontier-model training is now so costly that even modest improvements in predictability are valuable.
And unlike a few years ago, there is now enough empirical regularity, especially around scaling and transfer, for theory to grab onto. That is why this paper landed as more than philosophy. [...] its “from craft to engineering” phase. (imbue.com)

The bottom line is simple: this paper does not solve deep learning theory. But it does something important anyway. It says the field should stop treating predictability as a fantasy and start treating it as a buildable discipline. If that framing sticks, the long-term win is not elegance. It is fewer blind billion-parameter guesses.
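
For readers who want to see what “forecast, not just tune by feel” could look like in the simplest case, here is a minimal sketch of the scaling-law idea mentioned above: fit a power law to a few small runs, then extrapolate. Everything in it is an assumption made for illustration. The function names `fit_power_law` and `predict_loss` are hypothetical, the functional form `loss ≈ a * compute**(-b)` is the simplest possible choice rather than the paper’s, and the data points are synthetic numbers generated from a made-up law, not measurements from any real model.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) by least squares in log-log space."""
    # A power law is a straight line in log-log coordinates:
    # log(loss) = log(a) - b * log(compute)
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * np.asarray(compute, dtype=float) ** (-b)


# Synthetic, noise-free points from an invented law (a=10, b=0.05).
# These are illustrative only and do not come from the paper.
compute = np.array([1e15, 1e16, 1e17, 1e18])
loss = 10.0 * compute ** (-0.05)

a, b = fit_power_law(compute, loss)
forecast = float(predict_loss(a, b, 1e20))
print(f"a={a:.3f}, b={b:.3f}, forecast loss at 1e20: {forecast:.3f}")
```

Real training curves are noisy and often need extra terms, such as an irreducible-loss offset, so a fit like this is a starting point for a sanity check rather than a reliable forecast.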