Why huge LLMs still reason
Researchers and commenters have proposed a theory of how very large language models, those with 10^13 or more parameters, can preserve rare logical signals even as they scale. The theory rests on mathematical mechanisms rather than toy intuition, and it names three technical ingredients as the drivers of that preserved signal: Fisher Information Matrix (FIM) decoupling, SGD’s flatness bias, and subspace convergence for compositional generalization (x.com).
Large language models can keep a weak “reasoning” signal alive even as they scale, according to a recent theory that ties the effect to training math rather than intuition. (openreview.net)

A language model predicts the next token from patterns in huge text corpora, and researchers have long argued over whether larger models truly reason or just imitate reasoning-shaped text. Surveys and benchmark papers describe the debate, while scaling papers show some task abilities appear abruptly as models get bigger. (aclanthology.org) (openreview.net)

The new explanation centers on three ingredients named in the discussion around the theory: Fisher Information Matrix decoupling, stochastic gradient descent’s flatness bias, and subspace convergence for compositional generalization. In plain terms, those ideas say some parameter directions carry important signal, training tends to keep broad stable solutions, and related reasoning skills can line up inside a smaller shared space. (proceedings.neurips.cc) (arxiv.org) (research.google)

The Fisher Information Matrix is a sensitivity map: it measures which parameter changes alter a model’s predicted distribution and which changes barely matter. A NeurIPS 2021 paper describes it as information carried by observations along directions in parameter space, which is why commenters use it here to separate rare logical directions from the bulk of less relevant ones. (proceedings.neurips.cc)

Stochastic gradient descent, the standard training method for deep networks, does not wander evenly through all solutions. Multiple papers report that it prefers flatter minima, meaning settings where small weight changes do not sharply increase loss, and one 2022 analysis gives a stability bound linking accessible minima to bounded sharpness. (arxiv.org) (link.aps.org)

Compositional generalization is the ability to handle new combinations of familiar parts, like applying a learned rule to a novel sentence or equation.
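To make "new combinations of familiar parts" concrete, here is a minimal Python sketch of a held-out-combination split. It is a toy illustration only: the vocabulary and the helper names `compositional_split` and `parts_covered` are hypothetical, and this is not how any published benchmark builds its splits.

```python
from itertools import product


def compositional_split(verbs, modifiers, held_out):
    """Split every (verb, modifier) pair so held-out pairs appear only in test."""
    held = set(held_out)
    all_pairs = list(product(verbs, modifiers))
    train = [pair for pair in all_pairs if pair not in held]
    test = [pair for pair in all_pairs if pair in held]
    return train, test


def parts_covered(train, test):
    """Return True if each verb and modifier used in test also occurs in train."""
    train_verbs = {verb for verb, _ in train}
    train_mods = {mod for _, mod in train}
    return all(v in train_verbs and m in train_mods for v, m in test)


# Hypothetical toy vocabulary, chosen only for illustration.
verbs = ["jump", "walk", "run"]
modifiers = ["once", "twice", "thrice"]

train, test = compositional_split(
    verbs, modifiers, held_out=[("jump", "thrice")]
)
```

In this split the model sees "jump" and "thrice" during training, but never together, so a correct answer on the test pair requires recombining known parts rather than recalling a memorized pair.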
Google Research’s CFQ benchmark was built to test exactly that kind of recombination in semantic parsing, and it became a standard reference for whether models can extend beyond memorized templates. (research.google) (github.com)

Put together, the theory says very large models do not need every parameter to encode logic directly. If the useful reasoning features sit in partly decoupled directions, and training keeps returning to wide stable basins, then those features can survive inside a lower-dimensional subspace even as the full model grows by orders of magnitude. (proceedings.neurips.cc) (arxiv.org)

That framing also fits a broader shift in the field away from toy stories about “emergence” and toward distributional and optimization-based accounts. A 2025 paper on emergent capabilities argues that sudden jumps can come from continuous changes in the distribution of training outcomes, not a single hidden switch flipping on. (arxiv.org)

The catch is that this is still an explanation, not a settled law of model behavior. Researchers still disagree on how much benchmark performance reflects genuine reasoning, how much comes from data contamination or prompting, and which mathematical objects best capture the signal inside trillion-parameter systems. (aclanthology.org) (openreview.net)

What the theory offers is a cleaner answer to a persistent question: why making a model vastly bigger does not automatically wash out the rare structures that support logic-like behavior. It says scale can preserve those structures when training dynamics keep finding the same stable directions. (arxiv.org) (proceedings.neurips.cc)
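The "sensitivity map" reading of the Fisher Information Matrix can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions, not the construction used in the theory: it assumes a tiny softmax-linear classifier p(y|x) = softmax(Wx), random inputs, and a hypothetical helper named `fisher_information`. It shows only the general idea that some parameter directions change the predictions a lot while others do not change them at all.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def fisher_information(W, X):
    """Exact Fisher matrix of p(y|x) = softmax(W @ x), averaged over rows of X.

    Parameters are W flattened row-major, so the result is (k*d, k*d).
    """
    k, d = W.shape
    F = np.zeros((k * d, k * d))
    for x in X:
        p = softmax(W @ x)
        # Fisher information with respect to the logits.
        S = np.diag(p) - np.outer(p, p)
        # Chain rule through logits = W @ x gives a Kronecker product.
        F += np.kron(S, np.outer(x, x))
    return F / len(X)


rng = np.random.default_rng(0)
k, d = 3, 4  # assumed toy sizes: 3 classes, 4 input features
W = rng.normal(size=(k, d))
# Assumed toy inputs: feature 0 varies a lot, feature 3 barely at all.
X = rng.normal(size=(200, d)) * np.array([3.0, 1.0, 0.1, 0.01])

F = fisher_information(W, X)
eigvals = np.linalg.eigvalsh(F)  # ascending; large values = sensitive directions

# Adding the same vector to every row of W shifts all logits equally,
# which leaves the softmax output unchanged: a zero-sensitivity direction.
flat_direction = np.kron(np.ones(k), rng.normal(size=d))
```

Sorting `eigvals` separates directions the predictions are highly sensitive to from directions with little or no effect, and `F @ flat_direction` is numerically zero. Whether rare logical features in real models occupy such separable directions is the theory's claim, not something this toy demonstrates.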