Why big LLMs still 'hear' rare logic

A new thread proposes a theory for how very large language models, those with more than 10¹³ parameters, can preserve rare logical signals instead of washing them out during training. The post ties the effect to mechanisms such as Fisher information matrix (FIM) decoupling, stochastic gradient descent preferring flat minima, and predictive utility driving subspace convergence and compositional generalization. (x.com) (x.com)

Language models learn by nudging billions of numbers so the next word gets easier to predict, and a new April 2026 thread argues that very large models can keep rare logical patterns instead of averaging them away. (x.com) The post says the effect shows up in models with more than 10¹³ parameters, then links that scale claim to three ideas from machine-learning theory: parameter directions that decouple, stochastic gradient descent favoring flatter solutions, and convergence in useful subspaces. (x.com)

One of those ideas starts with the Fisher information matrix, a tool researchers use to ask which parameter changes would most alter a model’s predicted probabilities; in plainer terms, it maps which knobs matter together and which mostly move on their own. NeurIPS papers and tutorials describe Fisher-based methods as a way to capture local geometry, while also warning that the common empirical approximation can misstate true second-order structure. (proceedings.neurips.cc) (arxiv.org)

Another idea is “flat minima,” meaning solutions where many small parameter changes do not sharply worsen loss; several papers tie stochastic gradient descent, the noisy optimizer used in deep learning, to a tendency to escape sharper basins and settle in flatter ones. Papers on OpenReview and in Proceedings of the National Academy of Sciences both describe that link between gradient noise and flatter, better-generalizing solutions. (openreview.net) (pnas.org)

The thread’s claim is that, at very large scale, rare logical features may occupy their own partly separated directions in parameter space, so training on common text patterns does not fully overwrite them. That is a hypothesis, not an established result: the thread does not present a peer-reviewed experiment proving the 10¹³ threshold. (x.com)

That hypothesis sits inside a broader argument over what large language models actually learn when they appear to reason.
Researchers have reported compositional generalization in some settings, where a model handles new combinations of familiar instructions, but they also find uneven performance depending on task design and instruction-following setup. (aclanthology.org 1) (aclanthology.org 2)

Scale is part of that backdrop. Survey papers describe modern large language models as systems with billions of parameters trained on massive text corpora, and classic optimization work shows wider networks often have better-behaved loss surfaces than narrower ones. (arxiv.org) (proceedings.mlr.press)

The caution is that each piece of the argument comes from a different literature. Fisher information matrix structure, flat-minima dynamics, and compositional generalization are all active research areas, and one of the best-known Fisher papers explicitly argues that some popular approximations should not be treated as faithful curvature estimates. (proceedings.neurips.cc)

So the thread is best read as a synthesis: a proposal for why giant models might still “hear” infrequent logical signals after long training runs, built from existing ideas rather than a new benchmark paper. The next step is the part the post leaves open: showing, with direct measurements on frontier-scale systems, whether those rare signals really stay isolated enough to survive. (x.com 1) (x.com 2)
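
For readers who want to see what "knobs that matter together" means in practice, here is a toy sketch of a Fisher information matrix for a two-parameter logistic model. It is not the thread's method and says nothing about models at the 10¹³ scale. The function names (`sigmoid`, `fisher_matrix`), the weights, and the input sets are all made up for illustration. The only assumption is the standard result that, for a logistic model with a yes/no output, the Fisher matrix is the average of p(1 - p) x xᵀ over the inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_matrix(weights, inputs):
    """Exact Fisher information of a toy logistic model, averaged over inputs.

    For p = sigmoid(w . x) with a Bernoulli output, the Fisher matrix is
    the average of p * (1 - p) * x x^T over the inputs.
    """
    n = len(weights)
    fisher = [[0.0] * n for _ in range(n)]
    for x in inputs:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        scale = p * (1.0 - p) / len(inputs)
        for i in range(n):
            for j in range(n):
                fisher[i][j] += scale * x[i] * x[j]
    return fisher


# Hypothetical weights: index 0 is a "common" feature, index 1 a "rare" one.
weights = [0.5, -1.0]

# Case 1: the rare feature fires on 1 input in 10, never with the common one.
separate_inputs = [(1.0, 0.0)] * 9 + [(0.0, 1.0)]

# Case 2: the two features always fire together.
shared_inputs = [(1.0, 1.0)] * 10

decoupled = fisher_matrix(weights, separate_inputs)
coupled = fisher_matrix(weights, shared_inputs)

print("off-diagonal when features never co-occur:", decoupled[0][1])
print("off-diagonal when features always co-occur:", coupled[0][1])
print("rare feature's own diagonal entry:", decoupled[1][1])
```

In the first case the off-diagonal entry is exactly zero, so the two parameters are decoupled in the Fisher sense, while the rare feature still has its own small but nonzero diagonal entry. In the second case the off-diagonal entry is positive, meaning the two knobs move predictions together. Whether anything like the first case holds for rare logical features in real frontier-scale models is exactly what the thread leaves unmeasured.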
