‘Lost in Backpropagation’ flags LM head drain
A new paper argues that the LM head creates a 'gradient bottleneck' that can lose up to 99% of the training signal in large-vocabulary models. If the diagnosis holds, it would help explain slow convergence and point toward architectural fixes for efficient large-scale training. It is directly relevant to teams training models with very large vocabularies and could change how output heads are designed for faster convergence.
The preprint by Nathan Godey and Yoav Artzi, listed with a Cornell University affiliation, was submitted to arXiv on March 12, 2026 (arxiv.org). In controlled experiments, the authors constrained the rank of the output linear layer to emulate reduced head capacity and showed that doing so slows convergence for transformer backbones in 2-billion-parameter training runs (arxiv.org). A targeted “SpamLang” probe in the paper’s supplemental tests found that models with vocabularies above 100,000 tokens could not learn a single-symbol repetition rule during pretraining, and the authors report that the effect persisted whether embeddings were tied or untied (emergentmind.com). The manuscript argues for redesigned LM heads and accompanying heuristics; reviewers’ summaries highlight candidate remedies such as explicit higher-rank parameterizations, logits-preserving modules, and head-specific scaling rules that preserve the backward signal without increasing dataset size (paperium.net).
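To give a rough feel for why a narrow output head can discard gradient signal, the following sketch measures how much of a vocabulary-sized gradient survives the trip back through a head matrix. It is an illustration under stated assumptions, not the authors' experimental setup: the helper names `low_rank_head` and `surviving_gradient_fraction` are hypothetical, the dimensions are toy values, and the logit gradients are random Gaussian vectors rather than real softmax cross-entropy gradients.

```python
import numpy as np

def low_rank_head(d_model, vocab_size, rank, rng):
    """Hypothetical rank-constrained LM head W = B @ A, shape (vocab_size, d_model)."""
    A = rng.standard_normal((rank, d_model)) / np.sqrt(d_model)
    B = rng.standard_normal((vocab_size, rank)) / np.sqrt(rank)
    return B @ A

def surviving_gradient_fraction(W, logit_grad):
    """Share of the squared norm of logit_grad that lies in the column space of W.

    Backpropagation passes W.T @ logit_grad to the backbone, so any component
    of logit_grad orthogonal to the column space of W contributes nothing.
    """
    U, s, _ = np.linalg.svd(W, full_matrices=False)
    basis = U[:, s > 1e-8 * s.max()]      # orthonormal basis of the column space
    projected = basis.T @ logit_grad
    return float(projected @ projected / (logit_grad @ logit_grad))

def mean_fraction(W, n_samples, rng):
    """Average surviving fraction over random (assumed Gaussian) logit gradients."""
    grads = rng.standard_normal((n_samples, W.shape[0]))
    return float(np.mean([surviving_gradient_fraction(W, g) for g in grads]))

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 2000            # toy sizes, chosen only for speed

W_full = low_rank_head(d_model, vocab_size, rank=d_model, rng=rng)
W_low = low_rank_head(d_model, vocab_size, rank=8, rng=rng)

full_frac = mean_fraction(W_full, 16, rng)
low_frac = mean_fraction(W_low, 16, rng)

print(f"full-rank head keeps about {full_frac:.1%} of the gradient norm squared")
print(f"rank-8 head keeps about {low_frac:.1%} of the gradient norm squared")
```

For random gradients, the surviving share is expected to be close to the head's rank divided by the vocabulary size, which is why lowering the rank or growing the vocabulary shrinks it. Real training gradients are structured rather than random, so the numbers printed here should not be read as a reproduction of the paper's 99% figure.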