Unexpected LLM training tips
Researcher Shiwei Liu tweeted two surprising takeaways: weight decay helps deeper layers train effectively, even though overfitting is rarely a concern, and Mixture‑of‑Experts sparse connectivity improves signal propagation through depth. (x.com) Both points challenge prevailing views on scaling width versus depth and could alter training recipes for next‑gen LLMs. (x.com)
The claims come from an arXiv preprint titled "When Does Sparsity Mitigate the Curse of Depth in LLMs," submitted March 16, 2026, by Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann and Shiwei Liu. (arxiv.org) The work is listed as an ICLR 2026 DeLTa workshop poster and is recorded on OpenReview with the same author list and submission metadata. (openreview.net) The paper combines theoretical analysis with controlled depth‑scaling experiments to show that implicit sparsity (weight decay and long‑context inputs) and explicit sparsity (Grouped‑Query Attention and Mixture‑of‑Experts) consistently damp residual‑stream variance in Pre‑LayerNorm Transformers, restoring functional differentiation in deeper blocks. (arxiv.org)
The authors published code and a project page named SparsityAndCoD that reproduces their layer‑effectiveness measurements and the targeted interventions used in the experiments. (pumpkin-co.github.io) The paper reports that their distilled "rule‑of‑thumb" training recipe yields a 4.6% accuracy improvement on downstream tasks over depth‑ineffective baselines. (arxiv.org) The study builds on prior analyses of the "curse of depth" in LLMs, and its authors list affiliations including the Max Planck Institute for Intelligent Systems. (paperium.net)
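To make the residual‑stream variance claim concrete, here is a minimal toy simulation in NumPy. It is not the authors' code and does not reproduce their experiments. The function name residual_variance, the random Gaussian weights, the weight_scale knob (a crude stand‑in for the weight shrinkage that weight decay encourages) and the keep_prob mask (a crude stand‑in for sparse connectivity, not an actual Mixture‑of‑Experts router) are all assumptions made for illustration.

```python
import numpy as np

def residual_variance(depth, width, weight_scale, keep_prob=1.0, seed=0):
    """Toy Pre-LN residual stack: x <- x + (W * mask) @ layernorm(x).

    Hypothetical illustration only. weight_scale mimics smaller weights
    (as weight decay would encourage); keep_prob mimics sparse connectivity.
    Returns the residual-stream variance after each block.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    variances = []
    for _ in range(depth):
        # Pre-LN: normalize the input to the block, not the residual stream.
        h = (x - x.mean()) / (x.std() + 1e-6)
        # Random block weights, scaled so a dense block adds ~weight_scale**2 variance.
        W = rng.standard_normal((width, width)) * weight_scale / np.sqrt(width)
        # Randomly drop connections; no rescaling, so sparser means a smaller update.
        mask = rng.random((width, width)) < keep_prob
        x = x + (W * mask) @ h
        variances.append(x.var())
    return np.array(variances)

dense = residual_variance(depth=32, width=256, weight_scale=1.0)
small = residual_variance(depth=32, width=256, weight_scale=0.5)
sparse = residual_variance(depth=32, width=256, weight_scale=1.0, keep_prob=0.25)

print("final variance, dense:        ", round(float(dense[-1]), 2))
print("final variance, small weights:", round(float(small[-1]), 2))
print("final variance, sparse:       ", round(float(sparse[-1]), 2))
```

In this toy setting the unnormalized residual stream accumulates variance with every block, and both smaller weights and sparser connections slow that accumulation. Real Transformers add attention, nonlinearities and learned weights, so the sketch only illustrates the direction of the effect the paper describes, not its size.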