Frozen LM explains itself
A reported experiment found that a frozen language model with a tiny adapter described its own internal features more accurately than human labels did, scoring 71% versus 63%, and that the result generalized across layers and multi-hop reasoning tasks (x.com). The post frames the adapter as a lightweight interpretability tool that transfers across tasks without retraining the whole model (x.com).
Language models turn text into long lists of numbers inside each layer, and researchers have been trying to translate those hidden patterns back into plain English. A February 12, 2026 preprint reports that a frozen model plus a tiny trained adapter did that more accurately than the human-written labels it learned from. (arxiv.org)
The paper, by Keenan Pepper and colleagues at AE Studio and Princeton, says a scalar affine adapter with just *d*_model + 1 parameters beat the original training labels on a generation-scoring test at 70 billion parameters: 71% accuracy versus 63%. The authors kept the language model’s weights fixed and trained only the small mapping layer. (arxiv.org)
The same abstract says the trained adapters identified topics with 94% recall at 1, compared with 1% for untrained baselines, and decoded “bridge entities” in multi-hop reasoning prompts even when those entities appeared in neither the prompt nor the response. The authors present that as evidence the method can surface intermediate reasoning without asking the model to print chain-of-thought. (arxiv.org)
This work sits inside mechanistic interpretability, a field that tries to map model internals the way a circuit diagram maps a radio. Recent systems such as SelfIE and Patchscopes also ask models to describe their own hidden states in natural language, but this new paper argues those prompt-only methods are sensitive to setup choices such as scaling. (arxiv.org 1) (arxiv.org 2) (arxiv.org 3)
The technical move is small but specific: instead of retraining the whole model, the adapter maps an internal activation vector into the model’s token-embedding space, where the frozen model can continue the description in words. The paper says the adapter was trained on existing interpretability artifacts, including sparse autoencoder feature labels and contrastive activation vectors. (arxiv.org)
Sparse autoencoders are another interpretability tool: they break dense internal activations into a larger set of more separable features, each paired with example texts and a human label. The new paper’s claim is that a lightweight adapter can learn from those vector-label pairs and then produce labels that score better than the original human annotations. (arxiv.org)
The authors also report that the learned bias vector alone explained 85% of the improvement, and that simpler adapters generalized better than more expressive ones. They say self-interpretation gains increased from 7 billion to 72 billion parameters even after controlling for the model’s general knowledge through prompted descriptions. (arxiv.org)
The paper is still a preprint, and the public discussion around it so far appears to be driven largely by social posts rather than peer-reviewed follow-up. For now, the reported result is narrower than “the model understands itself”: a frozen model, with a very small add-on, generated labels for internal features that matched the paper’s scoring setup better than the human labels it started from. (arxiv.org)
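
For readers who want to see how small the adapter described above is, here is a minimal sketch in Python. It is not the authors' code. It assumes that "scalar affine" means one learned scale shared across all dimensions plus one learned bias vector, which is one reading consistent with the reported *d*_model + 1 parameter count. The class name, the toy dimension of 4, and all numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalarAffineAdapter:
    """Hypothetical sketch: soft_token = scale * activation + bias."""

    scale: float       # a single scalar shared by every dimension
    bias: List[float]  # one offset per dimension (length d_model)

    def num_parameters(self) -> int:
        # One scalar plus d_model bias entries.
        return 1 + len(self.bias)

    def __call__(self, activation: List[float]) -> List[float]:
        if len(activation) != len(self.bias):
            raise ValueError("activation and bias must both have length d_model")
        # Map the hidden activation toward the token-embedding space.
        return [self.scale * a + b for a, b in zip(activation, self.bias)]


d_model = 4  # toy size; real models use thousands of dimensions
adapter = ScalarAffineAdapter(scale=2.0, bias=[0.5, -0.5, 0.0, 1.0])
activation = [1.0, 2.0, -1.0, 0.0]  # made-up hidden activation

soft_token = adapter(activation)
print(soft_token)                # [2.5, 3.5, -2.0, 1.0]
print(adapter.num_parameters())  # 5, i.e. d_model + 1
```

In the setup the article describes, a vector like `soft_token` would stand in for a token embedding so the frozen model can continue a description in words. That step needs a real model and is left out of this sketch.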