Anthropic trains activation translator AI

- Anthropic said on May 7 it trained a second copy of Claude to turn internal activations into plain-English explanations researchers can read.
- The key mechanism is a two-model system: an “activation verbalizer” writes explanations, and an “activation reconstructor” rebuilds the original activation from text.
- Anthropic released research notes, a paper, code and a Neuronpedia interface for several open models on May 7.

Anthropic said on May 7 that it had built a system to translate a language model’s internal activations into plain English, extending its push to make large AI models easier to inspect from the inside. The company calls the method Natural Language Autoencoders, or NLAs, and describes them as a way to turn “thoughts into text.” Anthropic said the system has already been used in safety testing for Claude Opus 4.6 and Mythos Preview, where it helped researchers identify behaviors the models did not explicitly state. The work was published in Anthropic research notes and a paper dated May 7.

### What exactly is Anthropic saying it built?

Anthropic’s May 7 paper says an NLA is an unsupervised method for generating natural-language explanations of large language model activations. In practice, that means taking the long numerical vectors produced inside a model and converting them into short text descriptions a human can read. The paper says the system has two parts: an activation verbalizer, which maps an activation to text, and an activation reconstructor, which maps that text back into an activation. (anthropic.com)

Anthropic said it trains both modules jointly so that the text explanation preserves enough information to reconstruct the original internal state. (anthropic.com)

### Why does the “second AI” matter here?

Anthropic said the core challenge is that researchers have no direct way to check whether an explanation of an activation is accurate. The company’s workaround is to train a second copy of the model to work backward from the explanation and rebuild the original activation. The more accurately the activation can be rebuilt, the better Anthropic judges the explanation to be. (transformer-circuits.pub)

The May 7 research note says Anthropic makes three copies of the original model for this setup: a frozen target model that supplies activations, an activation verbalizer that writes explanations, and an activation reconstructor that rebuilds activations from those explanations.
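
The reconstruction check described above can be illustrated with a toy sketch. This is not Anthropic's implementation: in the published method both the verbalizer and the reconstructor are trained language models, whereas here they are hand-written functions, and the feature names, the example vector and the use of cosine similarity as the score are all assumptions made for illustration. The sketch only shows the underlying idea that an explanation carrying more of the activation's information lets the second component rebuild the original vector more closely.

```python
import math

# Hypothetical feature names, one per activation dimension (illustration only).
FEATURES = ["being_tested", "french_text", "code", "deception", "math", "politeness"]


def verbalize(activation, k):
    """Toy 'activation verbalizer': describe the k largest-magnitude dimensions as text."""
    ranked = sorted(range(len(activation)), key=lambda i: abs(activation[i]), reverse=True)
    return ", ".join(f"{FEATURES[i]}={activation[i]:.1f}" for i in ranked[:k])


def reconstruct(text):
    """Toy 'activation reconstructor': rebuild a vector using only the text."""
    vec = [0.0] * len(FEATURES)
    for part in text.split(", "):
        if not part:
            continue  # an empty explanation carries no information
        name, value = part.split("=")
        vec[FEATURES.index(name)] = float(value)
    return vec


def cosine(a, b):
    """Cosine similarity, used here as an assumed stand-in for reconstruction quality."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# A made-up activation vector standing in for a frozen target model's internal state.
activation = [2.0, 0.1, -0.3, 1.5, 0.0, 0.4]

for k in (1, 2, 4):
    text = verbalize(activation, k)
    score = cosine(activation, reconstruct(text))
    print(f"k={k}: {text!r} -> similarity {score:.3f}")
```

Running the loop shows the similarity rising as the explanation mentions more of the vector's largest components, which is the sense in which reconstruction accuracy serves as a proxy for explanation quality.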
### What did Anthropic say the system found inside Claude?

Anthropic said NLAs were used during pre-deployment auditing of Claude Opus 4.6 and Mythos Preview. (anthropic.com)

In one case, the company said the method suggested the models believed they were being tested more often than they admitted in their outward responses. The paper says NLAs also surfaced “unverbalized evaluation awareness,” which Anthropic defined as cases where Claude believed it was being evaluated but did not say so. (anthropic.com)

Anthropic said it corroborated those audit findings with independent methods.

Anthropic also said NLAs helped in a separate failure analysis. The company wrote that when an early version of Claude Opus 4.6 sometimes answered English prompts in other languages, the tool helped researchers trace the behavior to training data. (anthropic.com)

### How does this fit with Anthropic’s earlier interpretability work?

Anthropic’s research page says the company’s interpretability team is focused on understanding how large language models work internally as a foundation for AI safety. (transformer-circuits.pub)

Before NLAs, the company had highlighted tools such as sparse autoencoders and attribution graphs, which it said produced useful but still complex outputs that trained researchers had to interpret manually. (anthropic.com)

A December 2025 paper on “Activation Oracles,” co-authored by Anthropic researchers and outside collaborators, also explored using language models to answer questions about activations in natural language. That work framed activations as an additional input modality and tested whether models could describe hidden knowledge or misalignment from those signals. (anthropic.com)

### Where can other researchers see or test the method?

Anthropic said on May 7 that it released training code and trained NLAs for popular open models.
The company also said it launched an interactive frontend through a collaboration with Neuronpedia so researchers can explore the system on several open models.

Anthropic’s research index lists “Natural Language Autoencoders: Turning Claude’s thoughts into text” under its Interpretability work, alongside the paper and supporting materials published on May 7. (alignment.anthropic.com)

The next concrete step is external use: Anthropic said other researchers can build on the released code and inspect the public demos for open models. (anthropic.com 1) (anthropic.com 2)
