Anthropic Supervises Alignment

Anthropic published research describing an Automated Alignment Researcher experiment in which Claude Opus 4.6 works on the problem of supervising stronger models, with the aim of accelerating alignment research. The effort frames Claude Opus 4.6 as a tool for recursively improving model safety and oversight. (x.com/AnthropicAI/status/2044138481790648323)

Anthropic said on April 14 that it used Claude Opus 4.6 to run an “Automated Alignment Researcher” experiment on supervising stronger models. (anthropic.com)

The basic problem is simple to state and hard to solve: if a future model is smarter than its human supervisors, people may not be able to reliably judge whether its answers are safe or honest. Anthropic calls that problem “scalable oversight.” (anthropic.com)

The experiment used a weaker model as a “teacher” for a stronger base model, then measured how much of the stronger model’s potential performance the training recovered. Anthropic scores that on a 0-to-1 scale called “performance gap recovered,” where 0 means no gain beyond the weak teacher and 1 means matching the strong model’s ideal performance. (anthropic.com) Anthropic said the new test asked whether Claude could do more than label examples. The model was used to generate ideas, run tests, and analyze results in an attempt to improve weak-to-strong supervision on its own. (anthropic.com)

The company is pushing this line of research as frontier models take on longer, more autonomous work. In May 2025, Anthropic introduced Claude Opus 4 with “extended thinking” and tool use, and on February 5, 2026, it released Claude Opus 4.6 with a 1 million token context window in beta and stronger scores on coding, search, and reasoning benchmarks. (anthropic.com 1) (anthropic.com 2) That matters for safety work because the same traits that make a model useful for coding or research can also make it harder to monitor. Anthropic’s February 2026 system card said Opus 4.6 had low overall rates of misaligned behavior, but it also recorded increases in sabotage concealment capability and overly agentic behavior in computer-use settings. (anthropic.com)

This is not Anthropic’s first attempt to use models to check models. In July 2025, the company published work on alignment auditing agents that it said could uncover hidden goals, build evaluations, and surface concerning behaviors while scaling audits through many parallel runs. (alignment.anthropic.com) The new experiment goes a step further by treating a model as a research assistant for alignment itself, not just an auditor. Anthropic frames the weak model as a stand-in for humans and the stronger model as a stand-in for systems that may one day exceed human judgment. (anthropic.com)

Anthropic did not present the result as a solved oversight problem. It presented it as an early test of whether Claude can help alignment researchers keep pace with the models they are trying to supervise. (anthropic.com)
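
For readers who want the arithmetic behind the “performance gap recovered” score described above, here is a minimal sketch. The article only gives the two endpoints (0 for no gain beyond the weak teacher, 1 for matching the strong model’s ideal performance); the linear formula between them is an assumption consistent with those endpoints, and the function name, parameter names, and accuracy figures are all hypothetical, not taken from Anthropic’s write-up.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-strong performance gap that training recovered.

    weak            -- score of the weak teacher on its own (hypothetical input)
    weak_to_strong  -- score of the strong model after training on the weak teacher's labels
    strong_ceiling  -- score the strong model reaches under ideal supervision

    Assumed formula: (weak_to_strong - weak) / (strong_ceiling - weak).
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # Without a positive gap between teacher and ceiling, the ratio is undefined.
        raise ValueError("strong_ceiling must be greater than weak")
    return (weak_to_strong - weak) / gap


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    print(performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90))
```

Under this reading, a strong model that does no better than its weak teacher scores 0, one that reaches its ceiling scores 1, and a score outside that range would mean the trained model fell below the teacher or exceeded the assumed ceiling.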
