Anthropic tests alignment researcher
Anthropic published Fellows-program research on Automated Alignment Researchers, agents built on Claude Opus 4.6 that run alignment experiments. The work tests whether such agents can speed up research on weak-to-strong supervision, in which a weaker model supervises a stronger one. (x.com)
Anthropic said on April 14 that an internal Fellows-program project used Claude Opus 4.6 to run alignment experiments and found the system could speed up a narrow safety research task. (anthropic.com) The task was “weak-to-strong supervision,” a setup where a weaker model acts like a teacher for a stronger one, and researchers measure how much of the stronger model’s potential performance can still be recovered. Anthropic calls that score “performance gap recovered,” or PGR, on a 0-to-1 scale. (anthropic.com)
Anthropic said two human researchers working for seven days closed 23% of that gap, while nine parallel “Automated Alignment Researchers” built on Claude Opus 4.6 closed 97%. The company said the model agents had extra tools and worked in separate sandboxes. (x.com, anthropic.com)
The underlying problem is simple to state and hard to solve: if future systems become too capable for people to reliably judge every answer, researchers need other ways to check whether those systems are still following instructions. Anthropic frames weak-to-strong supervision as a practical stand-in for that larger “scalable oversight” problem. (anthropic.com)
Anthropic said the best method found by the automated researchers also worked on two unseen datasets covering coding and math, while the second-best method generalized only to math. The company also said the system is not a general-purpose alignment scientist and would struggle more on “fuzzier” research tasks. (x.com)
The project came out of the Anthropic Fellows Program, which funds four-month research projects tied to the company’s safety agenda. Anthropic said more than 80% of fellows in its first cohort produced papers and more than 40% later joined the company full-time. (alignment.anthropic.com)
Claude Opus 4.6 is Anthropic’s top model and was released on February 5, 2026. Anthropic markets it for coding, long task chains, and agent workflows, which are the kinds of capabilities the company is now testing on alignment work itself. (anthropic.com)
The release also fits a broader Anthropic push toward automated safety tooling. In March, Fellows-program research introduced an “Automated Alignment Agent,” or A3, that Anthropic said could automatically generate data and fine-tune models to reduce failures such as sycophancy, political bias, and jailbreak-related behavior. (alignment.anthropic.com)
Anthropic’s new result is narrower than a claim that models can align themselves end to end. The company said the experiment shows Claude can increase the rate of experimentation and exploration on a specific benchmark, not replace human judgment across alignment research. (anthropic.com, x.com)
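For readers who want to see how a score like PGR is computed, here is a minimal sketch. It assumes the definition commonly used in weak-to-strong supervision work: the share of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The article does not spell out Anthropic's exact formula, so the function name, its parameters, and the accuracy figures below are all hypothetical and chosen only for illustration.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised strong model.

    weak:            score of the weak supervisor on its own (assumed input)
    weak_to_strong:  score of the strong model trained on the weak supervisor's labels
    strong_ceiling:  score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed the weak supervisor's score")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies for illustration only; these are not figures from the research.
weak_acc = 0.60
ceiling_acc = 0.80

# A strong model that merely matches its weak teacher recovers none of the gap.
print(performance_gap_recovered(weak_acc, 0.60, ceiling_acc))

# A strong model that reaches its ceiling despite weak labels recovers all of it.
print(performance_gap_recovered(weak_acc, 0.80, ceiling_acc))
```

Under this assumed definition, the reported 23% and 97% correspond to PGR values of 0.23 and 0.97 on the 0-to-1 scale.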