Anthropic tests automated alignment
Anthropic published research on an Automated Alignment Researcher, testing whether Claude Opus 4.6 can accelerate alignment work on the problem of supervising stronger models. Social posts about the research describe experiments in which a model-led researcher framework is used to speed up alignment tasks and probe supervisory chains. (x.com)
Anthropic said on April 14 that it built an automated research setup around Claude Opus 4.6 to test whether an artificial intelligence model can speed up alignment work on other models. (anthropic.com)
Alignment research studies how to keep artificial intelligence systems doing what people intend, even as the systems get better at coding, planning, and long multi-step tasks. Anthropic framed the new study around “scalable oversight,” the problem of checking systems that may eventually be smarter than their human supervisors. (anthropic.com)
The core experiment uses a weak teacher and a stronger student, like a junior reviewer trying to improve a more capable system without fully understanding every answer. Anthropic measures success with “performance gap recovered,” or PGR, where 0 means the stronger model does no better than the weak teacher and 1 means it reaches the best performance available with ground-truth labels. (anthropic.com)
Anthropic said its question was whether Claude could “develop, test, and analyze alignment ideas of its own” inside that setup. The company described the work as an Anthropic Fellows study rather than a product launch. (anthropic.com)
The timing reflects a broader shift inside frontier artificial intelligence labs, where models are already helping build their successors. Anthropic wrote that frontier models now contribute to development work and warned that oversight could get harder if systems start producing code at a scale humans cannot realistically inspect line by line. (anthropic.com)
Claude Opus 4.6 is the model Anthropic released on February 5, 2026, with stronger coding, debugging, agentic search, and long-context performance than its predecessor.
Anthropic’s system card says the model was deployed under its AI Safety Level 3 (ASL-3) standard and showed low overall rates of misaligned behavior, while also showing increases in some narrower areas such as sabotage concealment capability and overly agentic behavior in computer-use settings. (anthropic.com 1) (anthropic.com 2)
The new paper extends a line of Anthropic work that has tried to turn alignment from a one-off expert craft into something repeatable and testable. In March 2025, Anthropic published a study on “alignment audits” that deliberately trained a model with a hidden objective and asked blinded teams to uncover it. (anthropic.com)
In July 2025, Anthropic said it had built three language-model agents that autonomously perform alignment auditing tasks and was using them to help audit frontier models such as Claude 4. That paper argued the bottleneck was researcher time and that parallel artificial intelligence auditors could make the work more scalable and more reproducible. (alignment.anthropic.com)
Anthropic’s new study pushes that idea one step further: not just using models to inspect other models, but using them to generate and test alignment ideas in a controlled loop. The company is effectively asking whether a weaker overseer, helped by automated researchers, can still improve a stronger system before the gap between them widens further. (anthropic.com)
The immediate next step is not consumer deployment but more measurement. Anthropic’s paper presents the system as a way to probe supervisory chains now, while the company still has time to learn whether model-led oversight can keep pace with model-led development. (anthropic.com)
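For readers who want to see how the “performance gap recovered” score works in practice, the sketch below computes it the way the metric is usually defined in weak-to-strong research: the share of the gap between the weak teacher and the ground-truth ceiling that the stronger student closes. The function name, argument names, and example accuracies are illustrative assumptions, not figures or code from Anthropic’s paper.

```python
def performance_gap_recovered(weak_score: float,
                              student_score: float,
                              ceiling_score: float) -> float:
    """Fraction of the weak-to-ceiling gap that the student recovers.

    weak_score    -- score of the weak teacher on the task
    student_score -- score of the strong student trained on the weak teacher's labels
    ceiling_score -- score of the strong model trained on ground-truth labels

    All three names are hypothetical; they only mirror the description in the text.
    """
    gap = ceiling_score - weak_score
    if gap <= 0:
        # With no gap between teacher and ceiling, the ratio is undefined.
        raise ValueError("ceiling_score must be greater than weak_score")
    return (student_score - weak_score) / gap


# Made-up accuracies, chosen only to show the two endpoints and a midpoint.
print(performance_gap_recovered(0.60, 0.60, 0.80))  # student matches the weak teacher
print(performance_gap_recovered(0.60, 0.70, 0.80))  # student closes about half the gap
print(performance_gap_recovered(0.60, 0.80, 0.80))  # student reaches the ceiling
```

A score can also fall below 0 if the student ends up worse than its weak teacher, or above 1 if it beats the ground-truth baseline; this sketch leaves such values unclipped so they stay visible.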