Anthropic’s Auto‑Research update
Anthropic announced new research on an “Automated Alignment Researcher” that uses Claude Opus 4.6 to study how weaker supervisors can train stronger AI systems. The company posted details of the project on X, where the announcement drew wide engagement. (x.com)
Anthropic said on April 14 that it built an “Automated Alignment Researcher” using Claude Opus 4.6 to help study how weaker supervisors can train stronger artificial intelligence systems. (anthropic.com)
The project targets a problem called weak-to-strong supervision: a weaker judge, standing in for humans, tries to guide a stronger model on tasks the judge cannot fully evaluate. Anthropic framed that as a core obstacle to aligning future systems that could exceed human abilities. (anthropic.com; arxiv.org)
Anthropic said it tested nine parallel research agents and measured how much of the “performance gap” they could recover between a weak model’s labels and a stronger model trained on ground-truth labels. In the company’s reported experiment, the automated setup recovered 97% of that gap over cumulative research time. (anthropic.com)
The system did not train models from scratch on its own. Anthropic described it as a research workflow that proposes experiments, runs code, analyzes results, and iterates on methods for supervising stronger models with weaker feedback. (anthropic.com)
Alignment research is the field that asks how to keep an artificial intelligence system following human intent when the system is optimizing hard tasks on its own. Anthropic researcher Jan Leike says his team is working on “how to align an automated alignment researcher,” alongside scalable oversight and jailbreak robustness. (jan.leike.name)
Anthropic has been building toward this line of work for more than a year. In 2024, OpenAI researchers including Leike published work showing that stronger models can outperform weaker supervisors but still fall short of their full potential under naive training. (arxiv.org)
Anthropic has also been expanding automated safety work around its models. In July 2025, the company published research on “auditing agents” that investigate hidden goals and flaws in models during alignment assessments. (venturebeat.com; anthropic.com)
The model behind the new project, Claude Opus 4.6, was released in February 2026 and is Anthropic’s top-tier system for complex, long-running work. Anthropic’s system card said Opus 4.6 was deployed with AI Safety Level 3 safeguards. (anthropic.com; www-cdn.anthropic.com)
Other labs are also pushing on external and cross-lab safety checks. OpenAI and Anthropic published a joint evaluation exercise in August 2025 that tested each other’s publicly released models for misalignment, hallucinations, and jailbreak resistance. (openai.com)
Anthropic’s update points to a future where models do more of the safety research on other models. The open question, which Anthropic and other labs keep returning to, is whether those automated researchers can be trusted as the systems they study grow more capable. (anthropic.com; jan.leike.name)
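
For readers who want the arithmetic behind a figure like “97% of the gap,” the sketch below shows one common way such a ratio is computed in the weak-to-strong literature. It is an illustration only: the article does not give Anthropic’s exact formula, the function name `performance_gap_recovered` is hypothetical, and the accuracy values are made up to show the calculation, not taken from the reported experiment.

```python
def performance_gap_recovered(weak: float,
                              weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by weak supervision.

    weak            -- score of the weak supervisor on its own
    weak_to_strong  -- score of the strong model trained on weak labels
    strong_ceiling  -- score of the strong model trained on ground truth
    (All three names are assumptions made for this illustration.)
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed the weak supervisor's score")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
pgr = performance_gap_recovered(weak=0.60,
                                weak_to_strong=0.794,
                                strong_ceiling=0.80)
print(f"{pgr:.2f}")
```

A value of 0 would mean the strong model did no better than its weak supervisor, and a value of 1 would mean it matched a model trained on ground-truth labels.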