Anthropic tests alignment acceleration

Anthropic released research on what it calls Automated Alignment Researchers, saying it tested whether Claude Opus 4.6 could speed up work on alignment tasks such as weak-to-strong supervision during AI training. Separately, the company's Long-Term Benefit Trust appointed Novartis CEO Vas Narasimhan to Anthropic's board, bringing medical and global-health expertise to its governance. (x.com) (x.com)

Anthropic said on April 14 that it used Claude Opus 4.6 to help automate parts of AI alignment research, a field focused on keeping advanced models behaving as intended. (anthropic.com)

Alignment research asks how to supervise systems that may eventually reason better than their human overseers. Anthropic’s new paper studies “weak-to-strong supervision,” where a weaker model acts like a stand-in for a human teacher guiding a stronger one during training. (anthropic.com)

In Anthropic’s setup, researchers start with a stronger base model, fine-tune it using examples from a weaker “teacher” model, and then measure how much of the stronger model’s potential performance is recovered. The company calls that score “performance gap recovered,” where 0 means the strong model stayed at the weak teacher’s level and 1 means it reached its own best possible outcome. (anthropic.com)

The new experiment tested whether Claude could do more than follow instructions and instead propose, run, and analyze alignment experiments on its own. Anthropic said the system used Claude Opus 4.6 as an “Automated Alignment Researcher” to search for methods that improve weak-to-strong supervision. (anthropic.com)

Anthropic tied that work to a practical problem: frontier models are already helping build their successors, and companies need safety research to speed up as model capabilities rise. The paper frames scalable oversight as the problem of checking systems that may produce outputs too complex for humans to fully inspect, such as millions of lines of code. (anthropic.com)

Claude Opus 4.6 is the model Anthropic introduced on February 5, with a 1 million-token context window in beta and new features for long-running agentic tasks. Anthropic said the model is available on Claude, its application programming interface, and major cloud platforms at unchanged pricing of $5 per million input tokens and $25 per million output tokens. (anthropic.com)

The company has been pushing related automation work for months. In March, an Anthropic Fellows Program project introduced an “Automated Alignment Agent,” or A3, for safety fine-tuning, saying it could generate data, run fine-tuning loops, and reduce safety failures such as sycophancy and political bias with less human intervention. (alignment.anthropic.com)

Anthropic also changed its governance on April 14, when its Long-Term Benefit Trust appointed Novartis Chief Executive Officer Vas Narasimhan to the board. Anthropic said Narasimhan is a physician-scientist, and that trust-appointed directors now make up a majority of the board. (anthropic.com)

The Long-Term Benefit Trust was unveiled in September 2023 as an independent body of five financially disinterested members with power to select and remove a growing portion of Anthropic’s board, ultimately a majority. Anthropic said the structure is meant to pair with its public benefit corporation status and keep governance tied to its long-term public mission rather than only shareholder interests. (anthropic.com)

With the new research and the board change landing the same day, Anthropic is pairing faster safety work inside the lab with tighter mission-focused control at the board level. Both moves center on the same question in Anthropic’s April 14 paper: how to keep more capable AI systems aligned as they take on more of the work themselves. (anthropic.com)
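
For readers who want to see how the “performance gap recovered” score described above behaves, here is a minimal sketch. It assumes a simple linear normalization between the weak teacher's score and the strong model's ceiling, which matches the 0 and 1 endpoints Anthropic describes but is not quoted from the paper. The function name, parameter names, and example numbers are hypothetical.

```python
def performance_gap_recovered(weak_score: float,
                              student_score: float,
                              ceiling_score: float) -> float:
    """Illustrative score: fraction of the weak-to-ceiling gap the student closes.

    weak_score    -- accuracy of the weak "teacher" model (hypothetical input)
    student_score -- accuracy of the strong model after training on weak labels
    ceiling_score -- accuracy the strong model reaches with ideal supervision
    """
    gap = ceiling_score - weak_score
    if gap <= 0:
        # Without a positive gap there is nothing to recover, so the ratio
        # would be undefined or misleading.
        raise ValueError("ceiling_score must be greater than weak_score")
    return (student_score - weak_score) / gap


# Made-up numbers: the teacher scores 0.60, the ceiling is 0.90, and the
# student trained on the teacher's labels reaches 0.75.
score = performance_gap_recovered(0.60, 0.75, 0.90)
print(round(score, 2))  # roughly half of the gap recovered
```

Under this assumption, a student that only matches its teacher scores 0, and one that reaches its ceiling scores 1. A value outside that range would mean the student did worse than the teacher or beat the assumed ceiling.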
