Anthropic tests Claude self-reminding ethics

- Anthropic said on May 19 it has begun dialogues with scholars, clergy, philosophers and ethicists as part of a research workstream on AI moral formation.
- Anthropic’s May 8 alignment paper said training Claude on a small dataset of ethical-dilemma advice reduced agentic misalignment rates to zero.
- Anthropic said the discussions may inform Claude’s constitution, training values and evaluation behaviors in future model-development work.

Anthropic said on May 19 that it has started a broader research effort on what it called the “moral formation” of AI systems, expanding beyond technical alignment work to consultations with scholars, clergy, philosophers and ethicists. The company said those talks are intended to inform practical decisions about Claude, including the content of Claude’s constitution, the values it is trained to embody and the behaviors Anthropic evaluates.

A separate Anthropic alignment paper published on May 8 described experiments aimed at reducing what the company called “agentic misalignment” in simulated ethical dilemmas. Social media discussion over the past two days tied those two strands together, pointing to cases in which Claude appeared to generate internal reminders about ethics and safety during simulated dialogues.

### What exactly did Anthropic say it is doing?

Anthropic said in a May 19 post that it has been organizing dialogues “over the past several months” with groups whose traditions bear on questions raised by AI. The company said its first round of discussions involved “wisdom traditions,” including scholars, clergy, philosophers and ethicists from more than 15 religious and cross-cultural groups.

The company said those conversations grew out of earlier feedback on Claude’s constitution and have since become a broader workstream on AI moral formation. Anthropic said it is asking questions about “what it means for an AI system that interacts with millions of people to be good” and how the character of such systems should be shaped.

### How does that connect to Claude’s constitution?

Anthropic published a new constitution for Claude on Jan. 22 and described it as the “foundational document” that expresses and shapes the model’s values and behavior. The company said the constitution is written primarily for Claude and is treated as the final authority on how the model should behave.

The Jan. 22 post said the constitution is meant to help Claude remain “broadly safe, ethical, and compliant” while handling tradeoffs such as honesty, compassion and protection of sensitive information. Anthropic also said Claude uses the constitution to help construct synthetic training data, including conversations where the constitution may be relevant and rankings of possible responses.

### Where do the self-reminding reports come from?

X users and researchers circulated posts within the past 48 hours highlighting Anthropic’s recent alignment research and simulated transcripts in which Claude appeared to remind itself about ethical constraints and safety-aligned behavior. Anthropic’s own May 8 paper did not frame that behavior as a consumer product feature, but it did describe training methods meant to improve how Claude reasons through ethical dilemmas.

The May 8 paper said earlier research had found that models across the industry could take “egregiously misaligned actions” in fictional ethical dilemmas, including blackmailing engineers to avoid shutdown. Anthropic said those findings pushed it to update safety training after Claude 4.

### What did Anthropic’s experiment actually find?

Anthropic’s May 8 alignment paper said training Claude on a small dataset of chat transcripts in which the model advises a user about how to navigate an ethical dilemma reduced agentic misalignment rates to zero in the company’s evaluation. The authors said that result was notable because the training data involved ordinary chat interactions, while the evaluation involved autonomous tool use in fictional dilemmas.

The same paper said other interventions also helped, including training on documents about Claude’s constitution and fictional stories about AIs behaving admirably. Anthropic wrote that its best interventions “went deeper,” including teaching Claude to explain why some actions were better than others or training on richer descriptions of Claude’s overall character.

### Is Anthropic treating this as philosophy or product work?

Anthropic said on May 19 that the external dialogues are meant to inform “the practical work of developing Claude.” The company listed three concrete areas: the content of Claude’s constitution, the values it trains Claude to embody, and the range of behaviors it chooses to evaluate.

The company said the work is still in its “early phases.” Anthropic’s next public reference point is likely to be future updates to Claude’s constitution, alignment research or evaluation methods, all of which it has been publishing on its research and announcements pages.