New AI Safety Flaw Discovered
Researchers have developed a method called "Boundary Point Jailbreaking" (BPJ), the first automated attack to successfully bypass the safety classifiers of advanced AI models. The technique broke through Anthropic's constitutional classifiers and OpenAI's GPT-5 input classifier after thousands of hours of human red-teaming had failed. The discovery reveals significant vulnerabilities in current AI safety guardrails, posing risks for content moderation and curation systems.
- The attack operates in a "black-box" setting, meaning it needs no internal access to the model and no knowledge of its architecture. The only information it requires is a single bit of feedback per query: whether the input was blocked.
- Before this automated attack, Anthropic's Constitutional Classifiers had resisted over 3,700 hours of intensive human-led red-teaming, which found only one fully successful jailbreak.
- The discovery was made and disclosed by the UK's AI Security Institute (AISI, formerly the AI Safety Institute), whose Red Team has spent the last two years testing models from top AI companies to find and help fix vulnerabilities.
- Boundary Point Jailbreaking works by creating a "curriculum" of attack prompts, starting with heavily disguised harmful text and progressively making it clearer. This lets the attack gradually locate the "boundary" of what the safety classifier will flag as harmful.
- Researchers note that defending against this type of attack is difficult for systems that evaluate only one user interaction at a time. They recommend moving towards "batch-level monitoring" to detect suspicious patterns, such as a high number of flagged queries coming from a single account.
- Anthropic's "Constitutional AI" is a safety approach in which the model is trained on a list of principles (the "constitution") that defines acceptable and unacceptable content. That constitution is in turn used to train the safety classifiers that were bypassed.
- OpenAI's GPT-5 classifier is part of a multi-layered "defense-in-depth" safety system. In October 2025, the company released open-weight versions of its safety models, called gpt-oss-safeguard, to allow developers to build their own safety classifiers.
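
To make the "batch-level monitoring" recommendation concrete, here is a minimal Python sketch of one way a defender might track flagged queries per account. It is an illustration only, not the researchers' or any vendor's implementation: the `FlagRateMonitor` class, its method names, and all three thresholds are hypothetical assumptions chosen for the example.

```python
from collections import defaultdict, deque


class FlagRateMonitor:
    """Tracks recent classifier verdicts per account (hypothetical design)."""

    def __init__(self, window=200, min_queries=50, max_flag_rate=0.3):
        # All three thresholds are illustrative assumptions,
        # not values taken from the research.
        self.min_queries = min_queries
        self.max_flag_rate = max_flag_rate
        # Each account keeps only its most recent `window` verdicts.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, account_id, was_flagged):
        """Store one verdict; return True if the account now looks suspicious."""
        self._history[account_id].append(bool(was_flagged))
        return self.is_suspicious(account_id)

    def is_suspicious(self, account_id):
        """True when enough queries were seen and too many of them were flagged."""
        history = self._history.get(account_id)
        if not history or len(history) < self.min_queries:
            return False
        return sum(history) / len(history) > self.max_flag_rate


# Usage: an account whose recent queries are mostly blocked trips the alert.
monitor = FlagRateMonitor(window=10, min_queries=5, max_flag_rate=0.5)
for flagged in [True, True, False, True, True]:
    alert = monitor.record("acct-123", flagged)
print(alert)  # True: 4 of the 5 recorded queries were flagged
```

The point of the design is that the signal comes from the pattern across many queries rather than from any single one, which is the property a per-interaction classifier lacks. A real deployment would likely need more than a flag rate, for example handling attackers who spread queries across several accounts.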