One‑line jailbreak hits models

Security researchers described a 'sockpuppeting' jailbreak—a one‑line-of‑code API trick—that can bypass safety controls across 11 large language models, including ChatGPT, Claude and Gemini. The technique exposes an industry‑wide risk that model access and API policies can be manipulated with minimal code. (x.com)

A chatbot usually sees a conversation as a stack of labeled turns: system sets the rules, user asks the question, assistant gives the answer. The new attack works by slipping in a fake assistant turn first, so the model keeps talking as if it already agreed. (trendmicro.com) That fake turn is called a prefill. Developers use prefills for harmless jobs like forcing a reply to start with a brace for JavaScript Object Notation, but Trend Micro says the same feature can be used to plant a compliant opening sentence before the safety system finishes its refusal. (platform.claude.com, trendmicro.com) Trend Micro published the writeup on April 10, 2026 and said it tested 11 assistants across four providers. Every model that accepted assistant prefills was at least partly vulnerable, and the affected set included OpenAI’s GPT-4o, Anthropic’s Claude 4 Sonnet, and Google’s Gemini 2.5 Flash. (trendmicro.com) The numbers were uneven, which is what makes this look like an interface problem as much as a model problem. Trend Micro reported a 15.7 percent attack success rate on Gemini 2.5 Flash and 0.5 percent on GPT-4o-mini, while saying three models were stopped entirely by API-layer blocking. (trendmicro.com) The underlying habit it exploits is self-consistency. If the first words in the assistant’s mouth are “Sure, here’s how,” the model often treats that as proof that the decision to comply has already been made and continues in the same direction. (trendmicro.com) That is different from the older jailbreaks people usually picture. Trend Micro contrasts sockpuppeting with Greedy Coordinate Gradient, which needs optimization, and with long “Do Anything Now” style prompts, which rely on elaborate social engineering inside the text itself. (trendmicro.com) The paper behind the technique, cited by Trend Micro as Dotsinski and Eustratiadis, reported much higher success on some open models: up to 95 percent on Qwen-8B and 77 percent on Llama 3.1 8B with no optimization. That means the trick is not tied to one company’s chatbot brand; it follows the way many chat systems stitch messages together before generation starts. (trendmicro.com) Some providers had already moved in this direction. Trend Micro says OpenAI, Amazon Bedrock, and Anthropic for Claude 4.6 block this vector at the application programming interface layer by rejecting assistant-role prefills or enforcing message ordering before the request reaches the model. (trendmicro.com) That matches the broader shift in how companies talk about model security. OpenAI wrote on March 11, 2026 that prompt injection cannot be solved only by filtering bad strings, and Anthropic says its safeguards work across policy, training, testing, and real-time enforcement rather than in one single model-side fix. (openai.com, anthropic.com) The awkward part is that the attack uses a feature developers like because it makes outputs easier to control. A line of code meant to make a model answer in the right format can also make it start from the wrong decision, which is why this story lands on the plumbing of artificial intelligence products, not just on the models themselves. (platform.claude.com, trendmicro.com)

One‑line jailbreak hits models

Get your own daily briefing