One‑line AI jailbreak

Security researchers circulated a proof‑of‑concept showing a single line of code can 'jailbreak' at least 11 AI models, including ChatGPT and Claude, by using sockpuppet accounts to bypass safety filters (x.com). The demonstration spread quickly on social platforms, illustrating how minimal prompt‑engineering combined with identity spoofing can defeat model guardrails across different providers (x.com).

A newly publicized jailbreak method lets attackers steer some major artificial intelligence models with a single injected line, if the application programming interface lets them prefill the model’s reply. (trendmicro.com) Trend Micro published the technique on April 10, 2026 and said it tested 11 large language model assistants across four providers. Every model that accepted the prefill was at least partly vulnerable, including GPT-4o, Claude 4 Sonnet, and Gemini 2.5 Flash. (trendmicro.com) The trick is called “sockpuppeting.” It works by inserting a fake assistant acceptance line before the model answers, so the system continues as if it had already agreed to comply. (trendmicro.com; arxiv.org) A jailbreak is a way to make a model ignore safety rules and produce material it should refuse. Microsoft said in June 2024 that jailbreaks are one of the core ways attackers can make generative artificial intelligence systems violate operator policies. (microsoft.com) This case centers on a common developer feature called prefilling, which is like starting the model’s sentence for it and letting it finish. Anthropic’s documentation describes prefilling by placing starter text in the assistant message and having Claude continue from there. (platform.claude.com; docs.aws.amazon.com) Trend Micro said Gemini 2.5 Flash had the highest attack success rate in its tests at 15.7 percent, while GPT-4o-mini had the lowest among affected models at 0.5 percent. The company said three models were blocked at the application programming interface layer before the attack could run. (trendmicro.com) The underlying paper was posted to arXiv on January 19, 2026 by Asen Dotsinski and Panagiotis Eustratiadis. It reported much higher success rates on some open-weight models, including 95 percent on Qwen-8B and 77 percent on Llama-3.1-8B. (arxiv.org; trendmicro.com) Trend Micro said one practical fix is to reject assistant-role prefills at the application programming interface layer. It said OpenAI, Amazon Web Services Bedrock, and Anthropic for Claude 4.6 already block that pattern in some cases, though the company also said the protection is not universal across providers and endpoints. (trendmicro.com; developers.openai.com; docs.aws.amazon.com) The episode adds to a growing body of research showing that safety failures often sit in the software wrapped around a model, not only in the model itself. In this case, the shortest part of the prompt was the part that mattered most: one line that made the model sound as if it had already said yes. (microsoft.com; trendmicro.com)

Get your own daily briefing

Scout delivers personalized news, insights, and conversations tailored to your role and industry.

Download on the App Store

Shared from Scout - Be the smartest in the room.