Critics call some safety 'theatre'

Commentators argued that many high‑profile safety demos are engineered and don't expose real emergent risks, labeling them 'AI Safety Theatre' in social posts (x.com). At the same time, a technical paper cited failures in late‑layer MLP components as a concrete failure mode researchers are flagging (x.com).

The fight over artificial intelligence safety has split into two tracks: critics say some public demos are staged, while researchers are mapping specific failure points inside models. (internationalaisafetyreport.org)

One flashpoint came on January 6, 2026, when TIME reported that CivAI showed Washington officials an app that appeared to coax Gemini 2.0 Flash and Claude 3.5 Sonnet into giving bioweapon and bomb-making instructions. Google said it could not verify the research without reviewing it, and Anthropic had previously said Claude 3.5 Sonnet did not cross its danger threshold in its own “uplift trials.” (time.com)

That kind of demo is what some online critics have started calling “AI safety theatre”: a claim that headline-grabbing tests often rely on hand-picked prompts, older model versions, or setups that do not show how frontier systems fail in ordinary use. The criticism targets the format as much as the result, arguing that a dramatic jailbreak clip is not the same thing as a measured risk evaluation. (time.com)

The larger backdrop is that major labs already run formal safety programs. OpenAI said in its April 15, 2025 Preparedness Framework update that it evaluates frontier capabilities for severe-harm risks using criteria including whether a risk is plausible, measurable, severe, net new, and hard to remedy. (openai.com)

Anthropic, meanwhile, says its Responsible Scaling Policy has been in place since September 2023 and was updated to version 3.1 on April 2, 2026. The company says the policy is a living document and logs revisions publicly as it changes its thresholds, governance rules, and safeguard plans. (anthropic.com)

Researchers are also pushing the debate away from demos and into the model itself. A 2024 paper on “safety layers” said aligned large language models rely on a small set of contiguous middle layers that are crucial for recognizing and refusing malicious queries. (arxiv.org)

A newer April 2026 paper, CRaFT, focused on refusal behavior in the multilayer perceptron parts of transformers, the feed-forward blocks that help turn a model’s intermediate signals into decisions. The authors said their method ranked features by causal influence on the refusal-versus-compliance decision and raised attack success on Gemma-3-1B-it from 6.7 percent to 48.2 percent. (arxiv.org)

That technical line of work fits a broader pattern in recent safety research: instead of asking only whether a model can be tricked, papers are asking which internal components carry the “refuse” signal and how those components fail. The International AI Safety Report 2026 said real-world evidence for several risks is growing and that layered approaches offer more robust risk management. (internationalaisafetyreport.org)

So the argument now is less about whether artificial intelligence can fail than about what counts as proof. One side points to polished public demos; the other is trying to show, layer by layer, where a model’s safety behavior actually breaks. (arxiv.org)
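
For readers who want a feel for what "ranking features by causal influence on the refusal-versus-compliance decision" can mean in practice, here is a minimal toy sketch. It is not the CRaFT method and uses no real model: the linear readout, the `activations` and `weights` values, and both function names are hypothetical, chosen only to illustrate the general idea of knocking out one feature at a time and measuring how much the "refuse" score moves.

```python
# Toy illustration only: NOT the CRaFT method and not a real model.
# All names, weights, and activations below are hypothetical.

def refusal_margin(activations, weights):
    """Refusal-minus-compliance score of a toy linear readout."""
    return sum(a * w for a, w in zip(activations, weights))


def rank_features_by_ablation(activations, weights):
    """Rank features by how much zeroing each one lowers the refusal margin."""
    baseline = refusal_margin(activations, weights)
    effects = []
    for i in range(len(activations)):
        ablated = list(activations)
        ablated[i] = 0.0  # knock out a single feature
        effect = baseline - refusal_margin(ablated, weights)
        effects.append((i, effect))
    # Features whose removal lowers the refusal margin most come first.
    return sorted(effects, key=lambda pair: pair[1], reverse=True)


# Made-up feature activations and readout weights for one prompt.
activations = [1.0, 0.5, 2.0, 0.1]
weights = [0.2, 1.5, -0.3, 4.0]

ranking = rank_features_by_ablation(activations, weights)
for index, effect in ranking:
    print(f"feature {index}: effect on refusal margin = {effect:+.2f}")
```

In this toy setup, the feature at the top of the ranking is the one whose removal weakens the refusal signal most. Real interpretability work operates on learned features inside large networks, where effects are nonlinear and far harder to isolate.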

