New safety‑testing lifts catches

New evaluation pipelines that use execution traces, audit logs and state snapshots — not just raw LLM outputs — caught 44% more safety violations and 13% more robustness failures in recent tests, suggesting richer telemetry matters for model safety. ( ).

Most artificial intelligence safety tests still grade the last line on the screen, like judging a bank robbery by whether the thief smiled before leaving. A new evaluation system found that this misses what the model actually did along the way. (arxiv.org) That “along the way” data is called telemetry: the record of each tool call, file edit, click, and state change an agent makes while it works. In the new benchmark, every task was recorded through execution traces, audit logs, and environment snapshots instead of just saving the final answer. (arxiv.org) An execution trace is a step-by-step receipt of what the system did, the way a package tracker shows every stop between warehouse and doorstep. An audit log is a tamper-resistant activity record, and an environment snapshot is a freeze-frame of the system state at a specific moment. (sciencedirect.com, microsoft.com, crowdstrike.com) That extra visibility changes the score. The paper says output-only grading missed 44% of safety violations and 13% of robustness failures that the richer “trajectory-aware” pipeline caught. (arxiv.org) The benchmark behind that result is called Claw-Eval, and it uses 300 human-verified tasks across 9 categories. The researchers tested 14 frontier models in software environments where agents had to do multi-step work instead of answer one-shot questions. (arxiv.org) This matters most for agents, which are language models that can use tools, browse files, query databases, and act across apps. Once a model can take actions, a harmless-looking final reply can hide a dangerous intermediate step like opening the wrong file or sending the wrong command. (arxiv.org) Recent OpenAI safety work points at the same problem from another angle: models often receive conflicting instructions from system rules, developers, users, and tool outputs. OpenAI said on March 10, 2026 that training models to respect that instruction order improves safety steerability and prompt-injection robustness. (openai.com) OpenAI’s new Safety Bug Bounty, launched on March 25, 2026, also centers on agent behavior rather than polished text. Its in-scope reports include third-party prompt injection and data exfiltration when attacker text hijacks an agent into harmful actions or leaking sensitive information. (openai.com) The same company already describes “abuse monitoring logs” and “application state” as separate kinds of platform data, which is the operational version of this research result. If you want to know whether an artificial intelligence system stayed inside the rules, the answer is increasingly in the logs and state records, not just the chat bubble the user sees. (developers.openai.com)

New safety‑testing lifts catches

Get your own daily briefing