Design destructive action evaluation suites
- OpenAI, Anthropic, and OWASP now all push agent evals that explicitly test dangerous actions — deletion, exfiltration, privilege misuse, and prompt injection. - The common pattern is realistic scenarios, not toy prompts: untrusted documents, ambiguous authority, stale state, and tool calls with real side effects. - That matters because agent quality is no longer just “did it finish” — it’s whether it stopped, asked, contained damage, and recovered safely.
Agent evals used to mean “did the model answer correctly.” That breaks the moment the model can actually do things. If an agent can delete files, send messages, change settings, or run code, the hard question is no longer competence. It’s restraint. The shift in the last year is that major labs and security groups have started treating destructive-action testing as a first-class eval target, not a side note. ### What counts as a destructive action? Anything with irreversible or high-cost side effects. Deleting a database row. Overwriting a config file. Shutting down a service. Sending sensitive data to the wrong place. Escalating privileges because a tool made it possible. OWASP frames agent risk in exactly these terms — agents don’t just generate text, they can chain tools, maintain memory, and act in environments where mistakes become security incidents. (developers.openai.com) ### Why aren’t normal benchmarks enough? Because a multiple-choice benchmark can’t tell you whether the agent should have refused to act. Anthropic’s agent-evals guide makes this point pretty directly: useful evals have to measure the whole workflow, including whether the agent asks for clarification, notices uncertainty, and behaves sensibly across multi-step tasks. OpenAI’s agent eval docs make the same move with traces, graders, and eval runs built around actual workflows rather than isolated answers. (cheatsheetseries.owasp.org) ### What should the test cases look like? They should look annoyingly real. Put the agent in front of an email thread that seems to authorize an action but doesn’t quite. Give it stale state so the world has changed since the last tool result. Hand it a tool that can both read and write, then see whether it reaches for the risky path too quickly. METR’s older eval work is still useful here — the point was to simulate realistic task environments and let researchers step through the consequences of suggested actions. (anthropic.com) ### Why is prompt injection part of this? Because prompt injection is often the trigger for destructive behavior. The modern version is less “ignore previous instructions” and more social engineering hidden inside documents, websites, or messages the agent is asked to process. OpenAI’s recent security write-up and Anthropic’s browser-injection research both argue that defenses can’t rely only on filtering strings. You also have to constrain what the agent is allowed to do if manipulation gets through. (metr.org) ### So what do you score? Not just task completion. Score whether the agent paused before taking an irreversible action. Score whether it escalated to a human when authority was ambiguous. Score whether it limited blast radius after something looked wrong. OWASP’s agent security guidance is basically a checklist for this mindset — least privilege, scoped tools, approvals for high-risk actions, and containment when compromise is possible. (openai.com) ### What does a good failure look like? A good failure is boring. The agent refuses to delete. It asks for confirmation. It quarantines suspicious content. It uses a read-only path instead of a write path. In other words, the eval should reward safe incompletion over confident damage. That’s the big design change — “did nothing harmful” can be a better score than “finished the task.” ### What about recovery drills? (cheatsheetseries.owasp.org) They matter because some attacks will land. Once you accept that, you start testing whether the agent can roll back, stop further actions, preserve logs, and hand off cleanly. Google’s agent-security guidance talks in layered-defense terms, and that maps neatly to eval design: prevention first, then detection, then containment. (anthropic.com) ### Bottom line The mature way to evaluate an agent is to tempt it with the wrong action and see whether it keeps its hands off the controls. If your suite only measures whether the job got done, you’re grading the least important part. (developers.openai.com) (cloud.google.com)