OpenAI Releases EVMbench to Test AI for Smart Contract Security
OpenAI has launched “EVMbench,” a benchmark designed to test the ability of AI agents to detect, patch, and exploit smart contract vulnerabilities. The tool compares different AI models across security scenarios, addressing the need for better auditing tools after more than $3.4 billion was lost to crypto hacks in the last year. This marks a significant step in developing AI-powered infrastructure for on-chain security.
- EVMbench was developed as a collaboration between OpenAI and crypto venture capital firm Paradigm. The benchmark's dataset includes 120 curated, high-severity vulnerabilities sourced from 40 real-world audits, primarily from public competitions like Code4rena and from security work done for the Tempo blockchain.
- The benchmark evaluates AI models on three distinct tasks: `Detect`, where the agent must identify known flaws; `Patch`, where it must fix a vulnerability without breaking functionality; and `Exploit`, where it must execute a fund-draining attack in a sandboxed environment.
- In initial tests, OpenAI's GPT-5.3-Codex model successfully executed exploits against 72.2% of the vulnerable contracts, up from the 31.9% scored by an earlier GPT-5 model six months prior.
- While the AI showed strong performance in exploitation, its capabilities for detection and patching are less developed. Models often stop after finding a single issue rather than performing a comprehensive audit, and they struggle to fix subtle bugs without breaking the contract's core logic.
- Paradigm has extended the open-source benchmark framework into an auditing agent that developers can use. The project's stated goal is to accelerate a future where a growing portion of smart contract audits are performed by AI agents.
- The initiative is part of a broader effort by OpenAI to bolster cyber defenses using AI. The company has committed $10 million in API credits for defensive security projects and is expanding the private beta for Aardvark, its AI security research agent.
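
To make the three-task scoring concrete, the sketch below tallies a success rate per task mode and then compares the two exploit scores reported above. It is an illustration only, not EVMbench's actual harness or API: the `TaskResult` record, the `success_rates` helper, the vulnerability identifiers, and the sample outcomes are all hypothetical. Only the 72.2% and 31.9% figures come from the article.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three task modes named in the article (lowercased for this sketch).
MODES = ("detect", "patch", "exploit")


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt. Hypothetical record shape, not EVMbench's schema."""
    vuln_id: str   # made-up identifier for a vulnerability
    mode: str      # one of MODES
    success: bool  # whether the agent completed the task


def success_rates(results):
    """Return the fraction of successful attempts for each mode that has data."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for r in results:
        if r.mode not in MODES:
            raise ValueError(f"unknown mode: {r.mode}")
        totals[r.mode] += 1
        wins[r.mode] += int(r.success)
    return {m: wins[m] / totals[m] for m in MODES if totals[m]}


# Invented outcomes, purely to exercise the helper.
results = [
    TaskResult("vuln-001", "exploit", True),
    TaskResult("vuln-002", "exploit", True),
    TaskResult("vuln-003", "exploit", False),
    TaskResult("vuln-001", "detect", True),
    TaskResult("vuln-002", "detect", False),
    TaskResult("vuln-001", "patch", False),
]
rates = success_rates(results)

# Exploit scores as reported in the article.
newer_score = 72.2  # GPT-5.3-Codex, percent
older_score = 31.9  # earlier GPT-5 model, percent
gain_points = round(newer_score - older_score, 1)  # percentage-point gain
ratio = newer_score / older_score                  # relative improvement

print(rates)
print(f"gain: {gain_points} points, ratio: {ratio:.2f}x")
```

Reporting a separate rate per mode, rather than one blended score, keeps the gap the article describes visible: a model can score well on `Exploit` while lagging on `Detect` and `Patch`.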