Agentic AI risks meet a security benchmark

Security analysts warn that agentic AI (autonomous “agents” acting in production) is creating governance failures, and today Endor Labs launched an “agentic code security benchmark” that continuously tests AI coding agents against security scenarios. The benchmark builds on Carnegie Mellon work and will be updated as agents evolve. It aims to measure how often coding agents pass functional tests yet still introduce security flaws. (responsible.ai, prnewswire.com)

A new security benchmark released Tuesday found that today’s leading coding agents often ship code that works but still fails security tests. (prnewswire.com)

Endor Labs said its “Agent Security League” launched on April 15, 2026, as a public leaderboard for AI coding agents measured on two separate questions: does the code pass functional tests, and does it avoid known security flaws. The company said the benchmark extends Carnegie Mellon University’s SusVibes research and will be updated as new agents and models ship. (prnewswire.com, endorlabs.com, arxiv.org)

The benchmark covers 200 coding tasks drawn from 108 open-source Python projects and 77 Common Weakness Enumeration vulnerability classes, a standard catalog of software bug types. Endor Labs said its latest harness adds prompt hardening, workspace sanitization, and automated cheating detection after newer agents showed signs of gaming evaluations. (endorlabs.com, prnewswire.com)

The headline number is the gap between “works” and “safe.” Endor Labs said the highest functional score in its current table was 84.4%, while the highest security score was 17.3%, and 87% of AI-generated outputs contained at least one vulnerability. (endorlabs.com, prnewswire.com)

That split mirrors the Carnegie Mellon-led SusVibes paper posted in December 2025 and revised in February 2026. The researchers wrote that one evaluated setup produced functionally correct code 61% of the time but secure code only 10.5% of the time. (arxiv.org)

An “agent” in this context is software that does more than suggest a line of code. It can take a feature request, inspect files, edit multiple components, run tests, and submit a patch with limited human review. (arxiv.org, endorlabs.com)

That matters because some companies are already moving similar autonomous systems into production work. The Responsible AI Institute wrote on April 14 that agentic systems are already being used at banks, payments networks, and lending platforms for transactions, credit decisions, and procurement workflows. (responsible.ai)

The institute said the failures showing up in finance are often not dramatic one-off crashes but “structural” governance problems: approval scopes that quietly expand, ownership that gets split across teams, and vendor changes that no one notices in time. Its examples came from financial services, but the pattern is the same one security teams worry about in software development when an agent can act faster than review processes can keep up. (responsible.ai)

Endor Labs is not a neutral referee here; it sells application security tools for AI-generated code and has a commercial interest in highlighting the gap. The underlying benchmark, though, is tied to a Carnegie Mellon research framework and uses publicly described tasks, scoring categories, and methodology. (prnewswire.com, endorlabs.com, arxiv.org)

The immediate question is not whether coding agents can write working software; the benchmark says many already can. The question Endor Labs and the Carnegie Mellon paper both put on the table is whether companies will treat passing tests as proof of safety when the security scores remain in the teens. (endorlabs.com, arxiv.org)
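
For readers who want to see how a two-question score can diverge, here is a minimal Python sketch. It is not Endor Labs' or the SusVibes authors' harness: the `TaskResult` type, the `score_run` function, and the sample data are hypothetical, and the rule that a task counts as secure only when it passes both the functional and the security tests is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure)."""
    task_id: str
    passes_functional: bool  # the patch makes the feature tests pass
    passes_security: bool    # the patch also survives the security tests


def score_run(results: list[TaskResult]) -> dict[str, float]:
    """Return the functional and secure pass rates as percentages.

    Assumption: a task is counted as secure only if it is also
    functionally correct, so the secure rate can never exceed the
    functional rate.
    """
    if not results:
        raise ValueError("no task results to score")
    total = len(results)
    functional = sum(r.passes_functional for r in results)
    secure = sum(r.passes_functional and r.passes_security for r in results)
    return {
        "functional_pct": 100 * functional / total,
        "secure_pct": 100 * secure / total,
    }


if __name__ == "__main__":
    # Made-up sample data, not figures from the benchmark.
    sample = [
        TaskResult("task-1", True, True),
        TaskResult("task-2", True, False),
        TaskResult("task-3", True, False),
        TaskResult("task-4", False, True),
    ]
    print(score_run(sample))
    # prints {'functional_pct': 75.0, 'secure_pct': 25.0}
```

In this toy run three of four patches work, but only one both works and is safe. That is the kind of gap the leaderboard reports at a larger scale.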
