Artificial Analysis launches coding agent index

- Artificial Analysis launched a public Coding Agent Index on May 12, ranking full agent setups (model plus harness) across three software benchmarks.
- Cursor CLI with Claude Opus 4.7 led at 61, just ahead of Codex with GPT-5.5 and Claude Code with Opus 4.7 at 60.
- The bigger spread was in efficiency: per-task cost ran from $0.07 to $2.26, while average runtime stretched from 5.8 to 41.5 minutes.

Coding agents are turning into full software stacks, not just smart autocomplete with a better chat box. That makes benchmarking messy: the model matters, but so does the harness around it, meaning how it plans, edits files, runs tests, and decides when to stop. Artificial Analysis is trying to make that easier to compare. On May 12, it launched a public Coding Agent Index that scores complete agent setups across three different kinds of engineering work.

### What is this index actually measuring?

It is a composite score for coding agents, not raw models. Artificial Analysis averages pass@1 performance across SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA. Those three cover different jobs: making repository changes that pass tests, handling terminal-heavy workflows, and answering deep questions about a codebase. The whole point is to avoid the usual trap where one benchmark quietly becomes “coding ability” by itself. (artificialanalysis.ai)

### Why use three benchmarks?

Because real engineering work is lopsided. One task asks an agent to fix or implement code inside a repo. Another asks it to survive an actual terminal session with multi-step commands, environment setup, and tool use. A third asks whether it can understand a large codebase well enough to explain behavior and architecture. An agent that looks great on one of those can still be weak on the others. That is the gap this index is trying to expose. (artificialanalysis.ai)

### Who came out on top?

The current leader is Cursor CLI paired with Claude Opus 4.7 at 61. Right behind it are Codex with GPT-5.5 at 60 and Claude Code with Opus 4.7 at 60. Cursor CLI with GPT-5.5 lands at 58, then there is a drop to Claude Code with GLM-5.1 at 53, Claude Code with Kimi K2.6 and DeepSeek V4 Pro at 50, and Gemini CLI with Gemini 3.1 Pro at 43. So the top is crowded, but not flat. (artificialanalysis.ai)

### Why is the harness such a big deal?

Because the same model scores differently depending on the wrapper. Opus 4.7 scores 61 in Cursor CLI and 60 in Claude Code. GPT-5.5 scores 60 in Codex but 58 in Cursor CLI. That sounds small, but when the leaderboard is this compressed, a couple of points is the difference between first place and the second tier. Basically, the harness is the operating system for the model’s judgment. (artificialanalysis.ai)

### Where do the tradeoffs show up?

In cost and time. Claude Code with Opus 4.7 is the fastest listed setup at 5.8 minutes per task. Cursor CLI with GPT-5.5 is close at 6.2 minutes, and Codex with GPT-5.5 averages 7.1 minutes. But cost swings much harder: Cursor CLI with Composer 2 is $0.07 per task, while Claude Code with GLM-5.1 is $2.26 and Codex with GPT-5.5 is $2.21. That is roughly a 32x spread in cost and about a 7x spread in runtime, from 5.8 minutes for the fastest setup to 41.5 minutes for the slowest. (artificialanalysis.ai)

### Why does that matter more than the headline score?

Because teams do not buy benchmark points. They buy throughput. If one setup is 1 point worse but 10x cheaper, or nearly as good but much faster, that can be the better production choice. The index helps with the first cut, but the real decision lives in the efficiency tables underneath it. Artificial Analysis is pretty explicit about that: similar headline scores can hide very different strengths and operating costs. (artificialanalysis.ai)

### Is this replacing model leaderboards?

Not really. It is adding a missing layer. Traditional leaderboards mostly rank base models. This one ranks working agent configurations, meaning the model plus the toolchain and execution strategy. That is closer to how developers actually use these systems now, whether through Cursor, Codex, Claude Code, Gemini CLI, or something else. (artificialanalysis.ai)

### Bottom line?

The useful idea here is simple: “best coding model” is no longer a complete question.
The more relevant question is which full agent stack gets the job done reliably, fast enough, and cheap enough. Artificial Analysis just gave that question a public scoreboard. (artificialanalysis.ai)
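
For readers who want to check the arithmetic, here is a minimal Python sketch of the two calculations the article leans on: a composite score taken as the mean of three pass@1 results, and the cost and runtime spreads. It assumes the index is a simple unweighted average, which the article implies but does not spell out. The per-benchmark scores are invented for illustration and are not published results, and `composite_index` and `spread` are hypothetical helper names. Only the cost and runtime figures come from the article.

```python
from statistics import mean


def composite_index(pass_at_1: dict[str, float]) -> float:
    """Unweighted mean of per-benchmark pass@1 scores (0-100 scale).

    Assumption: equal weights across benchmarks.
    """
    return mean(list(pass_at_1.values()))


def spread(low: float, high: float) -> float:
    """Ratio between the highest and lowest value of a metric."""
    return high / low


# Hypothetical per-benchmark scores, invented for illustration only.
hypothetical_scores = {
    "SWE-Bench-Pro-Hard-AA": 40.0,
    "Terminal-Bench v2": 60.0,
    "SWE-Atlas-QnA": 50.0,
}
print(composite_index(hypothetical_scores))  # 50.0

# Figures quoted in the article: per-task cost in USD, runtime in minutes.
cost_spread = spread(0.07, 2.26)
runtime_spread = spread(5.8, 41.5)
print(f"{cost_spread:.1f}x cost, {runtime_spread:.1f}x runtime")  # 32.3x cost, 7.2x runtime
```

The same `spread` helper works for any pair of setups, which makes it easy to put a one- or two-point score gap next to the corresponding gap in cost or time.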
