SWE-bench posts agentic coding leaderboard

- SWE-bench’s official site now foregrounds an agentic coding leaderboard for SWE-bench Verified, comparing models and agents on 500 real GitHub issue-fix tasks.
- The key number is the metric itself: percent of issues resolved end to end, on a human-validated 500-task subset run in a shared harness.
- That matters because coding models are shifting from toy benchmarks to repo-level work, where navigation, patching, and test passing decide usefulness.

Coding benchmarks are getting more realistic, and SWE-bench just made that shift a lot more visible. Its official leaderboard now puts agentic coding front and center on SWE-bench Verified, a 500-task subset built from real GitHub issues in popular Python repositories. The point is simple: stop grading models on isolated code snippets and start grading them on whether they can actually land a working fix in a messy repo. That is the gap this leaderboard is trying to close. (swebench.com)

### What is SWE-bench actually measuring?

SWE-bench is a software engineering benchmark built from real issue-and-fix pairs pulled from GitHub. A model gets a repository and an issue description, then has to produce a patch that resolves the problem. On the site, the main score is “% Resolved”: basically, how many tasks the system fully solves, not how good its explanation sounded. (swebench.com)

The original problem with repo-level coding benchmarks was noise. Some tasks were ambiguous, some patches were questionable, and some issues were not clearly solvable from the provided context. SWE-bench Verified trims that down to 500 human-validated instances. Annotators checked that the issue description was clear, the test patch was correct, and the task was solvable with the available information. The result is much more usable as a comparison tool instead of a vibes contest. (swebench.com)

### Why call this “agentic coding”?

Because these tasks are not one-shot autocomplete. The system has to inspect files, reason about repo structure, edit code, and keep trying until tests pass or the run ends. SWE-bench’s own docs describe the leaderboard as covering everything from simple language-model loops to retrieval-heavy systems and multi-rollout review setups. In other words, the benchmark is testing workflow, not just syntax. (swebench.com)

### What changed on the leaderboard?

The official site now makes the split between plain model evaluation and full agent evaluation much clearer. On SWE-bench Verified, you can compare arbitrary agents on the full leaderboard, or switch to a bash-only setup that runs language models through mini-SWE-agent for a more apples-to-apples comparison. That matters because scaffold quality can inflate results: a great wo[…] model with a weak loop. (swebench.com)

### What is mini-SWE-agent?

It is SWE-bench’s deliberately stripped-down baseline: a minimal ReAct-style loop with just a bash shell and no fancy tool stack. The team says this setup is meant to compare language models directly, without special scaffolding. They also highlight that mini-SWE-agent hit 65% on SWE-bench Verified in roughly 100 lines of Python, which is a pretty pointed statement: some of the recent gains are com[…]m elaborate agent frameworks. (swebench.com)

### Why should engineers care?

Because this is much closer to the work coding agents are being sold for: bug fixing, repo navigation, CI triage, and PR automation. A benchmark like HumanEval can tell you whether a model writes a neat function from scratch. SWE-bench Verified tells you whether it can survive contact with an existing codebase. If you are deciding what to trust in a real engineering workflow, that second question is the one that bites. (swebench.com)

### Is this the final answer on coding agents?

Not quite. SWE-bench still focuses on Python repos, and even Verified is a curated subset rather than the whole universe of engineering work. But the direction is clear: the benchmark is evolving toward more realistic, reproducible, and agent-centered evaluation, with newer projects like CodeClash pushing even further toward goal-oriented development. (swebench.com)

SWE-bench is turning coding leaderboards into something more like an engineering test. That does not make the rankings perfect, but it makes them a lot harder to game, and a lot more useful. (swebench.com)
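For readers who want to see how an all-or-nothing score like “% Resolved” behaves, here is a minimal Python sketch. It is not the official SWE-bench harness: the `TaskResult` record, its field names, and the toy data are all hypothetical, and the only assumption carried over from the article is that a task counts when it is fully solved, with no partial credit.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical record, not the official schema)."""
    instance_id: str
    tests_passed: int
    tests_total: int

    @property
    def resolved(self) -> bool:
        # Assumption: a task counts only if every required test passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total


def percent_resolved(results: list[TaskResult]) -> float:
    """Share of tasks fully resolved, as a percentage of all tasks attempted."""
    if not results:
        return 0.0
    solved = sum(1 for r in results if r.resolved)
    return 100.0 * solved / len(results)


# Invented toy data: 500 tasks, 325 fully passing, 175 with one test still failing.
results = [TaskResult(f"task-{i}", 4, 4) for i in range(325)]
results += [TaskResult(f"task-{i}", 3, 4) for i in range(325, 500)]
print(f"{percent_resolved(results):.1f}% resolved")
```

In this toy run, the 175 near-misses contribute nothing to the score, which is why a strict resolved-or-not metric is harder to flatter than one that rewards plausible-looking patches.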
