DeepSeek tops SWE‑bench Verified

- DeepSeek-V4-Pro moved to No. 1 on Hugging Face’s SWE-bench Verified leaderboard, edging Kimi-K2.6 on a 48-model benchmark page for coding models.
- The key number is 80.6% resolved for DeepSeek-V4-Pro, versus 80.2% for Kimi-K2.6, 79.0% for DeepSeek-V4-Flash, and 78.9% for Xiaomi MiMo-V2.5-Pro.
- It matters because the leaderboard is now Hub-native and community-fed, even as OpenAI says Verified is contaminated for frontier evaluation.

Coding benchmarks are weirdly political now. Not in the government sense, but in the “who gets to define progress” sense. That is why this leaderboard move matters. DeepSeek-V4-Pro now sits at the top of Hugging Face’s SWE-bench Verified benchmark page with an 80.6% resolved rate, just ahead of Moonshot AI’s Kimi-K2.6 at 80.2%, while DeepSeek-V4-Flash, Xiaomi MiMo-V2.5-Pro, and Z.ai’s GLM-5 round out the next spots. (huggingface.co)

### What is SWE-bench Verified, exactly?

It is a 500-task subset of SWE-bench built from real GitHub issues in open-source Python repos. The point is simple: give a model a repo and an issue, then see whether it can generate a patch that makes the hidden tests pass. Verified was introduced as a cleaner version of the original benchmark, with human review to remove bad or ambiguous tasks. (swebench.com)

### So what changed this week?

The visible change is the ranking on the Hugging Face benchmark page. Right now that page lists 48 models, and the top five are DeepSeek-V4-Pro at 80.6, Kimi-K2.6 at 80.2, DeepSeek-V4-Flash at 79.0, Xiaomi MiMo-V2.5-Pro at 78.9, and GLM-5 at 77.8. That is the concrete news: not a vague “open models are improving,” but a specific leaderboard where mostly Chinese labs and open-weight releases are clustered at the top. (huggingface.co)

### Why is Hugging Face in the middle of this?

Because Hugging Face changed how benchmark results can show up on the Hub. Its eval-results system lets model repos store benchmark YAML files in `.eval_results/`, and benchmark datasets automatically aggregate those scores into a leaderboard. Basically, instead of one company running a private spreadsheet, the benchmark page can […] (huggingface.co) […] big part of why this story exists at all. (huggingface.co)

### Are these all apples-to-apples results?

Not perfectly.
The SWE-bench site itself says its full Verified leaderboard mixes many kinds of coding systems: simple agent loops, RAG systems, multi-rollout setups, and review-style systems. It also offers a separate “bash only” mini-SWE-agent setting for cleaner language-model comparisons. On the Hub page, some scores are pulled from model c[…] (huggingface.co) […] you inspect each submission. (swebench.com)

### Then why do people still care?

Because SWE-bench still maps to a real use case: can a model work through an actual software bug instead of just spitting out a toy code snippet? A leaderboard win here signals practical coding ability more than a lot of multiple-choice evals do. And the top of this list is no longer dominated by a single closed vendor. That changes how people shop for coding models […] (swebench.com) […]osting. (huggingface.co)

### What is the catch?

OpenAI itself now says SWE-bench Verified should no longer be treated as a frontier coding yardstick. In February 2026, it argued that the benchmark is increasingly contaminated (meaning models may have seen too much of the underlying data, and even solution patches, during training) and that some remaining tasks have flawed tests. OpenAI’s recommendation now is to use SWE-bench Pro for frontier evaluation instead. (openai.com)

### Does that make this leaderboard meaningless?

No, but it does change what the scores mean. Think of it less like a clean final exam and more like a public track meet on a course everyone has practiced on. You can still learn who is fast. But tiny gaps, such as 80.6 versus 80.2, are not the same thing as a definitive statement about the best coding model in the world. (openai.com)

### Bottom line?

The real story is not just that DeepSeek is first. It is that benchmark power is moving toward open, Hub-native, community-submitted evaluation plumbing, right as the benchmark itself is being questioned. That tension is the whole picture. (huggingface.co)
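
For a rough sense of how small the 80.6 versus 80.2 gap at the top of the table is, here is a back-of-the-envelope sketch in Python. It assumes each score is simply resolved tasks divided by the 500 tasks in Verified, and it treats every task as an independent pass/fail trial. Both are simplifying assumptions for illustration, not how the leaderboard itself reports results, and the helper names are hypothetical.

```python
import math

TOTAL_TASKS = 500  # size of SWE-bench Verified, as described in the article

# Resolved rates (percent) for the top two models on the benchmark page
scores = {
    "DeepSeek-V4-Pro": 80.6,
    "Kimi-K2.6": 80.2,
}


def resolved_tasks(rate_pct, total=TOTAL_TASKS):
    """Convert a resolved rate in percent into a whole number of tasks.

    Assumes the rate is resolved / total for a single run (an assumption).
    """
    return round(rate_pct / 100 * total)


def standard_error_pct(rate_pct, total=TOTAL_TASKS):
    """Binomial standard error of the rate, in percentage points.

    Assumes tasks are independent trials, which is a simplification.
    """
    p = rate_pct / 100
    return 100 * math.sqrt(p * (1 - p) / total)


task_counts = {name: resolved_tasks(rate) for name, rate in scores.items()}
gap_tasks = task_counts["DeepSeek-V4-Pro"] - task_counts["Kimi-K2.6"]
top_se = standard_error_pct(scores["DeepSeek-V4-Pro"])

for name, count in task_counts.items():
    print(f"{name}: {count} of {TOTAL_TASKS} tasks resolved")
print(f"Gap between first and second: {gap_tasks} tasks")
print(f"Approximate standard error for the top score: {top_se:.2f} points")
```

Under these assumptions, the lead works out to 403 versus 401 resolved tasks, a difference of two, while the sampling noise on a score near 80% of 500 tasks is roughly 1.8 percentage points. That is one way to see why a 0.4-point gap ranks the models without settling which one is better.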
