DeepSeek tops SWE‑bench Verified
- DeepSeek-V4-Pro moved to No. 1 on Hugging Face’s SWE-bench Verified leaderboard, edging Kimi-K2.6 on a 48-model benchmark page for coding models.
- The key number is 80.6% resolved for DeepSeek-V4-Pro, versus 80.2% for Kimi-K2.6, 79.0% for DeepSeek-V4-Flash, and 78.9% for Xiaomi MiMo-V2.5-Pro.
- It matters because the leaderboard is now Hub-native and community-fed, even as OpenAI says Verified is contaminated for frontier evaluation.

Coding benchmarks are weirdly political now. Not in the government sense, but in the “who gets to define progress” sense. That is why this leaderboard move matters. DeepSeek-V4-Pro now sits at the top of Hugging Face’s SWE-bench Verified benchmark page with an 80.6% resolved rate, just ahead of Moonshot AI’s Kimi-K2.6 at 80.2%, while DeepSeek-V4-Flash, Xiaomi MiMo-V2.5-Pro, and Z.ai’s GLM-5 round out the next spots. (huggingface.co)

### What is SWE-bench Verified, exactly?

It is a 500-task subset of SWE-bench built from real GitHub issues in open-source Python repos. The point is simple: give a model a repo and an issue, then see whether it can generate a patch that makes the hidden tests pass. Verified was introduced as a cleaner version of the original benchmark, with human review to remove bad or ambiguous tasks. (swebench.com)

### So what changed this week?

The visible change is the ranking on the Hugging Face benchmark page. Right now that page lists 48 models, and the top five are DeepSeek-V4-Pro at 80.6, Kimi-K2.6 at 80.2, DeepSeek-V4-Flash at 79.0, Xiaomi MiMo-V2.5-Pro at 78.9, and GLM-5 at 77.8. That is the concrete news: not a vague “open models are improving,” but a specific leaderboard where mostly Chinese labs and open-weight releases are clustered at the top. (huggingface.co)

### Why is Hugging Face in the middle of this?

Because Hugging Face changed how benchmark results can show up on the Hub. Its eval-results system lets model repos store benchmark YAML files in `.eval_results/`, and benchmark datasets automatically aggregate those scores into a leaderboard. Basically, instead of one company running a private spreadsheet, the benchmark page can [...] (huggingface.co) [...] big part of why this story exists at all. (huggingface.co)

### Are these all apples-to-apples results?

Not perfectly.
The SWE-bench site itself says its full Verified leaderboard mixes many kinds of coding systems: simple agent loops, RAG systems, multi-rollout setups, and review-style systems. It also offers a separate “bash only” mini-SWE-agent setting for cleaner language-model comparisons. On the Hub page, some scores are pulled from model c[...] (huggingface.co) [...] you inspect each submission. (swebench.com)

### Then why do people still care?

Because SWE-bench still maps to a real use case: can a model work through an actual software bug instead of just spitting out a toy code snippet? A leaderboard win here signals practical coding ability more than a lot of multiple-choice evals do. And the top of this list is no longer dominated by a single closed vendor. That changes how people shop for coding mode[...] (swebench.com) [...]osting. (huggingface.co)

### What is the catch?

OpenAI itself now says SWE-bench Verified should no longer be treated as a frontier coding yardstick. In February 2026, it argued the benchmark is increasingly contaminated, meaning models may have seen too much of the underlying data and even solution patches during training, and that some remaining tasks have flawed tests. OpenAI’s recommendation now is to use SWE-bench Pro for frontier evaluation instead. (openai.com)

### Does that make this leaderboard meaningless?

No, but it does change what the scores mean. Think of it less like a clean final exam and more like a public track meet on a course everyone has practiced on. You can still learn who is fast. But tiny gaps, such as 80.6 versus 80.2, are not the same thing as a definitive statement about the best coding model in the world. (openai.com)

### Bottom line?

The real story is not just that DeepSeek is first. It is that benchmark power is moving toward open, Hub-native, community-submitted evaluation plumbing, right as the benchmark itself is being questioned. That tension is the whole picture. (huggingface.co)
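
For a rough sense of how small the 80.6 versus 80.2 gap is, the quoted percentages can be converted back into task counts. The sketch below is illustrative only and rests on two assumptions that are not from the leaderboard itself: that every score was computed over all 500 Verified tasks, and that each task can be treated as an independent pass/fail trial for a back-of-the-envelope standard error. The helper names `implied_tasks` and `standard_error_points` are made up for this example.

```python
import math

N_TASKS = 500  # size of SWE-bench Verified, as described in the article

# Resolved rates (percent) quoted from the Hub benchmark page in the article.
scores = {
    "DeepSeek-V4-Pro": 80.6,
    "Kimi-K2.6": 80.2,
    "DeepSeek-V4-Flash": 79.0,
    "Xiaomi MiMo-V2.5-Pro": 78.9,
    "GLM-5": 77.8,
}


def implied_tasks(percent, n=N_TASKS):
    """Tasks resolved if the score were computed over all n tasks (assumption)."""
    return percent * n / 100


def standard_error_points(percent, n=N_TASKS):
    """Binomial standard error in percentage points.

    Assumes independent pass/fail tasks, which is a simplification.
    """
    p = percent / 100
    return 100 * math.sqrt(p * (1 - p) / n)


top = scores["DeepSeek-V4-Pro"]
runner_up = scores["Kimi-K2.6"]

gap_points = top - runner_up
gap_tasks = implied_tasks(top) - implied_tasks(runner_up)
se_top = standard_error_points(top)

for name, pct in scores.items():
    print(f"{name:22s} {pct:5.1f}%  ~{implied_tasks(pct):.1f} tasks")

print(f"Gap between first and second: {gap_points:.1f} points (~{gap_tasks:.0f} tasks)")
print(f"Rough standard error on the top score: {se_top:.2f} points")
```

Under these assumptions the lead comes down to about two tasks out of 500, well inside the rough sampling noise on a single score. Note also that 78.9% does not correspond to a whole number of tasks out of 500, which hints that not every entry was necessarily scored or rounded the same way.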