DeepSWE shows GPT‑5.5 lead

Published by The Daily Scout

What happened

- DeepSWE, a developer‑task coding benchmark built from original multi‑file problems, published fresh cross‑model results on realistic SWE work.
- GPT‑5.5 reached 70% pass@1, GPT‑5.4 56%, Claude Opus 4.7 54%, Claude Sonnet 4.6 32% and Gemini 3.5 Flash 28% across the dataset.
- The dataset and methodology are openly published on GitHub so teams can evaluate models on heavier, multi‑file engineering tasks. (x.com)

Why it matters

1/ DeepSWE is a new coding benchmark aimed at a specific complaint about current evals: frontier models are bunching together on public SWE tests, while developers say their day-to-day coding performance still feels meaningfully different. DeepSWE’s first public leaderboard puts GPT‑5.5 at 70% pass@1, ahead of GPT‑5.4 at 56% and Claude Opus 4.7 at 54%. (deepswe.datacurve.ai)

2/ What makes it different is the task design. DeepSWE says its benchmark uses 113 original, long-horizon software engineering tasks drawn from active open-source repositories, spanning TypeScript, Go, Python, JavaScript and Rust, with isolated environments and program-based verifiers. (github.com)

3/ The contamination point matters. DeepSWE says its tasks are written from scratch rather than adapted from existing commits or pull requests, so the benchmark is intended to reduce the chance that a model saw the answer during pretraining. That is one of the biggest recurring arguments around coding benchmarks now. (deepswe.datacurve.ai)

4/ The benchmark is also trying to measure a heavier kind of work than “write a function” coding tests. DeepSWE says its prompts are shorter and more behavior-focused, but the underlying tasks require substantially more code than SWE-bench Pro, with agents needing to discover where and how to implement a change rather than follow an over-specified issue description. (deepswe.datacurve.ai)

5/ The published leaderboard shows a wider spread than many headline coding evals. After GPT‑5.5 at 70%, the next group is GPT‑5.4 at 56% and Claude Opus 4.7 at 54%, then a drop to Claude Sonnet 4.6 at 32% and Gemini 3.5 Flash at 28%. DeepSWE also lists confidence intervals alongside those scores. (deepswe.datacurve.ai)

6/ That gap is the story here. On this benchmark, GPT‑5.5 is not just narrowly ahead; it is 14 points above GPT‑5.4 and 16 above Claude Opus 4.7. DeepSWE’s authors say the benchmark was built to “separate” frontier models where other public tests increasingly overlap. (deepswe.datacurve.ai)

7/ There is an important caveat in the setup: all models were run with the same agent scaffold, mini-swe-agent. That helps comparability, but it also means the results reflect model-plus-harness performance under one shared evaluation recipe, not some pure, tool-free measure of coding ability. (github.com)

8/ The verification design is another key detail. DeepSWE says its verifiers are hand-written to test software behavior rather than implementation details, and that acceptable solutions are graded by observable correctness rather than matching a reference patch. That is meant to reward working fixes, not exact replicas. (github.com)

9/ The benchmark authors are also making an explicit criticism of older public evals. In the DeepSWE blog, they say SWE-bench Pro tasks average 120 lines of code to solve and claim an internal audit found verifier misgrading rates of 8% false positives and 24% false negatives. That is their argument for building a new benchmark, though that specific audit claim comes from DeepSWE’s own write-up. (deepswe.datacurve.ai)

10/ For teams deciding whether to care, the practical part is that the benchmark and task corpus are public. DeepSWE links to its GitHub repo, publishes task format details, and says users can browse trajectories or run their own agents against the benchmark. That makes this less of a one-off leaderboard screenshot and more of a reproducible eval artifact. (github.com)

11/ The broader takeaway is narrow but useful: if you care about realistic multi-file engineering work, DeepSWE is trying to test a harder regime than saturated coding benchmarks. The initial result favors GPT‑5.5 by a clear margin, but the more durable contribution may be the open benchmark design and the push toward contamination-resistant, behavior-graded SWE evaluation. (deepswe.datacurve.ai)
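To make the grading distinction in point 8 concrete, here is a minimal Python sketch of behavior-based verification versus reference-patch matching. It is not DeepSWE's verifier code: the `slugify` task, both source snippets, and the helper names (`load_solution`, `behavior_verifier`, `patch_match_verifier`) are hypothetical and exist only for illustration.

```python
# Hypothetical task: turn a title into a lowercase, hyphen-separated slug.
# Two solutions with different source text but the same observable behavior.
REFERENCE_SRC = '''
def slugify(title):
    return "-".join(title.lower().split())
'''

CANDIDATE_SRC = '''
def slugify(title):
    words = [w for w in title.strip().lower().split(" ") if w]
    return "-".join(words)
'''


def load_solution(source: str):
    """Execute a solution's source and return its slugify function."""
    namespace = {}
    exec(source, namespace)
    return namespace["slugify"]


def behavior_verifier(slugify) -> bool:
    """Grade by observable behavior: inputs mapped to expected outputs."""
    cases = {
        "Hello World": "hello-world",
        "  Deep   SWE ": "deep-swe",
        "": "",
    }
    return all(slugify(text) == expected for text, expected in cases.items())


def patch_match_verifier(candidate_source: str, reference_source: str) -> bool:
    """Grade by textual equality with the reference solution."""
    return candidate_source.strip() == reference_source.strip()


if __name__ == "__main__":
    candidate = load_solution(CANDIDATE_SRC)
    print("behavior check:", behavior_verifier(candidate))
    print("patch match:", patch_match_verifier(CANDIDATE_SRC, REFERENCE_SRC))
```

In this toy case the candidate passes the behavior check but fails the textual comparison, which is the kind of working-but-different fix that behavior-based grading is meant to accept.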

Key numbers

  • Pass@1: GPT‑5.5 70%, GPT‑5.4 56%, Claude Opus 4.7 54%, Claude Sonnet 4.6 32%, Gemini 3.5 Flash 28%.
  • Lead: GPT‑5.5 is 14 points above GPT‑5.4 and 16 points above Claude Opus 4.7.
  • Tasks: 113 original, long-horizon tasks spanning TypeScript, Go, Python, JavaScript and Rust.
  • DeepSWE’s own claims about SWE-bench Pro: tasks average 120 lines of code to solve; an internal audit found verifier misgrading rates of 8% false positives and 24% false negatives.
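For readers who want to see how a pass@1 score and an accompanying confidence interval can be derived from per-task outcomes, here is a minimal Python sketch. The source does not say which interval method DeepSWE uses, so the Wilson score interval is an assumption, and the ten outcomes are toy data, not benchmark results.

```python
import math


def pass_at_1(outcomes):
    """Fraction of tasks solved on the first (single) attempt."""
    return sum(outcomes) / len(outcomes)


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%).

    Assumption: DeepSWE's actual interval method is not stated in the
    article; Wilson is used here only as a common choice for small n.
    """
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Toy data: 7 passes and 3 failures across 10 hypothetical tasks.
    outcomes = [True] * 7 + [False] * 3
    score = pass_at_1(outcomes)
    low, high = wilson_interval(sum(outcomes), len(outcomes))
    print(f"pass@1 = {score:.2f}, interval = ({low:.3f}, {high:.3f})")
```

With a task set of about a hundred items, intervals of this kind are still several points wide, which is why the published confidence intervals matter when comparing models that sit close together, such as the 56% and 54% scores.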

What happens next

  • The initial result favors GPT‑5.5 by a clear margin, but the more durable contribution may be the open benchmark design and the push toward contamination-resistant, behavior-graded SWE evaluation.
