Qwen 3.7 Max posts SWE-bench 80.4, highlighting strong agentic performance
- Alibaba launched Qwen 3.7 Max on May 20, saying the model targets agentic workloads and posted 80.4 on SWE-bench Verified.
- Alibaba said Qwen 3.7 Max ran autonomously for 35 hours, made 1,158 tool calls, and scored 69.7 on Terminal-Bench 2.0.
- Qwen Code docs and Alibaba Cloud’s summit materials show the model is available now, with API access through Bailian/Model Studio.

Alibaba launched Qwen 3.7 Max at its Cloud Summit in Hangzhou on May 20, presenting the model as a system built for long-running agentic work rather than single-turn chat. The company said the model scored 80.4 on SWE-bench Verified and 69.7 on Terminal-Bench 2.0, two benchmarks used to test software engineering and terminal-based agent performance. Alibaba also said Qwen 3.7 Max supports a 1 million-token context window and can work inside external agent harnesses including Claude Code and OpenClaw.

### Why are those two benchmark numbers getting attention?

SWE-bench Verified is a 500-problem subset of SWE-bench in which software engineers confirmed the issues are solvable, according to the benchmark’s GitHub repository. The benchmark measures whether a model can generate patches that resolve real GitHub issues in codebases. An 80.4 score, if reproduced under comparable settings, places Qwen 3.7 Max near the top end of reported software-engineering-agent results. (alibabacloud.com)

Terminal-Bench 2.0 measures whether agents can complete end-to-end tasks in terminal environments, including debugging code, configuring systems and handling security or data workflows, according to the project site, repository and paper abstract. Alibaba’s reported 69.7 matters because the benchmark is designed around longer, tool-using tasks rather than short-answer coding prompts. (github.com)

### What exactly did Alibaba say the model can do?

Alibaba said Qwen 3.7 Max is engineered for “sustained, multi-step operations” and highlighted a 35-hour internal test on its Zhenwu M890 chip. In that experiment, the company said, the model was given a task on hardware it had not seen in training, made more than 1,000 tool calls, and produced a production-grade computing kernel that outperformed the chip maker’s official version by a factor of ten.
(tbench.ai)

AIHub, which summarized the launch materials in Chinese, reported a more specific figure of 1,158 tool calls and said the autonomous kernel-optimization run included 432 kernel evaluations. That same summary listed the model’s Terminal-Bench 2.0 score at 69.7 and said Qwen 3.7 Max is intended to serve as a base model for coding agents, office automation and long-duration autonomous tasks. (alibabacloud.com)

### How does Alibaba position the model inside agent tooling?

Alibaba said Qwen 3.7 Max is optimized for agent harnesses including OpenClaw, Hermes Agent, Claude Code, Qwen Paw and Qoder. Qwen Code, the company’s own terminal-based coding agent, is documented separately as a tool for turning ideas into code from the command line. That framing suggests Alibaba is marketing the model as infrastructure for agent systems, not only as a chatbot endpoint. (aihub.cn)

The company’s launch post also tied the model to Bailian, called Model Studio outside China, where Alibaba said the new server and model services are available. AIHub said the API would be offered through Alibaba Cloud’s Bailian platform using OpenAI-compatible and Anthropic-compatible interfaces. (alibabacloud.com)

### What should readers be careful about when reading the claims?

Alibaba’s benchmark figures and endurance test are company-reported results. Neither the Alibaba Cloud post nor the secondary summaries reviewed here provide a full public methodology for the Qwen 3.7 Max benchmark runs, including the exact harness settings, budgets or pass criteria used to produce the reported scores. (alibabacloud.com)

The benchmark projects themselves are public. SWE-bench publishes its repository and describes Verified as a curated solvable subset, while Terminal-Bench 2.0 is documented through its website, GitHub repository and paper. Those sources establish what the benchmarks are measuring, but they do not independently verify Alibaba’s specific Qwen 3.7 Max scores.
(alibabacloud.com)

### What happens next for Qwen 3.7 Max?

May 20 was the public debut at Alibaba Cloud Summit, and the next checkpoints are likely to be broader API rollout through Bailian/Model Studio and third-party reproductions in public harnesses. Qwen Code documentation is already live, and the public benchmark repositories for SWE-bench and Terminal-Bench give outside developers a path to test comparable agent setups against the same task families. (github.com) (alibabacloud.com)
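
For readers who want a rough sense of scale, the figures quoted in this article can be turned into more concrete quantities. The sketch below is back-of-the-envelope arithmetic only. It assumes the SWE-bench Verified score is the percentage of the 500 problems resolved and that the 1,158 tool calls were spread evenly across the 35-hour run; Alibaba has not confirmed either assumption, and the function names are illustrative.

```python
# Back-of-the-envelope arithmetic on figures reported in this article.
# Assumptions (not confirmed by Alibaba): the SWE-bench Verified score is the
# percentage of the 500 problems resolved, and tool calls were evenly spaced.

SWE_BENCH_VERIFIED_PROBLEMS = 500   # size of the Verified subset
REPORTED_SCORE_PCT = 80.4           # company-reported SWE-bench Verified score
RUN_HOURS = 35                      # reported length of the autonomous run
TOOL_CALLS = 1158                   # figure reported by AIHub
KERNEL_EVALUATIONS = 432            # figure reported by AIHub


def resolved_problems(score_pct: float, total: int) -> int:
    """Convert a percentage score into an approximate count of resolved problems."""
    return round(score_pct / 100 * total)


def calls_per_hour(calls: int, hours: float) -> float:
    """Average tool calls per hour, assuming an even spread over the run."""
    return calls / hours


def minutes_per_call(calls: int, hours: float) -> float:
    """Average minutes between tool calls, assuming an even spread over the run."""
    return hours * 60 / calls


if __name__ == "__main__":
    resolved = resolved_problems(REPORTED_SCORE_PCT, SWE_BENCH_VERIFIED_PROBLEMS)
    print(f"Implied resolved problems: {resolved} of {SWE_BENCH_VERIFIED_PROBLEMS}")
    print(f"Tool calls per hour: {calls_per_hour(TOOL_CALLS, RUN_HOURS):.1f}")
    print(f"Minutes per tool call: {minutes_per_call(TOOL_CALLS, RUN_HOURS):.2f}")
    print(f"Tool calls per kernel evaluation: {TOOL_CALLS / KERNEL_EVALUATIONS:.2f}")
```

Under those assumptions, an 80.4 score corresponds to 402 of 500 problems, and the endurance run averages roughly 33 tool calls per hour. None of this substitutes for a published methodology; it only shows what the headline numbers imply if taken at face value.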