GLM-5.1 tops coding benchmarks

Z.ai released GLM-5.1, an open-source model that hit #1 among open models and #3 overall on coding benchmarks like SWE-Bench Pro, and the team says it’s optimised for long-horizon coding tasks at much lower cost. The model is being promoted as cheaper to run and usable in autonomous coding workflows, which could make it attractive for experimentation in student projects that need local, budget-friendly LLMs. That gives developers a performant open model to test agentic coding and reproducibility without depending on paid APIs. (x.com)

A new open model just landed near the top of one of the hardest software tests in artificial intelligence. Z.ai says its new GLM-5.1 model ranks first among open models and third overall on SWE-Bench Pro, a benchmark built around fixing real GitHub issues in codebases that already exist. (z.ai, huggingface.co)

That matters because coding benchmarks have been moving away from short puzzle questions and toward messier tasks that look more like actual engineering work. SWE-Bench measures whether a model can resolve real software bugs from open-source repositories, and the Verified version uses a human-filtered subset of 500 instances to keep comparisons cleaner. (swebench.com)

GLM-5.1 is aimed at a harder category still: long-horizon coding. In plain terms, that means work that does not end after one code suggestion, but instead requires planning, editing files, running tests, reading failures, changing strategy, and repeating the cycle many times. (z.ai, huggingface.co)

That is where many language models stall. Z.ai says earlier systems often make a quick first improvement, then flatten out, while GLM-5.1 is tuned to keep improving across hundreds of rounds and thousands of tool calls instead of burning through its best ideas at the start. (z.ai, huggingface.co)

Under the hood, GLM-5.1 builds on the larger GLM-5 family that Z.ai released with a 744 billion parameter mixture-of-experts design, with 40 billion active parameters at a time. The company’s technical report says that architecture was built for “agentic engineering,” meaning software work where the model acts more like a persistent assistant than a one-shot chatbot. (github.com, arxiv.org) Z.ai also says the family uses DeepSeek Sparse Attention, a method intended to preserve long context while reducing training and inference cost.
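
For readers who want to picture the long-horizon cycle described above (run tests, read failures, edit files, change strategy, repeat), here is a minimal Python sketch. It is an illustration under stated assumptions, not Z.ai's harness: the names `long_horizon_loop`, `propose_edits`, `toy_tests`, and `toy_policy` are hypothetical, and the "model" is a toy function standing in for a real one.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; not part of any GLM-5.1 or Z.ai API.
Workspace = Dict[str, str]                     # file name -> file contents
TestRunner = Callable[[Workspace], List[str]]  # returns names of failing tests
Policy = Callable[[Workspace, List[str], List[str]], Dict[str, str]]  # stands in for the model


@dataclass
class RunLog:
    rounds: int = 0                 # number of edit rounds actually performed
    solved: bool = False            # True once the test runner reports no failures
    failures_per_round: List[int] = field(default_factory=list)


def long_horizon_loop(workspace: Workspace, run_tests: TestRunner,
                      propose_edits: Policy, max_rounds: int = 100) -> RunLog:
    """Repeat: run tests, read failures, ask the policy for edits, apply them."""
    log = RunLog()
    history: List[str] = []  # notes on earlier attempts, so a policy could change strategy
    for _ in range(max_rounds):
        failures = run_tests(workspace)
        log.failures_per_round.append(len(failures))
        if not failures:
            log.solved = True
            break
        edits = propose_edits(workspace, failures, history)
        workspace.update(edits)
        history.append(f"round {log.rounds}: {len(failures)} failing, edited {sorted(edits)}")
        log.rounds += 1
    return log


# Toy stand-ins so the sketch runs without a model or a real test suite.
def toy_tests(ws: Workspace) -> List[str]:
    # A "file" fails if it still contains the marker TODO.
    return [name for name, body in sorted(ws.items()) if "TODO" in body]


def toy_policy(ws: Workspace, failures: List[str], history: List[str]) -> Dict[str, str]:
    # Fix exactly one failing file per round.
    target = failures[0]
    return {target: ws[target].replace("TODO", "done")}


if __name__ == "__main__":
    ws = {"a.py": "TODO", "b.py": "TODO", "c.py": "ok"}
    print(long_horizon_loop(ws, toy_tests, toy_policy, max_rounds=10))
```

The `max_rounds` budget and the per-round failure counts are the parts worth watching in a real experiment: a run that flattens out early shows up as a failure count that stops dropping long before the budget is spent.
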
The cost angle is central to the GLM-5.1 pitch, because a model that is strong at long tasks but expensive to run loses much of its appeal for students, researchers, and small teams. (github.com, arxiv.org)

On Z.ai’s published benchmark table, GLM-5.1 posts 58.4 on SWE-Bench Pro, ahead of GPT-5.4 at 57.7, Claude Opus 4.6 at 57.3, and Gemini 3.1 Pro at 54.2. The same table shows gains over the earlier GLM-5 model on NL2Repo, a repository-generation benchmark, and on Terminal-Bench 2.0, which tests real terminal tasks. (huggingface.co, z.ai)

Independent leaderboard trackers appear to support the broad ranking claim, though exact placements vary by source and update timing. BenchLM listed GLM-5.1 at 58.4 on SWE-Bench Pro on April 7, 2026, with Claude Mythos Preview above it overall and GPT-5.4 just below it, which matches the “number one open model, number three overall” framing in Z.ai’s announcement. (benchlm.ai, z.ai)

Z.ai is not only selling benchmark scores. The company says GLM-5.1 can stay on one coding task for hours, revising plans through repeated experiments, and Computerworld reported that Z.ai described the model as able to improve over hundreds of iterations in autonomous coding workflows. (computerworld.com, z.ai)

For developers, the open release may be the bigger story than the score itself. The model card on Hugging Face lists local serving support through SGLang, vLLM, xLLM, Transformers, and KTransformers, which means people can test it in their own stacks instead of relying entirely on a paid hosted API. (huggingface.co)

That creates a different kind of opportunity for classrooms and small labs. A student team building a coding agent can swap prompts, tools, and evaluation harnesses around the same open model, rerun experiments, and share setups that other people can reproduce without needing the exact same commercial account or pricing tier. (huggingface.co, github.com)

There are still reasons to be careful with the headline. Some of the strongest numbers come from the vendor’s own published table, benchmark rankings can shift within days, and software-agent performance often depends on the harness, tool budget, and reasoning settings wrapped around the base model. (swebench.com, huggingface.co, benchlm.ai)

Even with those caveats, GLM-5.1 looks like a real milestone for open coding models as of April 2026. If you want a more budget-friendly model for local experiments in autonomous coding, reproducible agent research, or student projects that cannot live on expensive APIs, the shortlist just changed. (z.ai, huggingface.co, computerworld.com)
