Kimi K2.6 tops GPT-5.4 on several coding benchmarks, Moonshot says

- Moonshot AI released Kimi K2.6 on April 21, saying its new open-source multimodal model beats GPT-5.4 on several coding and agent benchmarks.
- Moonshot’s posted table shows K2.6 at 58.6 on SWE-Bench Pro versus GPT-5.4 at 57.7, with 300-agent orchestration and 4,000 steps.
- Cloudflare added K2.6 to Workers AI on April 20, putting Moonshot’s model into mainstream developer tooling. (developers.cloudflare.com)

Artificial intelligence coding models are judged by whether they can finish real software tasks, not just answer questions. Moonshot AI says its new Kimi K2.6 model now edges GPT-5.4 on several of those tests. (forum.moonshot.ai) (huggingface.co)

Moonshot announced Kimi K2.6 on April 21 and described it as an open-source, native multimodal agentic model. The company says it is built for long coding sessions, tool use, image input, and multi-step autonomous work. (forum.moonshot.ai) (moonshot.ai)

The architecture is a mixture-of-experts system, a design that routes each token through part of a much larger network instead of all of it. Moonshot says K2.6 has 1 trillion total parameters, 32 billion active parameters per token, and a context window of about 262,000 tokens on partner platforms. (huggingface.co) (developers.cloudflare.com)

On Moonshot’s own benchmark table, K2.6 scores 58.6 on SWE-Bench Pro, compared with 57.7 for GPT-5.4 and 53.4 for Claude Opus 4.6. It also posts 66.7 on Terminal-Bench 2.0 versus 65.4 for GPT-5.4, while trailing GPT-5.4 on some other tests such as Toolathlon and OSWorld-Verified. (huggingface.co)

Those numbers matter because SWE-Bench Pro and Terminal-Bench try to measure whether a model can actually repair code, use tools, and complete developer workflows. They are closer to a software job ticket than a standard chatbot prompt. (huggingface.co) (forum.moonshot.ai)

Moonshot is also pushing endurance as part of the release. The company says K2.6 can sustain more than 4,000 tool calls, run for more than 12 hours, and coordinate up to 300 sub-agents on a single task. (forum.moonshot.ai) (developers.cloudflare.com)

That pitch is aimed at a newer class of AI products that act more like junior operators than chatbots. Instead of producing one answer, these systems search the web, write code, call tools, inspect files, and keep going across many turns. (platform.moonshot.ai) (developers.cloudflare.com)

Moonshot says K2.6 is open-source, and the company has published a model card on Hugging Face while listing the model on its own API platform. Ollama also added a cloud-hosted K2.6 entry last week, extending distribution beyond Moonshot’s own stack. (huggingface.co) (platform.moonshot.ai) (ollama.com)

Cloudflare said on April 20 that K2.6 was available on Workers AI with day-zero support. Its changelog repeats Moonshot’s positioning: long-horizon coding, vision, a 262.1K context window, and 300-agent swarm orchestration. (developers.cloudflare.com)

The release does not show K2.6 winning everything. Moonshot’s own table still has GPT-5.4 ahead on Toolathlon, OSWorld-Verified, and several reasoning and vision benchmarks, which leaves K2.6 looking strongest in coding-heavy, tool-using workloads rather than across the board. (huggingface.co)

The immediate test is whether developers trust Moonshot’s posted scores enough to run real agents on it. K2.6 is now in the places where that decision gets made: Hugging Face, Cloudflare Workers AI, Moonshot’s API, and Ollama. (huggingface.co) (developers.cloudflare.com) (platform.moonshot.ai) (ollama.com)
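
For readers who want a concrete picture of the mixture-of-experts idea mentioned above, here is a toy sketch of top-k routing in plain Python. It is not Moonshot's router. The expert count, the value of `k`, the router scores, and the function names are all hypothetical, since the article does not say how K2.6 selects experts. The only figures taken from the article are the 1 trillion total and 32 billion active parameters, used at the end to work out the active share per token.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, experts, router_logits, k):
    """Run only the selected experts and blend their outputs by router weight."""
    return sum(w * experts[i](token) for i, w in top_k_route(router_logits, k))

# Hypothetical toy setup: 8 "experts" that each scale the input differently.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]  # made-up scores

routes = top_k_route(router_logits, k=2)
output = moe_layer(3.0, experts, router_logits, k=2)
print("experts used:", [i for i, _ in routes], "of", len(experts))
print("layer output:", round(output, 4))

# Figures from the article: 32 billion active out of 1 trillion total parameters.
active_fraction = 32e9 / 1e12
print(f"active share per token: {active_fraction:.1%}")
```

In this toy run only two of the eight experts do any work for the token, which is the cost-saving property the design is known for. By the article's figures, K2.6 would touch roughly 3.2 percent of its parameters per token.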
