Side-by-side tests find DeepSeek V4 beats Qwen 3.6, Kimi K2.6 and GLM

- YouTube creators spent the last week pitting DeepSeek V4 against Qwen 3.6, Kimi K2.6, GLM 5.1 and MiMo 2.5 Pro on real coding tasks.
- The clearest pattern was practical, not benchmark-y: DeepSeek V4 kept winning browser macOS-clone and workflow tests, while also undercutting rivals on price.
- That matters because open-model buyers are shifting from leaderboard prestige toward reliability on long agent runs and cost per shipped task.

Open models are having a very specific kind of moment. Not “who won a benchmark,” but “which model can actually finish the job when you hand it a messy developer task.” That is why these side-by-side DeepSeek V4 tests landed. In the past week, multiple creators ran DeepSeek V4 against Qwen 3.6, Kimi K2.6, GLM 5.1, MiMo 2.5 Pro, and others on browser UI cloning and coding workflows, and DeepSeek kept coming out looking like the most dependable option in the group. (youtube.com)

### What were people actually testing?

The headline test was simple on purpose: give every model the same prompt (build a browser-based macOS clone), then compare the result you can see and use, not just a score in a chart. The KGP Talkie video published April 28 framed it exactly that way, with DeepSeek V4, Qwen 3.6+, GLM 5.1, Kimi K2.6, and MiMo 2.5 Pro all attempting the same UI task. Another [...] (youtube.com) and multilingual tasks, but still kept the format hands-on. (youtube.com)

### Why does that kind of test matter?

Because agentic coding fails in boring ways. A model starts strong, then loses the thread, burns tokens, or breaks once the task gets long and tool-heavy. DeepSeek V4 was built with that exact problem in mind. Its pitch is not just “bigger model”; it is “a million-token context that agents can actually use,” with much lower compute and memory cost at long context than DeepSeek V3.2. (huggingface.co)

### What is DeepSeek V4, exactly?

There are two preview checkpoints. DeepSeek-V4-Pro is a 1.6T-parameter MoE model with 49B active parameters, and DeepSeek-V4-Flash is 284B total with 13B active. Both ship with a 1M-token context window. Pro is also enormous by open-weight standards, larger than Kimi K2.6 at 1.1T and GLM-5.1 at 754B. That scale alone is notable, but the more important part is that [...] (huggingface.co) long traces. (huggingface.co)

### Why did it look better in these demos?

It turns out UI-clone and coding tests reward steadiness more than flash. The YouTube comparisons describe DeepSeek as the model that most consistently held structure, produced cleaner output, and stayed coherent through the full task, while rivals often had isolated strengths: token efficiency for MiMo, decent coding from Kimi, broad competitiveness from Q[...] (huggingface.co) [...]lity in the showcased workflows. That is still anecdotal, not lab science, but it is exactly the kind of anecdote developers care about when picking a daily driver. (youtube.com)

### Is price part of the story?

Very much. DeepSeek V4 is cheap enough to change the conversation. Flash is listed at $0.14 per million input tokens and $0.28 per million output tokens. Pro is $1.74 input and $3.48 output per million tokens. In practice, that puts Flash below several “small” frontier options and Pro below many larger premium models. So if DeepSeek is also winning real workflow tests, the value story gets hard to ignore. (simonwillison.net)

### Does this mean benchmarks stopped mattering?

Not really, but they matter less than they used to. Even DeepSeek’s own writeups admit the benchmark picture is competitive rather than runaway best-in-class. The shift is that buyers now care more about whether a model survives a long agent loop, handles a messy repo, or produces a usable front end without babysitting. That is why p[...] (simonwillison.net) [...]ose the failure modes people actually pay for. (huggingface.co)

### So what changed this week?

Basically, DeepSeek V4 moved from “interesting new open model” to “the one people are stress-testing first.” The new demos do not prove a universal ranking. But they do show a pattern: when the task is practical, long, and visible, DeepSeek V4 is beating Qwen 3.6, Kimi K2.6, and GLM often enough that developers are treating it as the model to beat. (youtube.com)

The story is not that DeepSeek won the internet in one week. It is that open-model competition is getting judged on shipped work now, and in that more useful contest, DeepSeek V4 looks like the current favorite.
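For readers who want to turn the listed prices into a "cost per shipped task" number, here is a minimal Python sketch. The per-million-token prices are the ones quoted above; everything else is an assumption made for illustration, including the model keys, the token counts for the sample run, and the 80% success rate. None of those figures come from the videos or from DeepSeek.

```python
# Prices in USD per million tokens, as quoted in this article.
# The dictionary keys are illustrative labels, not official API model IDs.
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 1.74, "output": 3.48},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run, given token counts and the listed prices."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def cost_per_shipped_task(cost_per_run: float, success_rate: float) -> float:
    """Expected spend per successful task if failed runs are simply retried.

    success_rate is a hypothetical fraction of runs that ship (0 < rate <= 1).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


if __name__ == "__main__":
    # Hypothetical long agent run: 2M input tokens, 200k output tokens.
    for model in PRICES:
        per_run = run_cost(model, 2_000_000, 200_000)
        per_task = cost_per_shipped_task(per_run, success_rate=0.8)
        print(f"{model}: ${per_run:.3f} per run, ${per_task:.3f} per shipped task")
```

Dividing by the success rate is the simplest possible model of retries. It ignores caching discounts, partial failures, and human review time, so treat the output as a rough comparison rather than a budget.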
