DeepSeek launches V4 models
- DeepSeek launched its DeepSeek-V4 preview on April 24, releasing two open-weight MoE models, V4-Pro and V4-Flash, across chat and API.
- The big number is 1 million tokens by default: V4-Pro uses 49B active parameters, V4-Flash 13B, with aggressive API pricing.
- This matters because open models are pushing from demos into real agent work: long coding, document review, and tool-heavy automation.
DeepSeek just made a very specific bet about where AI is going next. Not just “bigger models,” and not just “cheaper chatbots.” The bet is that the next useful wave is long-running agents: models that read huge amounts of material, call tools over and over, and keep going without falling apart. On April 24, 2026, DeepSeek launched a V4 preview with two open-weight models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, both built around a 1 million-token context window and both available through chat, API, and downloadable weights. (api-docs.deepseek.com)

### What actually launched?

DeepSeek shipped two mixture-of-experts models. V4-Pro is the larger flagship at 1.6 trillion total parameters, with 49 billion active per token. V4-Flash is the smaller, cheaper sibling at 284 billion total and 13 billion active. Both support 1 million tokens of context, and both come in open-weight form rather than staying locked behind an API. (api-docs.deepseek.com)

### Why is 1 million tokens the headline?

Because context length is the thing that breaks a lot of “agent” demos in real use. A model can look smart for a few turns, but once you keep appending tool outputs, logs, files, and intermediate reasoning, the context window becomes the bottleneck. DeepSeek’s pitch is that 1 million tokens is not a stunt. (api-docs.deepseek.com) The company wants developers to treat giant context as normal, not premium. (api-docs.deepseek.com)

### Why is that hard to do cheaply?

The catch is memory and compute. Long context is expensive because every new token has to attend to a mountain of earlier tokens, so both per-token compute and the KV cache grow with context length. DeepSeek says V4-Pro cuts single-token inference FLOPs to 27% of DeepSeek-V3.2 at the 1M-token setting and cuts KV-cache use to 10%. V4-Flash goes further, down to 10% of the FLOPs. (api-docs.deepseek.com) DeepSeek is claiming it made million-token inference practical enough to use, not just benchmark. (huggingface.co)

### What changed in the model design?

The architecture uses a hybrid attention setup. One part compresses and sparsifies attention so the model does not pay the full cost of every earlier token. Another part keeps enough dense information around to avoid turning long context into mush. You can think of it like replacing a giant unfiltered transcript […] (huggingface.co) That is the technical move underneath the launch. (huggingface.co)

### Why is DeepSeek talking so much about agents?

Because V4 is being framed less as a chatbot and more as infrastructure for tool use. DeepSeek says the model is already integrated with agent frameworks like Claude Code, OpenClaw, and OpenCode, and says V4-Pro targets agentic coding in particular. That matters because long software tasks, research […] (huggingface.co) and inference cost collide. (api-docs.deepseek.com)

### Is this also a pricing story?

Very much. DeepSeek’s API page lists V4-Flash at $0.14 per million input tokens on cache miss and $0.28 per million output tokens. V4-Pro is pricier, but still unusually low for a model pitched near the top tier, with a temporary discounted rate of $0.435 input and $0.87 output per million tokens through May 31, 2026. (api-docs.deepseek.com) DeepSeek is not just saying “here is a model.” It is saying “here is one cheap enough to change deployment math.” (api-docs.deepseek.com)

### Are the old DeepSeek models going away?

Yes, at least the old names are. DeepSeek says `deepseek-chat` and `deepseek-reasoner` will be fully retired on July 24, 2026, and for now they route to V4-Flash in non-thinking and thinking modes, respectively. That tells you V4 is not a side experiment. It is the new default line. (api-docs.deepseek.com)

The launch matters because it moves the open-model conversation away from “can it match a benchmark screenshot?” and toward “can a company actually run serious workloads on it?” DeepSeek is arguing yes. The open weights help, the pricing helps, and the million-token default helps. But the real test is still boring and practical: (api-docs.deepseek.com) whether the models can handle long, tool-heavy work without reliability falling apart halfway through. (api-docs.deepseek.com)
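
To make the pricing figures quoted above concrete, here is a minimal Python sketch that estimates what a single request would cost at the listed per-million-token rates. The `Pricing` class, the `PRICES` table, the `request_cost` helper, and the example workload are all hypothetical names and numbers introduced for illustration. The sketch assumes every input token is billed at the cache-miss rate and uses the temporary discounted V4-Pro price, so treat the result as a rough upper-bound estimate, not a billing calculator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens, as quoted in the article."""
    input_per_million: float   # input rate (cache miss assumed)
    output_per_million: float  # output rate


# Rates taken from the article; the V4-Pro row is the temporary discounted rate.
# The dictionary keys are illustrative labels, not official API model IDs.
PRICES = {
    "V4-Flash": Pricing(input_per_million=0.14, output_per_million=0.28),
    "V4-Pro": Pricing(input_per_million=0.435, output_per_million=0.87),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request, assuming no cache hits."""
    price = PRICES[model]
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: a full 1M-token context and a 4,000-token reply.
    for name in PRICES:
        cost = request_cost(name, input_tokens=1_000_000, output_tokens=4_000)
        print(f"{name}: ${cost:.4f}")
```

Under those assumptions, filling the whole 1M-token window once comes to roughly $0.14 on V4-Flash and roughly $0.44 on V4-Pro. An agent that resends a large context on every step multiplies that figure by the number of steps, which is why the cache-miss rate is the number to watch.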