DeepSeek‑V4 adds million‑token context

- DeepSeek launched preview versions of DeepSeek‑V4‑Pro and V4‑Flash on April 24, open‑sourcing both models and making 1M‑token context standard across its services.
- The headline number is not just 1M tokens. DeepSeek says V4‑Pro cuts per‑token long‑context compute to 27% of V3.2 and KV cache to 10%.
- That matters because long‑running coding agents break on memory and cost before they break on raw IQ.

A million‑token context window sounds like a spec sheet flex. Usually it is. Models advertise giant context, then fall apart when you actually try to keep a long coding session, research trace, or multi‑file repo alive for hours.

That is why DeepSeek‑V4 is interesting. The real news is not just that DeepSeek shipped 1M context on April 24. It shipped two open models, V4‑Pro and V4‑Flash, and framed the whole release around making that context cheap enough to use, not just possible to demo.

### What actually shipped?

DeepSeek released DeepSeek‑V4‑Pro and DeepSeek‑V4‑Flash as preview models, with open weights and API access on the same day. Pro is a 1.6T‑parameter MoE model with 49B active parameters. Flash is 284B total with 13B active. Both support a 1M‑token context window, and DeepSeek says that 1M is now the default across its official services. (api-docs.deepseek.com)

### Why is 1M tokens a big deal?

Because 1M tokens is large enough to hold things people actually care about: long books, giant contracts, sprawling chat histories, or a serious chunk of a codebase. But the catch is that “can fit” and “can use” are different. A long‑running agent keeps appending tool outputs, file contents, terminal logs, and intermediate reasoning. The context grows every step, so each next step gets slower and more memory‑hungry. (api-docs.deepseek.com)

### So what did DeepSeek change?

DeepSeek’s pitch is an efficiency story. It uses a hybrid attention design, Compressed Sparse Attention plus Heavily Compressed Attention, instead of paying the usual full long‑context cost all the way through. In DeepSeek’s own numbers, V4‑Pro at 1M tokens needs 27% of the single‑token inference FLOPs of V3.2 and 10% of the KV cache. Flash goes lower still. (huggingface.co)

### Why does KV cache matter so much?

KV cache is basically the model’s working memory for the tokens it has already seen. In long sessions, that memory bill explodes.
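
To put rough numbers on that memory bill, here is a back‑of‑envelope sketch in Python. The model shape (60 layers, 8 KV heads, 128‑dimensional heads, 16‑bit values) is a made‑up example of a conventional attention stack, not the configuration of V3.2 or V4, and `AttentionShape` and `kv_cache_bytes` are illustrative names. The only figure taken from DeepSeek is the 10% ratio, applied here to the hypothetical baseline purely to show scale.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class AttentionShape:
    """Hypothetical model shape for illustration; NOT a real DeepSeek config."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int  # 2 for fp16/bf16


def kv_cache_bytes(shape: AttentionShape, tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence of `tokens`."""
    # One key vector and one value vector per token, per layer, per KV head.
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * tokens


# Assumed baseline: a plain attention stack with no cache compression.
baseline = AttentionShape(layers=60, kv_heads=8, head_dim=128, bytes_per_value=2)

full = kv_cache_bytes(baseline, 1_000_000)
# Apply the 10% ratio DeepSeek reports (V4-Pro vs. V3.2) to this hypothetical
# baseline. This shows scale only; it does not model either real system.
reduced = full * 10 // 100

if __name__ == "__main__":
    for label, size in (("uncompressed", full), ("at 10%", reduced)):
        print(f"{label:>12}: {size / GIB:.1f} GiB per 1M-token sequence")
```

Under these assumed dimensions, the uncompressed cache for a single 1M‑token sequence comes to around 229 GiB, and a tenth of that is around 23 GiB. The cache grows linearly with context length, so every extra tool output or log dump an agent appends adds to it directly.
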
If the cache gets too big, you either need much more expensive hardware or you start cutting corners: shorter histories, more summarization, more resets. DeepSeek is trying to attack exactly that bottleneck, which is why this release lands more like infrastructure than marketing. (openlm.ai)

### Is this mainly for chat, or for agents?

Mostly agents. DeepSeek leans hard on coding and tool‑using workflows, and its docs already show integrations with agent tools and API features like tool calls, thinking modes, and context caching. That makes sense: ordinary chat rarely needs 1M tokens, but coding agents and browsing agents absolutely can. They are the workloads that keep tripping over context limits, cache costs, and degraded performance halfway through a task. (huggingface.co)

### Does the pricing change the story?

Yes, a lot. DeepSeek also cut cache‑hit pricing across models to one‑tenth of the launch price, and V4‑Pro is currently listed with a 75% promotional discount through May 31, 2026. That matters because giant context without cheap reuse is a toy. If repeated prefixes can be cached cheaply, long documents and multi‑turn agent runs become much more practical. (api-docs.deepseek.com)

### Are old benchmarks enough here?

Not really. Benchmarks like MMLU still tell you something about general knowledge and reasoning, and DeepSeek highlights those numbers. But million‑token models live or die on a different axis: whether they can retrieve the right detail 700,000 tokens back, survive long tool loops, and stay coherent across a huge working trace. That is a systems problem as much as a pure intelligence problem. (api-docs.deepseek.com) DeepSeek’s release is basically a bet that this is where model evaluation has to go next.

### Bottom line?

DeepSeek did not just make the context window bigger. It tried to make long context operational.
If that holds up outside DeepSeek’s own tests, the important shift is simple: frontier model competition stops being only about the smartest answer per prompt, and starts being about who can keep a useful process alive the longest. (api-docs.deepseek.com)

