DeepSeek V4 ships with long‑context sparse‑attention models and open‑weight support
- DeepSeek on April 24 released preview versions of DeepSeek-V4-Pro and DeepSeek-V4-Flash, two open-weight language models built for million-token context windows and cheaper long-running inference on large prompts.
- The flagship V4-Pro has 1.6 trillion total parameters with 49 billion active per token; V4-Flash has 284 billion total and 13 billion active, and both use MIT licensing.
- The launch ties open models to inference economics, with DeepSeek and NVIDIA highlighting lower FLOPs and memory use for agent workloads on Blackwell systems. (developer.nvidia.com)
DeepSeek released preview versions of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, pitching both as open-weight models built for million-token context windows. (developer.nvidia.com) (huggingface.co)

A context window is the text a model can keep “in mind” at once, and the cost usually rises as that window gets longer. DeepSeek says V4 is designed to cut that cost so long document analysis, coding sessions and agent workflows can stay inside one running prompt. (huggingface.co) (developer.nvidia.com)

The larger model, V4-Pro, has 1.6 trillion total parameters with 49 billion active per token. The smaller V4-Flash has 284 billion total parameters with 13 billion active per token, and both support 1 million tokens of context. (developer.nvidia.com)

NVIDIA’s write-up says both models are released under the MIT license and can generate up to 384,000 output tokens through DeepSeek’s application programming interface. NVIDIA also positioned them for deployment on Blackwell systems and GPU-accelerated endpoints. (developer.nvidia.com)

The engineering change is in attention, the part of a transformer model that decides which earlier tokens to look at. DeepSeek V4 mixes compressed sparse attention and heavily compressed attention so the model stores and checks fewer pieces of old text while it keeps working. (developer.nvidia.com) (huggingface.co)

NVIDIA says those changes cut per-token inference floating-point operations by 73% and key-value cache memory by 90% versus DeepSeek-V3.2. Hugging Face’s analysis of the release says V4-Pro needs 27% of V3.2’s single-token inference FLOPs at 1 million tokens and 10% of the key-value cache memory. (developer.nvidia.com) (huggingface.co)

That matters for “agent” systems, where a model keeps appending tool results, logs, retrieved documents and prior steps into one growing transcript. In those setups, the bottleneck is often not training the model but paying to keep the whole working history available during inference. (huggingface.co) (developer.nvidia.com)

DeepSeek also adapted V4 to run on Huawei chips, according to Reuters, extending the model beyond NVIDIA hardware. Reuters reported Huawei said its Ascend supernode lineup supports the DeepSeek V4 series after coordination between model and chip teams. (money.usnews.com)

That puts the release in two races at once: one over open models versus closed systems, and another over which hardware stack carries them. The closing pitch from DeepSeek and its partners is simple: long context is only useful if someone can afford to keep it running. (developer.nvidia.com) (money.usnews.com)
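
For readers who want to check the numbers in this story, the short Python sketch below reruns the arithmetic behind two of the quoted claims: the share of each model's parameters that is active per token, and how NVIDIA's "cut by 73%" and "cut by 90%" line up with Hugging Face's "27%" and "10%" remaining. The parameter counts and percentages are the ones reported above; the function and variable names are made up for illustration and do not come from DeepSeek, NVIDIA or Hugging Face.

```python
# Figures as quoted in the article. The names below are hypothetical helpers
# for this illustration, not part of any DeepSeek or NVIDIA tooling.
MODELS = {
    "DeepSeek-V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "DeepSeek-V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}


def active_fraction(total_params, active_params):
    """Share of all parameters that is active for a single token."""
    return active_params / total_params


def remaining_share(reduction_pct):
    """Convert a 'cut by X%' claim into the 'needs Y% of' form."""
    return 100 - reduction_pct


for name, params in MODELS.items():
    share = active_fraction(**params)
    print(f"{name}: {share:.1%} of parameters active per token")

# NVIDIA's reductions versus DeepSeek-V3.2, restated as what remains.
print("FLOPs remaining:", remaining_share(73), "%")     # 27
print("KV cache remaining:", remaining_share(90), "%")  # 10
```

On the reported figures, roughly 3.1% of V4-Pro's parameters and 4.6% of V4-Flash's are active per token, and the two companies' efficiency numbers describe the same reductions from opposite directions.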