DeepSeek previews V4, 1.6T params
- DeepSeek said on April 24 it released preview versions of DeepSeek-V4-Pro and V4-Flash, open-weight language models with one-million-token context windows.
- The flagship V4-Pro uses 1.6 trillion total parameters with 49 billion active, while DeepSeek says 1M-token inference needs 27% of V3.2 FLOPs.
- The launch shifts DeepSeek from V3.2 to agent-focused long-context models and starts retiring older chat endpoints in July. (api-docs.deepseek.com)
A language model works by predicting the next token from everything already in its prompt, and that gets expensive as the prompt gets longer. DeepSeek’s new V4 preview is built around making very long prompts cheaper to run. (huggingface.co) (api-docs.deepseek.com)

DeepSeek said on April 24 that it released two open-weight preview models: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Both support a 1,000,000-token context window, which is large enough to hold long codebases, tool traces, or document collections in a single session. (api-docs.deepseek.com) (huggingface.co)

The larger model, V4-Pro, is a mixture-of-experts system with 1.6 trillion total parameters, of which 49 billion are active at a time. V4-Flash is smaller, at 284 billion total parameters and 13 billion active. (api-docs.deepseek.com) (github.com)

Mixture-of-experts models split work across specialized subnetworks instead of using every parameter on every token. That design lets companies advertise very large total parameter counts while keeping the active compute closer to that of a much smaller model. (github.com) (huggingface.co)

The main engineering claim is about attention, the mechanism that lets a model look back across prior tokens. DeepSeek said V4 combines Compressed Sparse Attention and Heavily Compressed Attention, then interleaves them across layers to cut the cost of million-token inference. (github.com) (huggingface.co)

At a 1-million-token context, DeepSeek said V4-Pro needs 27% of the single-token inference floating-point operations of DeepSeek-V3.2 and 10% of the key-value cache memory. Hugging Face’s write-up said V4-Flash pushes those figures lower, to 10% of the FLOPs and 7% of the cache. (github.com) (huggingface.co)

The key-value cache is the running memory a model stores so it does not recompute every earlier token from scratch. For long-running coding or browsing agents, that cache often becomes the practical bottleneck before the model reaches its advertised context limit.
(huggingface.co)

DeepSeek also said it trained the models with the Muon optimizer and pre-trained them on more than 32 trillion tokens before post-training. The company described a two-stage post-training process that first builds domain-specific experts, then consolidates them with on-policy distillation. (github.com) (huggingface.co)

The company is pitching V4 as an agent model, not just a chatbot. Its announcement said V4 is integrated with tools including Claude Code, OpenClaw and OpenCode, and Ollama lists cloud-hosted V4-Pro and V4-Flash variants with three reasoning modes. (api-docs.deepseek.com) (ollama.com 1) (ollama.com 2)

DeepSeek’s own benchmark tables show V4-Pro-Max scoring 83.5 on MRCR 1M, 62.0 on CorpusQA 1M, 67.9 on Terminal Bench 2.0, and 80.6 on SWE Verified. Those are company-published results, but they line up with the product’s focus on long-context and software-agent workloads. (ollama.com)

The release also starts a platform transition. DeepSeek said developers can keep the same base URL and switch model names to deepseek-v4-pro or deepseek-v4-flash, while deepseek-chat and deepseek-reasoner will be retired on July 24, 2026 at 15:59 UTC. (api-docs.deepseek.com)

That makes the V4 preview less a one-off model drop than a reset of DeepSeek’s public stack around million-token, agent-heavy use. The company opened the weights on Hugging Face and exposed the new models through its API and Ollama at the same time. (api-docs.deepseek.com) (huggingface.co) (ollama.com)
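
For readers who want to put the quoted parameter and cost figures side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the class and field names are made up for this example, the numbers are the ones reported in the article (the V4-Flash cost figures come from Hugging Face's write-up, not DeepSeek's announcement), and the "reduction factor" is just the reciprocal of each reported percentage, not a figure either source published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedModel:
    """Figures as reported in the article; the names here are illustrative."""
    name: str
    total_params_b: float    # total parameters, in billions
    active_params_b: float   # parameters active per token, in billions
    flops_vs_v32: float      # single-token FLOPs at 1M context, as a share of V3.2
    kv_cache_vs_v32: float   # KV cache memory at 1M context, as a share of V3.2

    @property
    def active_fraction(self) -> float:
        # Share of the model's parameters used for any one token.
        return self.active_params_b / self.total_params_b


MODELS = [
    ReportedModel("DeepSeek-V4-Pro", 1600.0, 49.0, 0.27, 0.10),
    # Flash cost ratios are the ones attributed to Hugging Face's write-up.
    ReportedModel("DeepSeek-V4-Flash", 284.0, 13.0, 0.10, 0.07),
]


def summarize(model: ReportedModel) -> str:
    # Reciprocals turn "X% of V3.2" into "reduced by a factor of 1/X".
    flops_factor = 1.0 / model.flops_vs_v32
    cache_factor = 1.0 / model.kv_cache_vs_v32
    return (
        f"{model.name}: {model.active_fraction:.1%} of parameters active per token; "
        f"FLOPs reduced by a factor of {flops_factor:.1f} and "
        f"KV cache by a factor of {cache_factor:.1f} versus V3.2 at 1M tokens"
    )


for m in MODELS:
    print(summarize(m))
```

On these reported numbers, only about 3% of V4-Pro's parameters and under 5% of V4-Flash's are active for a given token, which is the gap between headline size and per-token compute that the mixture-of-experts design is meant to create.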