DeepSeek V4 boosts inference speed
- DeepSeek released DeepSeek-V4 on April 24, adding DeepSeek-V4-Flash and DeepSeek-V4-Pro to its API, while LMSYS shipped day-one SGLang and Miles support for inference and reinforcement-learning training.
- The new models support 1 million-token context windows and up to 384,000 output tokens, while SGLang says DeepSeek-V4-Pro uses about 27% of V3.2’s per-token inference FLOPs at 1M context.
- DeepSeek also replaced its older API aliases with V4-era models and is discounting V4 Pro through May 5, widening access to long-context serving. (api-docs.deepseek.com)

Large language models read text in chunks called tokens, and long context means keeping far more of those chunks available during an answer. DeepSeek’s new V4 models push that window to 1 million tokens. (docs.sglang.io) (api-docs.deepseek.com)

DeepSeek released DeepSeek-V4 on April 24 with two variants: DeepSeek-V4-Flash and DeepSeek-V4-Pro. The company’s API docs list both models as live now in OpenAI-compatible and Anthropic-compatible formats. (docs.sglang.io) (api-docs.deepseek.com)

LMSYS followed on April 25 with what it called day-zero support for DeepSeek-V4 in SGLang for serving and Miles for reinforcement-learning training. Its post said the stack was built around DeepSeek-V4’s hybrid sparse-attention design, manifold-constrained hyper-connections, and FP4 expert weights. (lmsys.org)

Attention is the mechanism that decides which earlier words a model should keep looking at, and compression is the shortcut that stores less of that history. SGLang’s DeepSeek-V4 guide says the Pro model uses about 27% of DeepSeek-V3.2’s per-token inference FLOPs and about 10% of its key-value cache at a 1 million-token context length. (docs.sglang.io)

The serving side is where most of the speed story sits. LMSYS listed ShadowRadix prefix caching, Flash Compressor, speculative decoding, Lightning TopK, and hierarchical multi-stream overlap among the new inference optimizations for DeepSeek-V4. (lmsys.org)

DeepSeek’s own API docs show why developers care about those changes. Both V4 models support a 1 million-token context window and a maximum output of 384,000 tokens, while keeping tool calls and JavaScript Object Notation output enabled. (api-docs.deepseek.com)

The company is also shifting its product lineup around V4. DeepSeek says the older `deepseek-chat` and `deepseek-reasoner` names will be deprecated on July 24, 2026, and currently map to non-thinking and thinking modes of DeepSeek-V4-Flash. (api-docs.deepseek.com)

Price is part of the launch. DeepSeek lists DeepSeek-V4-Pro at $1.74 per 1 million input tokens on cache miss and $3.48 per 1 million output tokens, with a limited-time 75% discount through May 5, 2026 at 15:59 Coordinated Universal Time. (api-docs.deepseek.com)

The hardware targets show who this release is for. SGLang’s deployment guide says Flash can run on four B200, GB300, or H200 graphics processing units, while Pro is aimed at larger setups including eight B200s or sixteen H200s across two nodes. (docs.sglang.io)

The result is less about one benchmark chart than about moving a new model into production on launch week. DeepSeek shipped the model, and the open-source serving stack around it arrived a day later. (api-docs.deepseek.com) (lmsys.org)
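
For a rough sense of what the listed DeepSeek-V4-Pro prices mean per request, the arithmetic can be sketched in Python. This is a back-of-the-envelope estimate, not DeepSeek's billing logic: the function name `estimate_cost_usd` is hypothetical, every input token is assumed to be billed at the cache-miss rate (cache-hit pricing is not covered here), and the 75% discount is assumed to apply equally to input and output prices.

```python
# List prices quoted above for DeepSeek-V4-Pro, in US dollars per 1 million tokens.
INPUT_USD_PER_M = 1.74   # input tokens on cache miss
OUTPUT_USD_PER_M = 3.48  # output tokens


def estimate_cost_usd(input_tokens: int, output_tokens: int, discount: float = 0.0) -> float:
    """Estimate the cost of one request in US dollars.

    Assumptions (not taken from DeepSeek's docs): all input tokens are
    cache misses, and `discount` (0.0 to 1.0) applies to both prices.
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0.0 and 1.0")
    list_price = (
        input_tokens / 1_000_000 * INPUT_USD_PER_M
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_M
    )
    return list_price * (1.0 - discount)


# A request that fills the 1 million-token window and uses the maximum output.
full_price = estimate_cost_usd(1_000_000, 384_000)
promo_price = estimate_cost_usd(1_000_000, 384_000, discount=0.75)
print(f"list price:  ${full_price:.2f}")
print(f"with promo:  ${promo_price:.2f}")
```

Under those assumptions, a request that uses the full context window and the maximum output costs a little over $3 at list price and roughly a quarter of that during the discount period.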