DeepSeek runs V4 on Huawei chips
- DeepSeek released a preview of DeepSeek-V4 on April 24, including V4-Pro and V4-Flash, and said the new models are adapted to run on Huawei’s Ascend chips as China pushes local AI infrastructure.
- Nvidia answered the same day with a Blackwell pitch: DeepSeek-V4-Pro has 1.6 trillion total parameters, 49 billion active per token, and secondary coverage citing Nvidia materials puts Blackwell throughput at about 3,500 tokens per second.
- The launch lands amid tighter U.S. chip curbs and China’s push to replace Nvidia-heavy stacks with domestic hardware and software. (reuters.com)
DeepSeek released a preview of DeepSeek-V4 on April 24 and said the new model family is adapted to run on Huawei’s Ascend chips. (reuters.com) The launch includes two models: DeepSeek-V4-Pro and DeepSeek-V4-Flash. Nvidia’s own technical blog says V4-Pro has 1.6 trillion total parameters with 49 billion active per token, while V4-Flash has 284 billion total parameters with 13 billion active. (developer.nvidia.com)

Both models support a 1 million-token context window, which is the amount of text a model can keep in working memory during a session. Nvidia said that long context is aimed at coding, document analysis, retrieval and “agentic” workflows that chain multiple steps together. (developer.nvidia.com)

DeepSeek’s Huawei adaptation puts the software on China’s domestic artificial-intelligence chips instead of relying only on Nvidia’s ecosystem. Reuters reported the move as another step in Beijing’s push for a more self-sufficient AI stack. (reuters.com)

Nvidia moved quickly to show the model also runs on its newest hardware. In an April 24 post, the company said Blackwell has day-zero support for DeepSeek-V4 and described optimizations for the open-source serving software vLLM. (developer.nvidia.com)

Wccftech, citing Nvidia materials, reported throughput of roughly 3,500 tokens per second on Blackwell for DeepSeek-V4-Pro. Nvidia’s official post emphasizes the Blackwell launch support and the model’s lower compute and memory demands, but the exact throughput figure is reported in secondary coverage. (wccftech.com) (developer.nvidia.com)

Nvidia also says V4’s architecture cuts per-token inference floating-point operations by 73% and reduces key-value cache memory burden by 90% versus DeepSeek-V3.2. In plain terms, that means less compute and memory are needed to keep very long conversations and documents in play. (developer.nvidia.com)

Reuters reported that DeepSeek said the Pro version beats other open-source models on its world-knowledge benchmark and trails only Google’s closed-source Gemini-Pro-3.1. That claim comes from DeepSeek, not an independent benchmark operator. (reuters.com)

The timing matters because U.S. export controls have made advanced Nvidia chips harder to sell into China, pushing Chinese developers toward local substitutes. Reuters framed DeepSeek’s Huawei support as evidence that major Chinese models can now be engineered around domestic silicon. (reuters.com)

So the same model is now being used to make two different arguments at once: Huawei can host a frontier Chinese model at home, and Nvidia can still claim the fastest, best-supported path for running it at scale. (reuters.com) (developer.nvidia.com)
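
For readers who want the parameter and efficiency figures in proportion, the arithmetic can be sketched in a few lines of Python. This is only back-of-the-envelope math on the numbers Nvidia's blog reports; the function and variable names are illustrative assumptions, not anything published by DeepSeek or Nvidia, and the script does not measure real hardware.

```python
# Back-of-the-envelope arithmetic on the figures reported in the article.
# All names here are illustrative; nothing below comes from DeepSeek or Nvidia code.

MODELS = {
    "DeepSeek-V4-Pro": {"total_params": 1.6e12, "active_params": 49e9},
    "DeepSeek-V4-Flash": {"total_params": 284e9, "active_params": 13e9},
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a model's parameters that are active for each token."""
    return active_params / total_params


def remaining_after_cut(percent_cut: float) -> float:
    """Fraction of a cost that remains after cutting it by percent_cut percent."""
    return 1.0 - percent_cut / 100.0


for name, params in MODELS.items():
    share = active_fraction(**params)
    print(f"{name}: {share:.1%} of parameters active per token")

# Reported reductions versus DeepSeek-V3.2: 73% fewer per-token FLOPs,
# 90% less key-value cache memory.
flops_left = remaining_after_cut(73)
kv_left = remaining_after_cut(90)
print(f"Per-token FLOPs: {flops_left:.0%} of V3.2, about {1 / flops_left:.1f}x fewer")
print(f"KV cache memory: {kv_left:.0%} of V3.2, about {1 / kv_left:.1f}x smaller")
```

On the reported numbers, V4-Pro activates roughly 3.1% of its parameters per token and V4-Flash roughly 4.6%, while the stated cuts leave about 27% of V3.2's per-token compute and about 10% of its key-value cache memory.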