Alibaba’s stealth video AI shines
A stealth video‑generation model from Alibaba reportedly topped global benchmarks, surprising observers inside China’s AI sector. The result suggests rapid advances in synthetic video capability are underway and adds pressure to competitors working on multimodal models. (x.com)
For months, the loudest names in artificial intelligence video were OpenAI, Google, and ByteDance. Then a model called Happy Horse 1.0 appeared on the Artificial Analysis text-to-video leaderboard this week and took the top spot on debut. On April 10, Alibaba confirmed that it had built the model. (cnbc.com)
Text-to-video is exactly what it sounds like: you type “a red train crossing a snowy bridge at dusk,” and the system tries to turn that sentence into moving images. The hard part is not drawing one pretty frame, but keeping the train, bridge, snow, and camera motion consistent across dozens of frames in a row. (github.com)
That is why benchmark tests exist. VBench, a widely used video-generation benchmark introduced by researchers behind Vchitect and OpenCompass, scores models on dimensions such as motion, subject consistency, and how well the video matches the prompt. (github.com)
Alibaba was already in this race before Happy Horse showed up. The company released its Wan2.1 video model in early 2025, said it topped the VBench leaderboard, and open-sourced 14-billion-parameter and 1.3-billion-parameter versions in February 2025. (alibabacloud.com)
Wan matters because it lowered the hardware bar. Alibaba’s Hugging Face page says the smaller Wan2.1 text-to-video model can run with 8.19 gigabytes of video memory and generate a 5-second 480p clip on an Nvidia GeForce RTX 4090 in about 4 minutes. (huggingface.co)
Under the hood, Wan is built on what researchers call a diffusion transformer. In plain English, that means the model starts with visual noise, cleans it up step by step like sharpening a blurry photo, and uses transformer blocks to keep track of what should stay the same from frame to frame. (arxiv.org) Alibaba’s own technical report says Wan improved video generation with a new variational autoencoder, large-scale data curation, and automated evaluation.
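For readers who want to see the "start with noise, clean it up step by step" idea in concrete form, here is a toy sketch in Python. It is not Wan's method or code. Every name in it (make_clean_clip, toy_noise_estimate, denoise) is hypothetical, and the "model" is a stand-in that cheats by already knowing the clean clip, whereas a real system would use a trained transformer guided by the text prompt. The sketch also ignores the frame-to-frame consistency that the transformer blocks are said to handle.

```python
import random


def make_clean_clip(num_frames=8, width=16):
    """Build a toy 'video': each frame is one row of pixels with a bright dot that moves right."""
    clip = []
    for t in range(num_frames):
        frame = [0.0] * width
        frame[t % width] = 1.0
        clip.append(frame)
    return clip


def mean_abs_error(clip, target):
    """Average absolute pixel difference between two clips of the same shape."""
    total = sum(abs(p - q) for f, g in zip(clip, target) for p, q in zip(f, g))
    count = sum(len(f) for f in clip)
    return total / count


def toy_noise_estimate(noisy_clip, clean_clip):
    """Hypothetical stand-in for the learned network.

    A real model would estimate the noise from the noisy frames and the text
    prompt. This toy version cheats by comparing against the known clean clip.
    """
    return [[n - c for n, c in zip(nf, cf)] for nf, cf in zip(noisy_clip, clean_clip)]


def denoise(clean_clip, steps=20, seed=0):
    """Start from pure Gaussian noise and remove a share of the estimated noise each step."""
    rng = random.Random(seed)
    clip = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in clean_clip]
    errors = []
    for step in range(steps):
        noise = toy_noise_estimate(clip, clean_clip)
        share = 1.0 / (steps - step)  # 1/20, 1/19, ..., 1/1: the last step removes what is left
        clip = [[p - share * e for p, e in zip(f, ef)] for f, ef in zip(clip, noise)]
        errors.append(mean_abs_error(clip, clean_clip))
    return clip, errors


if __name__ == "__main__":
    clean = make_clean_clip()
    _, errors = denoise(clean)
    for step in (0, 9, 19):
        print(f"after step {step + 1:2d}: mean error {errors[step]:.4f}")
```

Running the script prints the average pixel error at three points in the loop, which should shrink as the noise is removed. Production systems are generally described as running this kind of loop on a compressed code rather than on raw pixels, which is where the autoencoder comes in.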
A variational autoencoder is a compression system: it squeezes video into a smaller code the model can handle, then expands it back out into frames. (arxiv.org)
The surprise this week was not just that Alibaba had another video model. CNBC reported that Happy Horse had climbed global rankings before its owner was disclosed, which meant people inside China’s artificial intelligence industry were reacting to the score first and the branding second. (cnbc.com)
Bloomberg reported that Happy Horse 1.0 hit No. 1 on the Artificial Analysis text-to-video leaderboard and that Alibaba only claimed ownership on Friday, April 10. That kind of anonymous launch is unusual in a field where companies normally attach a giant product event to every benchmark win. (bloomberg.com)
The timing adds pressure because China’s biggest internet groups are all pushing multimodal systems at once. South China Morning Post reported in February that Alibaba’s Qwen 3.5 and Zhipu’s GLM-5 were part of a domestic sprint to answer recent releases from United States firms, while another South China Morning Post report on April 2 said Alibaba had started keeping some fresh models proprietary instead of open-sourcing everything. (scmp.com 1) (scmp.com 2)
That split tells you where the market is going. Alibaba can use open models like Wan2.1 to win developers, use closed models like Happy Horse to chase paid demand through Alibaba Cloud, and use leaderboard wins to argue that Chinese labs are no longer playing catch-up in synthetic video. (alibabacloud.com) (cnbc.com)