Alibaba’s multimodal leap

Alibaba’s Qwen3.5‑Omni is being touted as a natively multimodal model that handles text, images, audio, and video in a single pipeline; reviewers say it can generate code zero‑shot from combined speech and video input. (x.com) Early comparisons in the same X thread claim it outperformed Google’s Gemini 3.1 Pro on audio tasks, which posters read as a sign of faster cross‑modal capabilities for products in autonomous systems and video analytics. (x.com)

Alibaba publicly released Qwen3.5‑Omni on March 30, 2026 as the latest model in the Qwen3.5 line. (aihola.com) The release comes in three variants (Plus, Flash, and Light), and the team advertises a maximum context window of 256,000 tokens, which the company equates to more than 10 hours of audio or roughly 400 seconds of 720p video at 1 fps. (qwen.ai) The flagship open‑weight model is billed as Qwen3.5‑397B‑A17B, a sparse Mixture‑of‑Experts (MoE) design with 397 billion total parameters, of which about 17 billion are active per token, a split intended to cut activation memory and improve inference efficiency. (qwen.ai) Alibaba’s technical notes and launch materials state that the Qwen3.5 family achieved 215 state‑of‑the‑art results across audio and audio‑visual subtasks and that language coverage for the series expanded to 201 languages and dialects. (alibabagroup.com) According to early launch writeups, the new Omni build adds speech processing in 113 languages, native voice cloning, and real‑time speech generation. (apidog.com) Alibaba published the open weights and code repositories and is offering access through Alibaba Cloud Model Studio, which it says supports OpenAI‑ and Anthropic‑compatible API specifications for developer integration. (github.com)
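
To put the quoted figures in perspective, the short Python sketch below works out what they imply per second of input and per token. It is back‑of‑the‑envelope arithmetic on the vendor's own numbers, not a measurement: it assumes the whole 256,000‑token window is filled by a single modality, treats "more than 10 hours" as exactly 10 hours and "roughly 400 seconds" as exactly 400, and ignores prompt and output tokens. The variable and function names are illustrative.

```python
# Figures as quoted in the article (vendor claims, not independent measurements).
CONTEXT_TOKENS = 256_000      # advertised maximum context window
AUDIO_SECONDS = 10 * 3600     # "more than 10 hours" of audio, taken as exactly 10 h
VIDEO_SECONDS = 400           # "roughly 400 seconds" of 720p video at 1 fps
TOTAL_PARAMS_B = 397          # total parameters, in billions
ACTIVE_PARAMS_B = 17          # parameters active per token, in billions


def tokens_per_second(context_tokens: int, seconds: int) -> float:
    """Implied token budget per second if one modality fills the whole window."""
    return context_tokens / seconds


# Because the audio claim is "more than" 10 hours, this is an upper bound.
audio_rate = tokens_per_second(CONTEXT_TOKENS, AUDIO_SECONDS)

# At 1 fps, tokens per second of video equals tokens per sampled frame.
video_rate = tokens_per_second(CONTEXT_TOKENS, VIDEO_SECONDS)

# Share of the MoE model's parameters that take part in each token.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(f"audio: at most {audio_rate:.1f} tokens per second")
print(f"video: about {video_rate:.0f} tokens per second (per frame at 1 fps)")
print(f"active parameters per token: {active_fraction:.1%} of the total")
```

Under those assumptions, the claims work out to roughly 7 tokens per second of audio, about 640 tokens per video frame, and a little over 4% of the model's parameters active on any given token.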
