Alibaba Releases 397B Multimodal Agent Model
Alibaba has released Qwen3.5 VLM, a 397B parameter native multimodal model that processes text, vision, audio, and video. The model is available via NVIDIA GPU-accelerated endpoints, designed for production-grade agentic applications like multimodal RAG, document understanding, and video Q&A.
The model's architecture combines a sparse Mixture-of-Experts (MoE) design with Gated Delta Networks. While it has 397 billion total parameters, only 17 billion are active during inference for any given token, a key factor for its efficiency. The MoE structure utilizes 512 experts in total, routing 11 of them for each token. This architecture delivers a significant performance boost, with decoding throughput that is 8.6 to 19 times faster than its predecessor, Qwen3-Max. Unlike previous Qwen vision models that required separate adapters, Qwen3.5 is a natively multimodal model, using an early-fusion technique on its training data. It features a 256K token context window that can be extended up to 1 million tokens. This large context capacity allows the model to natively process and reason over approximately two hours of video content without pre-processing. On several academic benchmarks, Qwen3.5-397B shows strong performance, outscoring OpenAI's GPT-4o on tasks like GPQA (Graduate-Level Science Q&A), MMLU-Pro, and the SWE-Bench for software engineering. It is released under an Apache 2.0 license, with weights available on Hugging Face and ModelScope. For fine-tuning and customization, the model is supported by the NVIDIA NeMo framework, which enables methods like LoRA and full supervised fine-tuning (SFT). For production deployment, it's available as a containerized microservice through NVIDIA NIM, which can be run on-premises or in the cloud and is compatible with serving frameworks like vLLM. The model is specifically optimized for agentic workflows, demonstrating capabilities in navigating both mobile and web user interfaces. To support this, Alibaba has also released open-source frameworks like Qwen-Agent and Qwen Code to help developers build applications that utilize the model's tool use and planning abilities.