Alibaba's New AI Models Run Video on Phones
Alibaba's new Qwen3.5 AI models are reportedly delivering state-of-the-art multimodal performance with breakthrough efficiency. The models are capable of running video tasks on consumer devices, pointing to a future of hybrid, on-device AI that could reduce server loads, latency, and costs for newsroom applications.
Alibaba's Qwen3-VL models are engineered for demanding multimodal tasks and can process and analyze hours-long videos with full recall and second-level indexing for locating specific events. This is enabled by a native 256K-token context window, expandable to 1 million tokens, which lets the model handle extensive video footage or large documents in a single pass.
The architecture behind this performance includes Interleaved-MRoPE, a positional embedding scheme intended to strengthen reasoning over long video sequences, and DeepStack, which fuses multi-level features from the Vision Transformer to capture fine-grained details. For efficiency, Qwen3.5 uses a sparse Mixture-of-Experts (MoE) architecture: only a fraction of the model's total parameters are activated during inference, which boosts speed and reduces computational cost.
On-device deployment of such models points to a significant shift in infrastructure strategy for video platforms. Edge AI processing can offer substantial total cost of ownership (TCO) savings over five years, with some analyses indicating up to a 70% reduction in operational expenses compared with public cloud offerings, largely by minimizing bandwidth and recurring data transfer fees. This hybrid model, in which part of the processing happens locally, also reduces latency, a critical factor for real-time video editing and analysis tools.
For newsrooms, the adoption of AI is cautious but deliberate, with a focus on tools that enhance efficiency without compromising journalistic integrity. A 2025 survey of UK journalists found that AI is most commonly used for language-processing tasks like transcription and captioning (49% use it at least monthly). More advanced uses like story research (22%) and headline generation (16%) are emerging as newsroom leaders prioritize core journalism skills over specialized AI expertise.
The key purchasing driver for newsrooms is not replacing journalists but augmenting their capabilities: automating repetitive tasks frees up time for in-depth reporting and analysis. Tools that can quickly clip highlights, repurpose live feeds, and generate summaries are seeing increased adoption. More than 70% of newsrooms now prioritize digital platforms over traditional broadcast when breaking a story, making the speed gains from AI-powered workflows a direct competitive advantage.
The Qwen3.5 models also feature advanced "visual agent" capabilities, allowing them to operate PC and mobile graphical user interfaces (GUIs). The AI can recognize on-screen elements, understand their functions, and execute tasks, pointing to a future of more automated and interactive video creation and editing processes. This aligns with the growth of agentic AI in media, a market projected to expand significantly by 2030 as it accelerates content creation and workflow optimization.
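
For readers curious how a sparse Mixture-of-Experts layer keeps most parameters idle, the toy sketch below illustrates the general top-k routing idea. It is not Alibaba's implementation: the expert count, the scalar "experts," the random gate scores, and the function names (`route_top_k`, `moe_forward`) are all assumptions chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_scores[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    selected = route_top_k(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in selected)
    used = [i for i, _ in selected]
    return output, used


# Hypothetical setup: 8 toy experts, each just multiplies its input.
NUM_EXPERTS = 8
experts = [lambda x, scale=s: scale * x for s in range(1, NUM_EXPERTS + 1)]

# In a real model the gate scores come from a learned router; here they
# are random placeholders.
rng = random.Random(0)
gate_scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

y, used = moe_forward(3.0, experts, gate_scores, k=2)
active_fraction = len(used) / NUM_EXPERTS
print(f"experts used: {used}, active fraction: {active_fraction:.2f}")
```

With two of eight experts selected, only a quarter of the expert parameters do any work for this input, which is the source of the speed and cost savings described above. Production models apply the same idea per token across many layers, with far larger experts.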