Microsoft Open-Sources Multimodal AI

Microsoft has released Phi-4-reasoning-vision-15B, a new open-source model capable of processing text, images, and scientific charts. The hardware-efficient model is designed for complex analysis in research and enterprise settings, reflecting a broader trend of democratizing advanced AI tools beyond closed, proprietary systems.

A standout feature is the model's ability to selectively engage in deep reasoning. For complex problems, it can invoke a "chain-of-thought" process using `<think>` tags, but for simpler perception tasks, it defaults to a faster, direct inference mode to conserve resources.

The model's architecture combines two existing components: the Phi-4 Reasoning language model and the SigLIP-2 vision encoder. It uses a "mid-fusion" technique where only some layers handle multimodal data, a design choice that significantly reduces hardware demand compared to fully multimodal systems. Training was remarkably efficient, completed in just four days on 240 NVIDIA B200 GPUs. Microsoft focused on high-quality, curated data, even using GPT-4o to correct and generate new captions for images in open-source datasets, a departure from the massive, unfiltered data used for many larger models.

On specific benchmarks, Phi-4-reasoning-vision-15B punches above its weight. In a multimodal mathematics test called MathVista_Mini, it scored 17 percent higher than Google's gemma-3-12b-it. However, Microsoft's own data shows its performance is mixed, outperforming larger models in some areas while lagging in others.

Its design is particularly well-suited for building AI agents that can interact with software. The model can analyze screenshots to understand and ground user interface elements like buttons and menus, making it a strong foundation for models that navigate desktop and mobile applications.

This release is part of Microsoft's broader strategy with its "Phi" family of models, which challenge the idea that cutting-edge reasoning requires enormous parameter counts. Other specialized variants include Phi-4-reasoning-plus, which uses reinforcement learning to boost accuracy on complex tasks.
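For readers building on the selective reasoning behavior described above, a rough sketch of how an application might separate the optional reasoning trace from the final answer follows. It assumes that any reasoning arrives as a single `<think>...</think>` block inside the response text; this brief does not specify the exact output format, and the function name `split_reasoning` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: reasoning, when present, is wrapped in one <think>...</think> block.
# The real output format of the model is not described in this brief.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> Tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None for direct-mode replies."""
    match = THINK_BLOCK.search(response)
    if match is None:
        # No think block: treat the whole response as the answer.
        return None, response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    print(split_reasoning("<think>2+2 is 4</think>The answer is 4."))
    print(split_reasoning("A red button labeled Submit."))
```

Handling both cases in one function means downstream code does not need to know in advance whether the model chose to reason or to answer directly.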
