Microsoft adds three Azure models
What happened
Microsoft launched three foundational AI models on Azure AI, adding its own in-house models to a field dominated by a handful of major suppliers. That broadens the options for teams that want cloud-hosted models for speech recognition, voice output, or image generation, and it can change where robotics stacks source their inference workloads. (extremetech.com)
Why it matters
Microsoft announced three new in-house models that turn spoken language into text, synthesize highly realistic speech, and generate images, and made them available through its Foundry platform on April 2, 2026. (microsoft.ai) A "foundational" model in this context means a large, pre-trained neural network that developers can use directly or adapt to many downstream tasks. Because those capabilities live inside Microsoft's cloud offering, teams can route tasks like speech recognition, voice agents, and large-scale image generation to Azure rather than running them only on local hardware inside a robot or device. (microsoft.ai)
The three models have names and specific targets:
- MAI-Transcribe-1 is the speech-to-text system. Microsoft says it handles the 25 most-used languages and delivers about 2.5× faster batch transcription than the previous Azure "Fast" option. The announcement also showed lower word error rates (the percentage of words transcribed incorrectly) compared with several competitors. (microsoft.ai)
- MAI-Voice-1 is the audio-generation model. It can create a custom voice from a few seconds of sample audio and, according to Microsoft, can produce 60 seconds of output audio in roughly one second of wall-clock time. Microsoft says it uses much less compute per second because it is optimized for efficient use of graphics processors (GPUs, the chips commonly used to accelerate neural-network work). (microsoft.ai)
- MAI-Image-2 is billed as the second-generation image model, with at least 2× faster generation on Foundry and Copilot while keeping similar visual quality. Microsoft says it has already started phased rollouts into Bing and PowerPoint after an initial debut on Microsoft's MAI Playground.
Tech press coverage notes that Microsoft positions these models as lower-cost alternatives to other major providers and publishes per-use pricing for each model family.
Microsoft says the models were developed by its MAI research organization and were put through internal red-teaming and the Foundry governance controls that provide enterprise guardrails and compliance features. The company also framed the releases as part of a broader push to ship more models into Microsoft products while maintaining its partnership with external labs.
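For readers who want to check a transcription claim on their own audio, word error rate is commonly computed as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The following is a minimal sketch of that common definition, not Microsoft's benchmark method; the function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped (deletion)
                curr[j - 1] + 1,     # extra hypothesis word (insertion)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical robot voice command and a flawed transcript of it.
    reference = "move the arm to the left"
    hypothesis = "move arm to the right"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this example one word is dropped and one is substituted against a six-word reference, so the rate is 2/6. Because insertions count as errors, the measure can exceed 100 percent on very poor transcripts.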
Key numbers
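- 3: new in-house models (speech-to-text, speech synthesis, image generation), available through Foundry as of April 2, 2026. (extremetech.com)
- 25: languages MAI-Transcribe-1 handles, according to Microsoft. (microsoft.ai)
- About 2.5×: claimed batch-transcription speedup of MAI-Transcribe-1 over the previous Azure "Fast" option. (microsoft.ai)
- 60 seconds in roughly 1 second: audio MAI-Voice-1 can produce per second of wall-clock time, according to Microsoft. (microsoft.ai)
- At least 2×: claimed generation speedup of MAI-Image-2 on Foundry and Copilot.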
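To get a feel for what the reported speed figures would mean for a workload, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions: the ratios are Microsoft's own claims, throughput is assumed to scale linearly with job size, and the function names and the 100-second baseline job are hypothetical.

```python
def wall_clock_seconds(audio_seconds: float, audio_per_wall_second: float = 60.0) -> float:
    """Estimated generation time, assuming a constant audio-per-second ratio.

    The default of 60.0 reflects the claim of 60 seconds of audio in roughly
    one second of wall-clock time; real throughput may vary.
    """
    if audio_per_wall_second <= 0:
        raise ValueError("audio_per_wall_second must be positive")
    return audio_seconds / audio_per_wall_second


def sped_up_seconds(baseline_seconds: float, speedup: float) -> float:
    """Estimated job time after applying a claimed speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


if __name__ == "__main__":
    # One hour of synthesized narration at the claimed voice-generation ratio.
    print(wall_clock_seconds(3600))
    # A hypothetical 100-second batch transcription job at the claimed 2.5x speedup.
    print(sped_up_seconds(100, 2.5))
    # A hypothetical 10-second image generation at the claimed 2x lower bound.
    print(sped_up_seconds(10, 2.0))
```

These numbers are upper bounds on the benefit: network latency, queueing, and request overhead for a cloud call are not modeled, and they matter for robots that need responses in real time.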