Microsoft doubles down

Microsoft is building its own MAI-branded voice, transcription and image models as part of a push to reduce reliance on third-party model vendors and own more of the AI stack. The move signals a more plural generative-AI market where platform-level models compete with specialist providers, which matters for designers who should optimise workflows for flexibility rather than loyalty to a single image model. (cloudwars.com)

On April 2, 2026, Microsoft put three in-house artificial intelligence models into Microsoft Foundry: one that turns speech into text, one that generates speech, and one that makes images. The names are MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2. (microsoft.ai) That is a bigger shift than a routine product launch, because Microsoft spent the last two years selling cloud access to outside models, especially from OpenAI. Now it is shipping Microsoft-branded models in categories where OpenAI, Google, and specialist labs already compete. (techcrunch.com)

Microsoft Foundry is the company’s model marketplace and developer workbench, so putting MAI models there means customers can compare Microsoft’s own systems against rival systems inside the same platform. Microsoft says the new models are available now in Foundry and in a separate testing area called MAI Playground. (techcommunity.microsoft.com)

The speech-to-text model is aimed at a very specific job: listening to messy real-world audio and writing it down accurately. Microsoft says MAI-Transcribe-1 covers 25 languages and runs batch transcription 2.5 times faster than the existing Azure Fast transcription offering. (microsoft.ai)

The voice model is built for the other half of a spoken conversation, where software has to answer out loud instead of just printing text on a screen. Microsoft says MAI-Voice-1 can generate one minute of audio in under one second on a single graphics processing unit, which is the kind of speed needed for live assistants and call systems. (techcommunity.microsoft.com)

The image model is the visual piece of the same strategy. Microsoft says MAI-Image-2 is priced at $10 per million input tokens and $40 per million output tokens, a usage-based cloud billing model meant for developers generating large volumes of images inside apps and design tools. (microsoft.ai)

This did not come out of nowhere. Microsoft created a new Microsoft AI unit under Mustafa Suleyman in March 2024, after hiring him and much of the Inflection AI team to build consumer products and long-term model capabilities inside the company. (news.microsoft.com)

The OpenAI partnership is still real, but it is no longer Microsoft’s only engine. Reuters reported in March 2025 that Microsoft had been developing in-house reasoning models and testing alternatives from xAI, Meta, and DeepSeek as it looked for more control over cost, product design, and supply. (reuters.com)

That helps explain why these first MAI launches are not chatbots. Speech transcription, voice generation, and image creation are narrower product layers than a giant general-purpose language model, so Microsoft can slot them into Copilot, Bing, PowerPoint, and enterprise tools without first replacing OpenAI everywhere. (economictimes.indiatimes.com)

For developers and designers, the practical change is that the market is starting to look less like one winner and more like a shelf of interchangeable parts. If Microsoft can supply speech, voice, and image models from inside its own cloud, teams that build around flexible workflows instead of one favorite model will have more room to switch on price, speed, and quality. (cloudwars.com)
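
For teams weighing that switch on price, the per-token billing quoted for MAI-Image-2 can be turned into a rough budget check. The sketch below is only an illustration: the two prices are the figures reported above, while the function name `estimate_cost` and the token counts are hypothetical and do not come from Microsoft.

```python
# Prices as quoted in the article for MAI-Image-2 (USD per million tokens).
INPUT_PRICE_PER_MILLION = 10.0
OUTPUT_PRICE_PER_MILLION = 40.0


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float = INPUT_PRICE_PER_MILLION,
    output_price: float = OUTPUT_PRICE_PER_MILLION,
) -> float:
    """Return the estimated bill in USD for one workload (hypothetical helper)."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical monthly workload; the token counts are illustrative only.
monthly_cost = estimate_cost(input_tokens=2_000_000, output_tokens=5_000_000)
print(f"Estimated monthly cost: ${monthly_cost:.2f}")  # prints "Estimated monthly cost: $220.00"
```

Passing a rival model's prices as `input_price` and `output_price` gives a like-for-like comparison for the same workload, which is the kind of check a flexible workflow makes easy.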
