Microsoft builds its own MAI models

Microsoft has rolled out three in-house foundation models for transcription, voice and image tasks (MAI-Transcribe-1, MAI-Voice-1 and MAI-Image-2), signalling that it wants strategic optionality at the model layer rather than relying solely on external suppliers. The company is also reported to be working to bring the OpenClaw agent framework into Microsoft 365 Copilot, which points to a future where agentic workflows are stitched directly into office software rather than living in separate chat windows. These moves fit a broader platform push to own more of the compute, the orchestration and the models that run on them, changing how enterprises think about vendor risk and integration.

Microsoft just put three of its own artificial intelligence models on sale through Microsoft Foundry on April 2: one that turns speech into text, one that generates speech, and one that makes images. That is Microsoft moving one layer deeper into the stack instead of renting every brain from outside labs. (microsoft.ai)

The speech-to-text model is called MAI-Transcribe-1, and Microsoft says it works across 25 languages and runs 2.5 times faster than its existing Azure Fast transcription service. Microsoft also says it hit a 3.9% average word error rate on the FLEURS benchmark, which is lower than the numbers it lists for GPT-Transcribe, Scribe v2, Gemini 3.1 Flash, and Whisper large v3. (microsoft.ai)

The speech-generation model is called MAI-Voice-1, and Microsoft says it can produce 60 seconds of audio in under one second on a single graphics processor. Microsoft is also letting developers create a custom voice from just a few seconds of audio inside Foundry, which turns the model from a demo tool into something a call center or software product could actually ship. (techcommunity.microsoft.com)

The image model is called MAI-Image-2, and Microsoft says it is already being used inside Copilot, Bing, and PowerPoint. The company says image generation is at least 2 times faster in Foundry and Copilot than before, and it says WPP is one of the first enterprise customers building with it at scale. (microsoft.ai)

These models were built by Microsoft’s MAI Superintelligence team, which TechCrunch says was formed in November 2025 under Microsoft AI chief Mustafa Suleyman. Microsoft is still tied to OpenAI, but it is now building direct substitutes for pieces of the product stack that used to come mostly from that partnership. (techcrunch.com)

The pricing tells you who Microsoft wants to win over. TechCrunch reported MAI-Transcribe-1 starts at $0.36 per hour, MAI-Voice-1 starts at $22 per 1 million characters, and MAI-Image-2 starts at $5 per 1 million text-input tokens and $33 per 1 million image-output tokens, which puts the pitch on cost as much as on quality. (techcrunch.com)

At the same time, Microsoft is pushing on a second front: software that can take actions, not just answer prompts. The newsletter The AI Economy reported on April 10 that Microsoft created a new team led by former Word chief Omar Shahine to bring the OpenClaw agent framework into Microsoft 365 Copilot. (thelettertwo.com)

An agent framework is the plumbing that lets an artificial intelligence system click buttons, move files, send emails, and work across apps instead of sitting in a chat box. The same report says OpenClaw can run on a machine and automate workflows across existing software, which is why Microsoft wants it inside office tools people already live in all day. (thelettertwo.com)

Visual Studio Magazine reported on April 2 that Shahine described his job as “Bringing OpenClaw + personal agents to Microsoft 365,” with Teams named as an early integration point. That suggests Microsoft is not treating agents as a separate app for enthusiasts, but as a feature that could show up inside meetings, documents, and workplace chat. (visualstudiomagazine.com)

Put those two moves together and the shape is pretty clear: Microsoft wants to own the models, the developer platform, and the software where the work happens. If that plan works, a company buying Microsoft 365 Copilot would be buying not just a chatbot, but a full office system with Microsoft-made speech, voice, image, and task-running layers underneath it. (techcommunity.microsoft.com, visualstudiomagazine.com)
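
For readers who want to see what the reported starting prices could mean for a budget, here is a rough back-of-the-envelope sketch in Python. The four rates are the figures TechCrunch reported; the function names and the sample workload are hypothetical and chosen only for illustration, and real Foundry billing may differ by region, tier, or volume.

```python
# Starting prices in USD as reported by TechCrunch. These are "starts at"
# figures, so treat any result as a lower-bound estimate, not a quote.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0

MILLION = 1_000_000


def transcribe_cost(audio_hours: float) -> float:
    """Estimated MAI-Transcribe-1 cost for a number of audio hours."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for a number of input characters."""
    return characters / MILLION * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost from text-input and image-output tokens."""
    return (
        text_input_tokens / MILLION * IMAGE_USD_PER_MILLION_INPUT_TOKENS
        + image_output_tokens / MILLION * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    )


def total_cost(audio_hours: float, characters: int,
               text_input_tokens: int, image_output_tokens: int) -> float:
    """Sum of the three per-model estimates for one workload."""
    return (
        transcribe_cost(audio_hours)
        + voice_cost(characters)
        + image_cost(text_input_tokens, image_output_tokens)
    )


if __name__ == "__main__":
    # Hypothetical monthly workload, invented purely for illustration:
    # 1,000 hours of audio, 5 million characters of speech output,
    # 2 million text-input tokens and 10 million image-output tokens.
    estimate = total_cost(1_000, 5_000_000, 2_000_000, 10_000_000)
    print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Under that invented workload, transcription comes to about $360, voice to about $110, and images to about $340, which shows how image-output tokens can dominate the bill even at modest volumes.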
