Microsoft ships three MAI models

Published by The Daily Scout

What happened

Microsoft introduced three in‑house models, MAI‑Transcribe‑1, MAI‑Voice‑1 and MAI‑Image‑2, and is productising them through Microsoft Foundry and a public playground, positioning the company to compete directly with OpenAI and Google. The models are framed as cheaper, enterprise-ready alternatives rather than frontier-only research, and engineers are now likely to design systems that compose specialised models behind stable interfaces. (venturebeat.com)
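As a rough illustration of what "composing specialised models behind stable interfaces" can look like, the sketch below defines two small interfaces and a pipeline that depends only on them. Every name here (Transcriber, Speaker, VoiceRelay and the fake implementations) is hypothetical and invented for this example; none of it is Microsoft's API, and a real adapter would wrap whichever vendor client is in use.

```python
from dataclasses import dataclass
from typing import Protocol


class Transcriber(Protocol):
    """Hypothetical interface: any speech-to-text backend."""

    def transcribe(self, audio: bytes, language: str) -> str: ...


class Speaker(Protocol):
    """Hypothetical interface: any text-to-speech backend."""

    def speak(self, text: str) -> bytes: ...


@dataclass
class VoiceRelay:
    """Pipeline that only knows the interfaces, not the vendors behind them."""

    transcriber: Transcriber
    speaker: Speaker

    def repeat_back(self, audio: bytes, language: str = "en") -> bytes:
        text = self.transcriber.transcribe(audio, language)
        return self.speaker.speak(text)


class FakeTranscriber:
    """Stand-in backend: treats the audio bytes as UTF-8 text."""

    def transcribe(self, audio: bytes, language: str) -> str:
        return audio.decode("utf-8")


class FakeSpeaker:
    """Stand-in backend: returns the upper-cased text as bytes."""

    def speak(self, text: str) -> bytes:
        return text.upper().encode("utf-8")


relay = VoiceRelay(FakeTranscriber(), FakeSpeaker())
print(relay.repeat_back(b"hello"))  # b'HELLO'
```

Swapping a backend then means writing one new adapter class; the pipeline code stays unchanged.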

Why it matters

Microsoft announced on April 2, 2026 that three in-house models are now available to developers through Microsoft Foundry and the public MAI Playground: MAI‑Transcribe‑1 for speech-to-text across the top 25 languages, MAI‑Voice‑1 for synthetic voice, and MAI‑Image‑2 for text-to-image generation. The image model already appears in the top three of a public image-model leaderboard.

Microsoft published concrete list prices and named an early commercial customer: transcription starts at $0.36 per hour of audio, voice generation starts at $22 per 1 million characters, and image generation is listed at $5 per 1 million text‑input tokens and $33 per 1 million image‑output tokens. WPP is named as an early enterprise customer using MAI‑Image‑2 in production.

On benchmarks and throughput, Microsoft reports that MAI‑Transcribe‑1 posts an average word error rate of roughly 3.9% on the multilingual FLEURS evaluation set, where word error rate is the number of word-level mistakes (substituted, dropped or wrongly inserted words) as a share of the words in the reference transcript. Microsoft also reports that batch transcription runs about 2.5× faster than its prior Azure “Fast” offering.

On generation efficiency and features, Microsoft says MAI‑Voice‑1 can generate 60 seconds of audio in under one second on a single graphics processor, a specialized chip used to run large AI models. The product surface includes a Personal Voice/cloning pathway that can create a custom voice from just a few seconds of sample audio, documented in the model’s technical card.

MAI‑Image‑2 debuted in the MAI Playground in mid‑March, and Microsoft reports at‑scale telemetry showing roughly 2× faster generation than the prior image generator. The company is rolling the model into Bing, PowerPoint and Copilot in phased releases and is offering API access to select customers while broader Foundry access opens.

The models were built and released by Microsoft’s MAI organization under Mustafa Suleyman, and Microsoft positions them as first‑party production models powering Copilot and other Microsoft products. That is a sign the company is moving more production AI features onto its own model stack and making those same models directly available to enterprise builders via Foundry.
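To make the word-error-rate figure quoted above concrete, here is a minimal sketch of the standard calculation: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the model's output, divided by the number of reference words. It assumes plain whitespace tokenisation and no text normalisation; the article does not describe Microsoft's evaluation pipeline, so this is not a reproduction of it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

Under this definition, a 3.9% average corresponds to roughly four word-level errors per 100 reference words.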

Key numbers

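  • $0.36 per hour of audio: starting list price for MAI‑Transcribe‑1 transcription.
  • $22 per 1 million characters: starting list price for MAI‑Voice‑1 voice generation.
  • $5 per 1 million text‑input tokens and $33 per 1 million image‑output tokens: list prices for MAI‑Image‑2.
  • Roughly 3.9%: average word error rate Microsoft reports for MAI‑Transcribe‑1 on the FLEURS evaluation set.
  • About 2.5×: batch transcription speed-up Microsoft reports over its prior Azure “Fast” offering.
  • Under one second: time Microsoft says MAI‑Voice‑1 needs to generate 60 seconds of audio on a single graphics processor.
  • Roughly 2×: generation speed-up Microsoft reports for MAI‑Image‑2 over the prior image generator.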
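For budgeting, the list prices reported in this article ($0.36 per audio hour for transcription, $22 per 1 million characters for voice, $5 per 1 million text-input tokens and $33 per 1 million image-output tokens for images) translate into simple arithmetic. The sketch below is an estimate only: the transcription and voice prices are "starts at" figures, so results are lower bounds, and the workload sizes in the example are hypothetical. The article does not say how many tokens a single image consumes.

```python
# List prices in USD as reported in the article.
TRANSCRIBE_USD_PER_AUDIO_HOUR = 0.36
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_USD_PER_MILLION_INPUT_TOKENS = 5.0
IMAGE_USD_PER_MILLION_OUTPUT_TOKENS = 33.0


def transcription_cost(audio_hours: float) -> float:
    """Estimated cost of transcribing the given hours of audio."""
    return audio_hours * TRANSCRIBE_USD_PER_AUDIO_HOUR


def voice_cost(characters: int) -> float:
    """Estimated cost of synthesising speech for the given character count."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated image-generation cost from input and output token counts."""
    input_cost = text_input_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_INPUT_TOKENS
    output_cost = image_output_tokens / 1_000_000 * IMAGE_USD_PER_MILLION_OUTPUT_TOKENS
    return input_cost + output_cost


# Hypothetical monthly workload, chosen only to illustrate the arithmetic.
print(f"transcription: ${transcription_cost(1_000):.2f}")    # $360.00
print(f"voice:         ${voice_cost(2_500_000):.2f}")         # $55.00
print(f"images:        ${image_cost(200_000, 4_000_000):.2f}")  # $133.00
```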
