Google's multimodal push

Published April 23, 2026 by The Daily Scout

- Google rolled out production-grade multimodal tooling and agent features at Cloud Next, targeting enterprise developers and creators.
- Gemini Embedding 2 became generally available, and Google unveiled Workspace Intelligence plus a no-code Agent Designer.
- Google also announced new eighth-generation TPUs for inference and training, and emphasised agentic retrieve-reason-act workflows in its Cloud pitch.

Why it matters

Google used Cloud Next on April 22 to push a simple message: its cloud business now revolves around AI that can search, reason across media and take actions for companies. (blog.google) A multimodal embedding model turns text, images, audio, video and documents into the same machine-readable format, so a system can match a spoken clip to a slide deck or a product photo to a manual.

Google said Gemini Embedding 2, its first natively multimodal embedding model, became generally available on April 22 through the Gemini API and Vertex AI. (blog.google) Google first introduced Gemini Embedding 2 in public preview on March 10 and said it maps text, images, video, audio and documents into one shared embedding space across more than 100 languages. In practice, that lets developers build search and classification systems that work across different kinds of files instead of just text. (blog.google)

At the same event, Google introduced the Gemini Enterprise Agent Platform, a set of tools for building and managing AI agents inside business software. Google’s Cloud Next roundup said the platform includes Workspace Intelligence and a no-code Agent Designer aimed at employees who are not traditional developers. (blog.google)

Google has been moving toward this “agentic” pitch for more than a year: models do not just answer prompts, but retrieve information, reason through steps and then act in software. In December 2024, Google introduced Gemini 2.0 as “a new AI model for the agentic era,” tying multimodal output and tool use to that broader strategy. (blog.google)

The hardware announcement matched that software pitch. Google said its eighth-generation Tensor Processing Units, TPU 8i and TPU 8t, split the work between fast inference for responsive agents and large-scale training for building more complex models. (blog.google) Google said TPU 8i is designed for low-latency inference, the part of AI computing that happens when a deployed model answers a request in real time. TPU 8t is built for training and can run larger models on a single memory pool, with general availability planned later in 2026. (blog.google)

The company framed the announcements as a direct enterprise offering, not just a research demo. CNET reported that Google paired the new chips with updates to the software stack businesses use to run AI, as Google tries to sell customers on a full system rather than a standalone model. (cnet.com) That matters in a cloud market where companies are comparing model quality, chip access, security controls and the cost of running AI at scale. Google said at Cloud Next that the roadmap now runs from Gemini models to Vertex AI tools to custom silicon, with agents as the workload tying those layers together. (blog.google)

The thread running through all of it was less about a single chatbot than about software that can work across file types and then do something useful with the result. Google spent Cloud Next arguing that enterprise AI is moving from prompt-and-response toward retrieve-reason-act systems it can host on its own stack. (blog.google)
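
For readers who want to see what a shared embedding space makes possible, here is a minimal sketch of cross-media search. It does not call the Gemini API or Vertex AI. The three-number vectors, the file names and the `search` helper are all hypothetical stand-ins for embeddings that a real model would produce. The only technique shown is ranking stored items by cosine similarity to a query vector, which is one common way such systems compare embeddings.

```python
import math

# Hypothetical toy vectors standing in for real embeddings. In a real system
# these would come from an embedding model; the values here are invented.
INDEX = {
    "slide_deck.pdf": [0.9, 0.1, 0.0],
    "product_photo.jpg": [0.1, 0.9, 0.1],
    "manual.docx": [0.2, 0.8, 0.3],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Rank indexed items by similarity to the query, highest first."""
    scored = [
        (name, cosine_similarity(query_vector, vector))
        for name, vector in index.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embedding of a spoken clip. Because every media type lives in
# the same vector space, the audio query can be compared with a PDF directly.
spoken_clip = [0.85, 0.15, 0.05]
results = search(spoken_clip, INDEX)

for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these invented numbers the slide deck ranks first for the spoken clip, which mirrors the article's example of matching audio to a presentation. A production system would swap the toy dictionary for a vector database and model-generated embeddings.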

Key numbers

- April 22: Gemini Embedding 2 became generally available through the Gemini API and Vertex AI. (blog.google)
- March 10: Google first introduced Gemini Embedding 2 in public preview.
- More than 100 languages: the range Google said the model covers while mapping text, images, video, audio and documents into one shared embedding space. (blog.google)
