Gemini Embeddings go multimodal
- Google announced Gemini Embedding 2 GA as a unified multimodal embedding for text, images, audio, video and PDFs. (x.com)
- The embedding supports up to 8K tokens of context and roughly 120 seconds of video input for unified searches. (x.com)
- That lets retrieval and multimodal agents work from one vector space, simplifying toolchains for robotics and enterprise search. (x.com)

Google has turned Gemini Embedding 2 into a general-availability model that can place text, images, audio, video, and PDF files in one shared search index. (ai.google.dev)

Embeddings turn content into lists of numbers so software can compare meaning instead of exact words. Google’s Gemini API documentation now describes `gemini-embedding-2` as its first multimodal embedding model, with support for text, images, video, audio, and documents in a unified space across more than 100 languages. (ai.google.dev)

That shared space means a text query can retrieve an image, a video clip, or a PDF page if the system judges them semantically similar. Google’s Vertex AI docs describe the same setup as a way to “search for an image based on a text description.” (cloud.google.com)

The model’s technical limits are larger than those of Google’s earlier Gemini embedding release. Vertex AI lists a maximum input length of 8,192 tokens, output vectors of up to 3,072 dimensions, and audio inputs of up to 180 seconds. Video inputs are capped by token limits at about 81 seconds with audio, or about 120 seconds at one frame per second without audio. (cloud.google.com)

Google first released the multimodal version as `gemini-embedding-2-preview` on March 10, 2026, then promoted `gemini-embedding-2` to general availability on April 22, 2026. (ai.google.dev)

The shift folds several retrieval jobs into one model instead of separate text and image pipelines. Google’s docs say the model accepts interleaved inputs across image, text, document, audio, and video modalities, and returns a single aggregated embedding for multiple inputs. (cloud.google.com, ai.google.dev)

Google is also positioning it for search and agent systems that need better routing between formats. The Gemini API docs tie embeddings to retrieval-augmented generation, while Vertex AI says developers can add task instructions such as search, question answering, fact checking, and code retrieval to tune results for a specific job. (ai.google.dev, cloud.google.com)

The company is not replacing every older option with this model. Google’s Gemini API docs still keep `gemini-embedding-001` for text-only use cases, and Vertex AI says text-focused retrieval and long-form document analysis may still fit dedicated text-embedding tools better. (ai.google.dev, cloud.google.com)

The immediate change is practical: one vector space, one model ID, and one set of retrieval logic for mixed media. After a March preview and an April 22 general release, Google is now selling multimodal search as a standard building block rather than a side feature. (ai.google.dev, cloud.google.com)
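
For readers who want to see what "one vector space, one set of retrieval logic" means in practice, here is a minimal sketch in Python. It does not call the Gemini API or Vertex AI. The file names, the four-dimensional vectors, and the helper names `cosine_similarity` and `search` are all hypothetical, made up for illustration; real vectors would come from the embedding model and could have up to 3,072 dimensions. The only point is that once every item lives in the same space, a single similarity function can rank images, videos, audio, and documents against one text query.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=2):
    """Rank every item in the index against the query, regardless of modality."""
    scored = [(cosine_similarity(query_vector, item["vector"]), item) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item["id"], item["modality"], round(score, 3)) for score, item in scored[:top_k]]


# Hypothetical, hand-written toy vectors. In a real system each one would be
# produced by the embedding model from the underlying file.
MIXED_INDEX = [
    {"id": "dog_photo.jpg", "modality": "image", "vector": [0.9, 0.1, 0.0, 0.1]},
    {"id": "quarterly_report.pdf", "modality": "document", "vector": [0.0, 0.9, 0.3, 0.1]},
    {"id": "puppy_training.mp4", "modality": "video", "vector": [0.8, 0.0, 0.1, 0.4]},
    {"id": "earnings_call.mp3", "modality": "audio", "vector": [0.1, 0.8, 0.5, 0.0]},
]

# Stand-in for the embedding of a text query such as "a dog playing".
query = [1.0, 0.0, 0.0, 0.2]

results = search(query, MIXED_INDEX)
for item_id, modality, score in results:
    print(f"{score:.3f}  {modality:<8}  {item_id}")
```

With these toy numbers the image and the video rank above the PDF and the audio file, even though the query is text. The same `search` function serves every modality, which is the simplification the single-model approach is meant to offer over separate text and image pipelines.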