Google Unveils Gemini Embedding 2
Google launched Gemini Embedding 2, a multimodal embedding model for text, images, video, audio, and PDF documents, with a reported latency reduction of 70%.
Gemini Embedding 2 maps text, images, video, audio, and PDF documents into a shared vector space. Because every media type lands in the same space, embeddings can be compared directly across modalities, which simplifies AI pipelines for tasks like semantic search and data clustering.

The model accepts up to 8,192 input tokens of text, up to six images, videos of up to 120 seconds, and PDFs of up to six pages. It also processes audio natively, skipping the transcription step that can lose information.

Google reports that Gemini Embedding 2 outperforms competitors such as Amazon's Nova 2 and Voyage Multimodal 3.5 in most benchmark categories, especially in text-to-video tasks. It uses Matryoshka Representation Learning (MRL), which lets developers truncate the output to fewer dimensions and trade some quality for lower storage costs.

The model is available in Public Preview through the Gemini API and Vertex AI, with integrations for frameworks such as LangChain and LlamaIndex as well as for vector databases. This makes it easier for developers to incorporate the model into existing AI applications.
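To make the shared vector space and the MRL trade-off concrete, here is a minimal sketch in plain Python. It does not call the Gemini API: the four-dimensional vectors are made-up stand-ins for real embeddings, and the helper names (`truncate_embedding`, `cosine_similarity`, `rank_by_similarity`) are hypothetical. It assumes the common MRL usage pattern of keeping the leading dimensions of a vector and rescaling the result to unit length before comparing.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit length.

    Assumption: leading dimensions carry the most information,
    which is the property MRL-trained embeddings are meant to have.
    """
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, corpus, dims):
    """Rank corpus items against the query using `dims` dimensions."""
    q = truncate_embedding(query, dims)
    scored = [
        (name, cosine_similarity(q, truncate_embedding(vec, dims)))
        for name, vec in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up toy vectors standing in for a text query and two media items.
query = [0.9, 0.1, 0.3, 0.2]
corpus = {
    "image_of_cat": [0.8, 0.2, 0.1, 0.4],
    "audio_clip": [0.1, 0.9, 0.5, 0.1],
}

# Compare rankings at full size and at half size.
for dims in (4, 2):
    print(dims, rank_by_similarity(query, corpus, dims))
```

Because items of any media type are scored with the same function, a single index can serve text, image, and audio queries. In this toy data the ranking is the same at four and at two dimensions; with real embeddings, how much quality survives truncation has to be measured on the target task.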