Google Unveils Gemini Embedding 2
Google has launched Gemini Embedding 2, a multimodal embedding model that maps text, images, video, and audio into a single embedding space, supporting retrieval-augmented generation (RAG) across more than 100 languages.
Gemini Embedding 2 accepts up to 8,192 input tokens of text and up to six images per request in PNG or JPEG format. It also handles videos of up to 120 seconds in MP4 or MOV format and PDF documents of up to six pages. Audio is embedded directly, with no intermediate transcription step.
The model uses Matryoshka Representation Learning (MRL), which lets developers shorten the embedding vectors to trade retrieval quality against storage cost. Google recommends dimensions of 3,072, 1,536, and 768 for optimal quality. The model is available through Google's Gemini API and Vertex AI.
In benchmarks, Gemini Embedding 2 outperforms competing models such as Amazon's Nova 2 and Voyage Multimodal 3.5 across text, image, video, and spoken-language tasks, with significant gains in text-to-video tasks. Because every format maps into the same representation, one embedding model can stand in for separate per-modality models, which simplifies AI pipelines.
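
To make the Matryoshka trade-off concrete, here is a minimal Python sketch of how a full-size embedding could be shortened on the client side and what that does to raw storage. It is an illustration under stated assumptions, not Google's implementation: the function names are hypothetical, the vectors are random stand-ins rather than real model output, float32 storage is assumed, and rescaling the shortened vector to unit length is a common practice with MRL-style embeddings rather than something the article specifies.

```python
import math
import random

FULL_DIM = 3072  # largest of the three sizes cited in the article


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def storage_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for `num_vectors` embeddings (float32 assumed)."""
    return num_vectors * dim * bytes_per_value


# Random stand-ins for model output (assumption: no API call is made here).
rng = random.Random(0)
doc = [rng.gauss(0, 1) for _ in range(FULL_DIM)]
query = [rng.gauss(0, 1) for _ in range(FULL_DIM)]

for dim in (3072, 1536, 768):
    d = truncate_embedding(doc, dim)
    q = truncate_embedding(query, dim)
    megabytes = storage_bytes(1_000_000, dim) / 1e6
    print(f"dim={dim} cosine={cosine(d, q):+.4f} MB per 1M vectors={megabytes:,.0f}")
```

Under the float32 assumption, halving the dimension halves the raw storage, so moving from 3,072 to 768 dimensions cuts it to a quarter. The random vectors only demonstrate the mechanics: an MRL-trained model is designed so that the leading components carry most of the signal, which random data does not reproduce.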