Google Drops Faster, Cheaper Gemini Model
Google has unveiled Gemini 3.1 Flash-Lite, its latest AI model optimized for speed and cost-efficiency at scale. The model is aimed at developers building real-time applications like chatbots, recommendation engines, or other features that require fast and affordable inference, giving startups a powerful new option for embedding AI.
Released on March 3, 2026, Gemini 3.1 Flash-Lite is positioned as Google's fastest and most economical model in the Gemini 3 series, engineered specifically for high-volume, low-latency tasks. It is available in preview through the Gemini API in Google AI Studio and, for enterprise use, via Vertex AI.
The model is priced at $0.25 per million input tokens and $1.50 per million output tokens, which makes it more cost-effective than the previous Gemini 2.5 Flash model. Performance benchmarks indicate that Flash-Lite has a 2.5 times faster "Time to First Answer Token" and a 45% increase in output speed compared to Gemini 2.5 Flash.
A key feature for developers is the introduction of "thinking levels," which allow control over the model's reasoning depth. Developers can balance performance and cost by using minimal thinking for simpler tasks like translation and higher levels for more complex requests like user interface generation. Tulsee Doshi, the senior director of product management for the Gemini team, noted that this architecture combines high performance with low cost for large-scale developer tasks.
Gemini 3.1 Flash-Lite is a multimodal model capable of processing text, image, audio, and video inputs, with a context window of up to 1 million tokens. It achieved an Elo score of 1432 on the Arena.ai Leaderboard, outperforming other models in its tier.
Startups were among the early adopters of Flash-Lite. Whering, a digital wardrobe app, has used the model to achieve 100% consistency in its automated item tagging. Latitude, a company focused on AI-powered gaming, is using the model's speed and instruction-following capabilities to enable more sophisticated storytelling.
The model is also designed for tasks such as high-volume content moderation and data extraction, and it can serve as a router that classifies queries and directs them to more powerful models when necessary.
For instance, the open-source Gemini CLI uses Flash-Lite to determine a task's complexity and route it to either Flash or Pro accordingly, optimizing both cost and latency. Andrew Carr, Chief Scientist at Cartwheel, highlighted the model's speed and its utility in multimodal labeling use cases at a large scale.
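For developers weighing the quoted prices and the router pattern, the following is a minimal sketch in Python, not Google's or the Gemini CLI's actual implementation. The price constants come from the figures reported above. The tier labels, the keyword list, the 200-word threshold, and the function names are all hypothetical, and the keyword heuristic merely stands in for the classification call that a cheap model such as Flash-Lite would make in a real router.

```python
# Preview prices quoted in the article, in USD per one million tokens.
INPUT_PRICE_PER_M = 0.25
OUTPUT_PRICE_PER_M = 1.50


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of Flash-Lite traffic at the quoted prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical hints that a request needs deeper reasoning (assumption, not
# taken from any real router). A production router would ask the cheap model
# to classify the request instead of matching keywords.
COMPLEX_HINTS = ("refactor", "architecture", "prove", "debug", "multi-step")


def classify_complexity(prompt: str) -> str:
    """Return "complex" or "simple" using a stand-in keyword heuristic."""
    text = prompt.lower()
    # Very long prompts or prompts containing a hint are treated as complex.
    if len(text.split()) > 200 or any(hint in text for hint in COMPLEX_HINTS):
        return "complex"
    return "simple"


def route(prompt: str) -> str:
    """Pick a hypothetical tier label: "pro" for complex work, else "flash"."""
    return "pro" if classify_complexity(prompt) == "complex" else "flash"


if __name__ == "__main__":
    # 10 million requests at 500 input tokens and 100 output tokens each.
    requests = 10_000_000
    print(estimate_cost(requests * 500, requests * 100))
    print(route("Translate 'hello' into French"))
    print(route("Refactor the billing module architecture"))
```

At the quoted prices, the example workload of 10 million requests with 500 input and 100 output tokens each comes to about $2,750, which illustrates why a low-cost model is attractive as the first hop in front of more expensive tiers.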