Gemini 3.1 Pro Leads Benchmarks

Google’s Gemini 3.1 Pro is topping recent benchmark leaderboards, hitting 94.1% on the LM Council preview and taking three of the top four slots — leaving Anthropic’s Claude 4.6 Opus and OpenAI’s GPT‑5.4 behind by double digits. That benchmark shift raises the bar for engineers expected to integrate multiple GenAI stacks in product work. (arturmarkus.com)

LM Council’s March 31, 2026 benchmark update identifies GPQA Diamond as a 198‑question, PhD‑level multiple‑choice science test used to compare frontier models. (lmcouncil.ai) The LM Council page documents a ±1.7% uncertainty for the top GPQA result and records Gemini 3.1 Pro at 79.6% on LM Council’s SimpleBench secondary evaluation track. Google DeepMind’s Gemini 3.1 Pro model card, published February 2026, describes the model as a natively multimodal reasoning model based on Gemini 3 Pro and specifies a 1,048,576‑token context window. (deepmind.google) Reporting from March 31, 2026 notes published pricing for the Gemini 3.1 Pro preview at approximately $2 per million input tokens and $12 per million output tokens, and lists Gemini 3 Flash and Gemini 3 Pro Preview as other performance‑tier variants with reported benchmark scores of 48.1% and 37.52% respectively. (arturmarkus.com) LM Council’s broader scoreboard shows GPT‑5.4 at 83.0% on the GDPval occupational knowledge‑work benchmark, while SWE‑bench Verified places Claude Opus 4.6 at 78.7% and Gemini 3.1 Pro at 75.6% on a human‑verified code‑fix task. (lmcouncil.ai) Third‑party analysis cited in the coverage calculates a roughly 60% inferencing cost gap favoring Google’s published rates and argues that the lower per‑query prices materially improve the economics of running multi‑model ensembles in inference‑heavy production services. (arturmarkus.com)

Gemini 3.1 Pro Leads Benchmarks

Get your own daily briefing