Grok 4.20 tops BridgeBench

xAI's Grok 4.20 topped the BridgeBench reasoning benchmark, reportedly ahead of GPT-5.4, Claude Opus 4.6 and Google's Gemini 3.1 Pro. The claim was shared on social media as evidence of how quickly model labs are iterating against one another (x.com).

BridgeBench, a coding benchmark that scores models on software tasks, now shows xAI’s Grok 4.20 Reasoning in first place on its reasoning slice as of April 13. (bridgebench.ai) The BridgeBench card labels that slice a “30 tasks” benchmark for “grounded reasoning over mixed artifacts,” with Grok 4.20 Reasoning scoring 41.8, ahead of OpenAI’s GPT-5.4 at 40.6, Anthropic’s Claude Opus 4.6 at 39.6, and Google’s Gemini 3.1 Pro at 34.3. (bridgebench.ai)

BridgeBench is not a general intelligence test. Its homepage says the suite covers 130-plus real-world coding tasks across algorithms, debugging, refactoring, generation, user interface work, security, speed, and a separate reasoning category. (bridgebench.ai) That distinction matters because the same BridgeBench snapshot shows different leaders in other categories on April 13: Claude Sonnet 4.6 led user interface work and security, Claude Opus 4.6 led debugging, and Grok 4.20 Reasoning led hallucination resistance and speed among the models shown. (bridgebench.ai) The leaderboard page also says BridgeBench is “re-running benchmarks across all models with the latest BridgeBench suite,” which means the site is still updating broader rankings even as category cards are live. (bridgebench.ai)

xAI’s own developer documentation describes Grok 4.20 as its flagship model and lists a reasoning variant with a 2,000,000-token context window, plus tool calling support and pricing of $2 per 1 million input tokens and $6 per 1 million output tokens. (docs.x.ai) xAI’s news page says Grok 4 became available on July 14, 2025 to SuperGrok and Premium+ subscribers and through the xAI application programming interface, placing the newer 4.20 line inside a product cycle that has kept shifting through 2026. (x.ai)

The social-media post tied to the claim circulated the BridgeBench result as a head-to-head win over rival labs, but BridgeBench’s own site frames the benchmark as a coding and agentic-development evaluation, not a single final ranking for all model use cases. (bridgebench.ai) So the cleanest reading of the result is narrow but concrete: on BridgeBench’s reasoning card on April 13, Grok 4.20 Reasoning was listed first, and the rest of the leaderboard race is still moving. (bridgebench.ai)
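
For readers who want to translate the quoted pricing into a rough per-request figure, the arithmetic can be sketched as follows. This is a minimal illustration that assumes only the two rates reported above ($2 per 1 million input tokens, $6 per 1 million output tokens); the function name and the token counts in the example are hypothetical, and actual billing may include factors this sketch ignores.

```python
# Rates as quoted in the article from xAI's developer documentation
# (USD per 1 million tokens). Everything else here is illustrative.
INPUT_USD_PER_MILLION = 2.0
OUTPUT_USD_PER_MILLION = 6.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return an approximate request cost in USD at the quoted rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical request: 500,000 input tokens and 20,000 output tokens.
example_cost = estimate_cost_usd(500_000, 20_000)
print(f"${example_cost:.2f}")
```

At these rates, a prompt that filled the full 2,000,000-token context window would cost about $4 in input tokens alone, before any output is generated.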
