ARC‑AGI‑3 trounces rivals

François Chollet's ARC‑AGI‑3 benchmark severely undercut recent frontier models — Gemini 3.1 Pro scored 0.37% and Grok‑4.20 scored 0% — highlighting reasoning‑first evaluations that stress new capabilities over scale (x.com).

ARC‑AGI‑3 was published and launched in late March 2026 as an interactive, agentic benchmark containing over 1,000 levels across more than 150 distinct environments designed for turn‑based problem solving. (arcprize.org) The benchmark uses an efficiency‑based scoring framework calibrated to human action baselines and the ARC technical report states that human test‑takers solved 100% of the environments during validation. (arxiv.org) The ARC Prize Foundation paired the launch with a $2,000,000 competition that guarantees a grand prize for the best open‑source solution and structures the challenge as an open research contest. (arcprize.org) ARC‑AGI‑3 explicitly evaluates in‑environment learning: agents must explore novel states, infer latent goals, construct internal dynamics models, and plan multi‑step action sequences without explicit natural‑language instructions. (arxiv.org) By comparison, earlier ARC‑AGI‑2 results showed frontier LM-based agents achieving high percentages on static, non‑interactive tasks (e.g., Gemini variants reached roughly the high‑70s percent on ARC‑AGI‑2 in early 2026), underscoring that ARC‑AGI‑3 targets capabilities beyond the skills those prior scores measured. (officechai.com) The ARC release included an open‑source agent toolkit, documented harness design issues, and noted early collaborations with academic partners; the public launch event on March 25 featured a fireside conversation between François Chollet and OpenAI CEO Sam Altman. (arcprize.org)

ARC‑AGI‑3 trounces rivals

Get your own daily briefing