ARC evaluation finds GPT‑5.5 and Anthropic’s Opus 4.7 improved but still fall short on ARC‑AGI‑3

- ARC Prize published a May 1 analysis of OpenAI’s GPT‑5.5 and Anthropic’s Opus 4.7 on ARC‑AGI‑3, its new interactive reasoning benchmark.
- The headline numbers were tiny: GPT‑5.5 scored 0.43% and Opus 4.7 scored 0.18%, after ARC reviewed 160 replays and traces.
- That matters because humans solve all ARC‑AGI‑3 environments, while frontier models still break on novelty, abstraction, and learning.

ARC‑AGI‑3 is a benchmark for something current AI still struggles with: walking into a new environment, figuring out the rules, and getting better as it goes. That sounds basic. Humans do it constantly. But ARC Prize’s new analysis says OpenAI’s GPT‑5.5 and Anthropic’s Opus 4.7 still fall apart on exactly that kind of task. The news is not that the models are useless. It’s that even improved frontier systems remain nowhere near humanlike generalization on this benchmark. (arcprize.org)

### What is ARC‑AGI‑3?

It’s the third generation of François Chollet’s ARC benchmark line, but this version is interactive. Instead of static puzzles, agents enter novel turn-based environments with no instructions, then have to explore, infer goals, build a world model, and adapt over multiple steps. ARC says the environments are hand-[…] (arcprize.org) […] the test is aimed at fluid adaptation rather than polished chat behavior. (arcprize.org)

### What happened this week?

On May 1, ARC Prize published an analysis package covering 160 replays and reasoning traces from GPT‑5.5 and Opus 4.7. ARC’s own post frames the raw scores as almost the least interesting part. The bigger point is that ARC‑AGI‑3 lets evaluators inspect the models’ step-by-step behavior: where they form a hypothesis, where they abandon a […] (arcprize.org 1) (arcprize.org 2)

### How bad were the scores?

Very bad, in absolute terms. ARC reports GPT‑5.5 at 0.43% and Opus 4.7 at 0.18% on the semi-private ARC‑AGI‑3 dataset. The technical report says frontier AI systems, as of March 2026, score below 1% on ARC‑AGI‑3, while human testing was used to ensure the environments are 100% solvable by people. So yes, GPT‑5.5 beat Opus 4.7 here, but both are still pinned near the floor. (arcprize.org)

### Why is this benchmark different?

Because it cares about learning over time, not just final answers. ARC says a 100% score would mean an agent can beat every game as efficiently as humans.
That makes the benchmark less like a school exam and more like being dropped into a strange toy world and having to reverse-engineer the rules fro[…] (arcprize.org) […]ething from training, ARC counts that as missing the point. (arcprize.org)

### Where did the models fail?

ARC says three failure modes kept showing up. First, “true local effect”: the model notices that an action changed something but fails to turn that into a global rule. Second, “wrong level of abstraction from training data”: the model mistakes the environment for some familiar game and applies the wrong schema. Third, “solved the leve[…] (arcprize.org) […]gh one level without extracting the rule well enough to transfer it forward. Basically, the systems can sometimes poke the world successfully without really understanding it. (arcprize.org)

### Does this mean frontier models made no progress?

No, and that’s the subtle part. ARC chose GPT‑5.5 and Opus 4.7 precisely because they are stronger frontier systems, and the analysis is about how improved models still fail under novelty. The gap is not “smart versus dumb.” The gap is “impressive on many benchmarks” versus “reliably[…] (arcprize.org) […]me point: intelligence here is tied to efficiency and adaptation, not just eventually getting some tasks right with more compute. (arcprize.org)

### Why should product teams care?

Because benchmark gains can hide brittle behavior. A model that writes code well, talks fluently, or even performs strongly on other evals may still fail when the job requires building a fresh internal model of a new system. That matters for agents, especially ones expected to operate autonomously in […] (arcprize.org) […]ng against reading “better frontier model” as “general-purpose robust learner.” (arcprize.org)

### So what’s the bottom line?

The story is not that GPT‑5.5 and Opus 4.7 bombed a quirky puzzle set. It’s that a benchmark built around novelty, adaptation, and transfer still exposes a giant gap between frontier AI and humans.
The models improved. But the kind of generalization people casually mean when they say “AGI” still looks very far away on this test. (arcprize.org)
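
For readers who want a feel for what "no instructions, explore, build a world model, adapt" means in practice, here is a toy Python sketch. It is not ARC's code, API, or scoring formula. The `HiddenRuleGrid` environment, the `explore_then_plan` agent, and the `efficiency` ratio are all hypothetical names invented for illustration. The only idea taken from the article is that the agent is not told the rules and is judged on how many actions it needs.

```python
import random

MOVES = [(0, -1), (0, 1), (-1, 0), (1, 0)]  # the four real moves on the grid


class HiddenRuleGrid:
    """Toy turn-based world (hypothetical): action ids 0-3 map to moves in a hidden order."""

    def __init__(self, size=6, seed=0):
        rng = random.Random(seed)
        self.size = size
        self._effects = MOVES[:]
        rng.shuffle(self._effects)         # the rule the agent has to discover
        self.pos = (size // 2, size // 2)  # start mid-grid so probes never hit a wall
        self.goal = (0, 0)
        self.actions_taken = 0

    def step(self, action):
        dx, dy = self._effects[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.actions_taken += 1
        return self.pos, self.pos == self.goal


def explore_then_plan(env, max_actions=100):
    """Probe each action once to learn its effect, then move greedily toward the goal."""
    model = {}  # learned world model: action id -> observed (dx, dy)
    done = False
    for action in range(4):
        before = env.pos
        after, done = env.step(action)
        model[action] = (after[0] - before[0], after[1] - before[1])
        if done:
            return True

    def predicted_distance(action):
        dx, dy = model[action]
        return abs(env.pos[0] + dx - env.goal[0]) + abs(env.pos[1] + dy - env.goal[1])

    while not done and env.actions_taken < max_actions:
        _, done = env.step(min(model, key=predicted_distance))
    return done


def efficiency(baseline_actions, actions_taken):
    """Hypothetical ratio, NOT ARC's metric: 1.0 means the reference action count was matched."""
    return min(1.0, baseline_actions / actions_taken)


env = HiddenRuleGrid(seed=3)
solved = explore_then_plan(env)
print(solved, env.actions_taken)         # True 10  (4 probes + 6 planned moves)
print(efficiency(6, env.actions_taken))  # 0.6, measured against the 6-move shortest path
```

The agent here succeeds because it turns each local observation ("action 2 moved me one cell") into a global rule it reuses on every later step. Skipping that step is roughly what the "true local effect" failure mode above describes. The exploration cost is also visible in the action count, which is why an efficiency-style measure penalizes agents that only get there eventually.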