DeepMind co‑mathematician hits 48% on FrontierMath
- DeepMind’s multi‑agent co‑mathematician scored 48% on FrontierMath Tier 4 and, separately, helped an Oxford professor solve an open group theory problem.
- The 48% result was highlighted as evidence that multi‑agent setups can aid mathematics research and collaboration.
- Researchers are treating this as a practical test of AI‑assisted math workflows rather than a finished theorem‑proving product.

(x.com)
Mathematics AI just took a weirdly practical step forward. Google DeepMind posted a new “AI co-mathematician” system on May 7, and the headline number is big: 48% on FrontierMath Tier 4, which DeepMind says is the best published result yet on that research-level benchmark. But the more interesting part is not the score. It’s that the system also helped Oxford mathematician Marc Lackenby crack Problem 21.10 from the Kourovka Notebook, an open group theory problem, in a workflow that looks more like collaboration than chatbot Q&A. (arxiv.org)

### What is this thing, exactly?

It’s not a single model spitting out proofs. DeepMind describes it as a workbench: basically a stateful research environment where multiple agents handle different jobs like ideation, literature search, computation, theorem proving, and theory building. The point is to mirror how actual math gets done: lots of dead ends, partial ideas, revisions, and handoffs, not one clean march from prompt to answer. (arxiv.org)

### Why does 48% matter?

Because FrontierMath is deliberately nasty. The benchmark uses unpublished problems, automated verification, and questions that can take experts hours or days. Epoch’s paper framed the gap pretty starkly: at release, state-of-the-art systems were solving under 2% of problems. Tier 4 is the research-level slice. So 48% is not “math solved.” But it is a huge jump over the baseline people were used to seeing on this benchmark. (arxiv.org)

### Is this directly comparable to older scores?

Mostly, but with a catch. DeepMind’s paper says the 48% result is on FrontierMath Tier 4 and calls it a new high among evaluated AI systems. But this is still a system-level result (agents, tools, memory, orchestration), not just a naked base model answering in one shot. That matters because the news here is as much about scaffolding as raw model intelligence. The gain suggests math performance is now heavily shaped by workflow design.
(arxiv.org)

### What happened with Marc Lackenby?

DeepMind says Lackenby used the system on several topology and group theory problems, and one line of work led to a resolution of Kourovka Notebook Problem 21.10. The telling detail is that the process was not “AI found theorem, human signed off.” An early proof attempt had a gap. A reviewer-style agent surfaced the flaw. Lackenby then recognized how to close it. That’s a very different story from autonomous theorem proving. It is more like a strong, tireless collaborator that can push, critique, and keep track of branches. (arxiv.org)

### Why use many agents instead of one smart model?

Because math research is branching work. You want one thread checking literature, another trying examples, another stress-testing a proof sketch, another writing up formal artifacts. A single chat thread tends to forget, collapse paths together, or overcommit to the first plausible idea. A multi-agent setup is closer to having a small research group, except one that never gets tired and remembers every failed attempt. That’s the basic bet behind this whole system. (arxiv.org)

### So is this a theorem-proving breakthrough?

Not exactly. DeepMind itself frames it as early tests of AI-assisted mathematical discovery. The paper talks about helping solve open problems, identify directions, and uncover missed references. That’s broader than formal proof systems like Lean-style theorem proving, and also messier. The product here is a research companion, not a finished machine that can independently produce trusted mathematics end to end. (arxiv.org)

### What’s the real takeaway?

The real shift is that math AI is moving from “can the model answer this problem?” to “can the system participate in research?” FrontierMath still matters because it gives a hard number. But the Lackenby example is why people are paying attention. It hints that the next gains may come from AI that can explore with humans, not just perform for benchmarks. (arxiv.org)
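
For readers who want a concrete picture of the draft, review, fix loop from the Lackenby story, here is a toy Python sketch. It is purely illustrative and not DeepMind’s implementation. Every name in it (`Workbench`, `ProofAttempt`, `reviewer`, `collaborate`, `close_gap`) is hypothetical, and the "reviewer" only checks a flag on each step, which stands in for whatever real critique an agent would do. The one idea it tries to capture is the stateful part: failed attempts and review notes are kept instead of being thrown away.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ProofAttempt:
    """A toy proof draft: a list of steps plus a flag per step (hypothetical)."""
    steps: List[str]
    justified: List[bool]  # True if the step is considered fully justified


@dataclass
class Workbench:
    """Shared state that remembers every attempt and every review note."""
    history: List[ProofAttempt] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def reviewer(attempt: ProofAttempt) -> Optional[int]:
    """Stand-in for a reviewer-style agent: index of the first gap, or None."""
    for i, ok in enumerate(attempt.justified):
        if not ok:
            return i
    return None


def collaborate(
    bench: Workbench,
    attempt: ProofAttempt,
    human_fix: Callable[[ProofAttempt, int], ProofAttempt],
    max_rounds: int = 3,
) -> Optional[ProofAttempt]:
    """Run at most max_rounds review passes; the human repairs each flagged gap."""
    for _ in range(max_rounds):
        bench.history.append(attempt)      # keep failed drafts too
        gap = reviewer(attempt)
        if gap is None:
            return attempt                 # the reviewer found nothing to flag
        bench.notes.append(f"gap at step {gap}: {attempt.steps[gap]}")
        attempt = human_fix(attempt, gap)  # the mathematician closes the gap
    return None                            # review budget exhausted


def close_gap(attempt: ProofAttempt, gap: int) -> ProofAttempt:
    """Toy 'human' who can justify whichever step gets flagged."""
    justified = list(attempt.justified)
    justified[gap] = True
    return ProofAttempt(attempt.steps, justified)


bench = Workbench()
draft = ProofAttempt(["setup", "key lemma", "conclusion"], [True, False, True])
final = collaborate(bench, draft, close_gap)
print(final is not None, len(bench.history), bench.notes)
```

In this toy run the first draft has a gap at the "key lemma" step, the reviewer flags it, the human stand-in repairs it, and the second pass goes through. Both drafts and the review note stay in the workbench afterwards.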