LLM teams prototype proof-sketching with AlphaProof calls

- Google DeepMind researchers said on May 22 their AlphaProof Nexus framework paired LLM proof generation with Lean verification and AlphaProof calls for focused subgoals.
- The paper’s clearest figure was 9 of 353 open Erdős problems solved, with harder cases costing a few hundred dollars each.
- Microsoft’s PlugMem paper and Microsoft Learn observability guidance are the next concrete references for memory design and AI-system tracing.

Google DeepMind researchers said on May 22 that a new formal-math framework, AlphaProof Nexus, combines large language models with Lean verification and can call AlphaProof as a focused proving tool on harder subgoals. In a paper posted that day, the team said its “full-featured” agent solved 9 of 353 open Erdős problems and proved 44 of 492 OEIS conjectures, while a simpler LLM-plus-Lean agent was more expensive on the hardest tasks.

The discussion that followed on May 23 across developer and research threads centered less on a single product launch than on a pattern: use natural-language models for proof sketches, then hand formal subproblems to verifiers or specialized provers. That approach tracks the underlying limitation named in the DeepMind paper itself: LLM-generated arguments can contain subtle errors, and unreviewed intermediate steps can cascade through a proof. (arxiv.org)

### Why are researchers talking about “proof sketches” instead of full proofs?

Large language models are good at proposing strategies, decompositions and auxiliary lemmas, but Lean and similar proof assistants are the systems that check whether each inference step is valid. DeepMind’s paper said the basic agent alternated “LLM-based generation” with “Lean-based verification,” and the full agent added coordinated subagents plus AlphaProof for focused proof search. (arxiv.org)

A separate 2026 paper, “Prover Agent,” describes a similar division of labor. Its authors said their system coordinates an informal reasoning LLM, a formal prover model and Lean feedback, while also generating auxiliary lemmas to discover an overall proof strategy.

### Where does AlphaProof fit in that stack?

AlphaProof is Google DeepMind’s reinforcement-learning system for formal mathematical reasoning.
(arxiv.org) DeepMind said in 2024 that AlphaProof and AlphaGeometry 2 together solved four of six International Mathematical Olympiad problems at silver-medal level, and that AlphaProof handled algebra and number-theory problems by determining answers and proving them correct. (openreview.net)

In the new AlphaProof Nexus paper, the authors said subagents in the full system “can use AlphaProof … as a focused proof tool.” That matters because it places AlphaProof not as a general chat model, but as a specialist component invoked on targeted formal tasks after broader search or sketching has already narrowed the problem.

### Why did cost come up in these discussions?

The DeepMind team put a price tag on the tradeoff. (deepmind.google) Their paper said the most capable agent solved those 9 Erdős problems “at the per-problem cost of a few hundred dollars,” while the simpler LLM-and-Lean setup “proved costlier on the hardest problems.”

That makes the architecture itself part of the result. Researchers are not only asking whether a model can reason; they are asking which parts of the reasoning loop deserve expensive formal search and which parts can stay in cheaper natural-language exploration. (arxiv.org) That is an inference from the paper’s comparison of the basic and full-featured agents.

### What does the memory thread add to this?

Microsoft Research said on March 10 that its PlugMem system was built around a similar practical question: not how to store more interaction history, but how to turn it into structured, reusable knowledge. (arxiv.org) The researchers said raw retrieval can swamp agents with long, low-value context, while a general-purpose memory module improved performance across benchmarks with fewer memory tokens.

That helps explain why the May 23 discussion also touched task-specific memory adaptation.
In both theorem proving and agent memory, the engineering move is similar: compress context, isolate the reusable parts, and spend compute where it changes the outcome. That framing is supported by PlugMem’s emphasis on structured facts and reusable skills rather than raw transcript recall. (microsoft.com)

### Why are trust, verification and observability getting equal billing with model capability?

Microsoft said in guidance published in 2026 that traditional observability is not enough for generative and agentic AI systems because they are probabilistic, tool-using and increasingly autonomous. The company said teams need AI-native logs, metrics and traces that capture user inputs, retrieval provenance, tool invocations, permissions and outputs, with retention governed by privacy and compliance rules. (microsoft.com)

That is close to the same operational logic driving formal-proof workflows. If a system is going to call tools, decompose tasks and produce intermediate claims, teams want grounding, verification and evidence trails, not only larger models.

The next concrete places to watch are follow-on work around AlphaProof Nexus, including formal-math agent papers, and enterprise documentation on memory modules and AI observability from groups such as Microsoft Research and Microsoft Learn. (learn.microsoft.com) (arxiv.org)
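
For readers who want to see the sketch-then-verify pattern in concrete terms, the control flow reported above (an LLM proposes subgoals, a checker verifies each one, and a specialized prover is called only on the subgoals that fail) can be outlined in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepMind's implementation: `propose_sketch`, `verify` and `focused_prover` are hypothetical stand-ins for an LLM call, a Lean check and an AlphaProof-style call, and the toy stubs at the bottom exist only to make the example runnable. The `TraceEvent` record is likewise an assumed shape, loosely echoing the idea of logging each tool invocation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Subgoal:
    statement: str
    proof: Optional[str] = None  # candidate proof from the sketch, if any


@dataclass
class TraceEvent:
    tool: str      # which component was called (hypothetical labels)
    subgoal: str   # the statement it was called on
    ok: bool       # whether the verifier accepted the result


@dataclass
class Result:
    proved: bool
    trace: List[TraceEvent] = field(default_factory=list)


def prove(
    goal: str,
    propose_sketch: Callable[[str], List[Subgoal]],
    verify: Callable[[str, str], bool],
    focused_prover: Callable[[str], Optional[str]],
) -> Result:
    """Check each sketched subgoal; escalate only the ones that fail."""
    result = Result(proved=True)
    for sub in propose_sketch(goal):
        # Cheap path: accept the sketch's own proof if the verifier does.
        ok = sub.proof is not None and verify(sub.statement, sub.proof)
        result.trace.append(TraceEvent("verifier", sub.statement, ok))
        if ok:
            continue
        # Expensive path: ask the specialist, then re-verify its answer.
        candidate = focused_prover(sub.statement)
        ok = candidate is not None and verify(sub.statement, candidate)
        result.trace.append(TraceEvent("focused_prover", sub.statement, ok))
        if not ok:
            result.proved = False
            break
    return result


# Toy stubs (assumptions for illustration only; no real model or Lean call).
VALID = {("lemma_a", "proof_a"), ("lemma_b", "search_b")}


def toy_sketch(goal: str) -> List[Subgoal]:
    return [Subgoal("lemma_a", "proof_a"), Subgoal("lemma_b", None)]


def toy_verify(statement: str, proof: str) -> bool:
    return (statement, proof) in VALID


def toy_prover(statement: str) -> Optional[str]:
    return "search_b" if statement == "lemma_b" else None


outcome = prove("goal", toy_sketch, toy_verify, toy_prover)
```

In this toy run the first subgoal passes on the cheap path and only the second is escalated, which is the cost argument in miniature: the expensive component is invoked once, and every call leaves a trace entry that can be inspected afterward.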
