DeepInsightTheorem Post
- A new post introduced DeepInsightTheorem, which claims better mathematical reasoning through hierarchical proofs.
- The writeup describes a structured proof hierarchy to improve model math outputs and attracted community attention.
- Researchers and practitioners are discussing whether hierarchical proof scaffolding can measurably improve multi-step reasoning benchmarks. (x.com)
A new paper posted on April 17 says large language models do better at math proofs when they are trained to write them in layers instead of all at once. (arxiv.org)

The paper, “Learning to Reason with Insight for Informal Theorem Proving,” introduces DeepInsightTheorem, a dataset that breaks each proof into three parts: core techniques, a proof sketch, and the full proof. The authors are Yunhe Li, Hao Shi, Bowen Deng, Wei Wang, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Siyang Gao, Chao Wang, Shuang Qiu, and Linqi Song. (arxiv.org)

The authors say the bottleneck is not fluent writing but “insight,” meaning the ability to pick the right idea early enough to solve a hard problem. Their training method, Progressive Multi-Stage Supervised Fine-Tuning, teaches models first to write proofs, then to identify techniques, then to connect those techniques to complete solutions. (arxiv.org)

Informal theorem proving means writing a mathematical argument in ordinary language and notation, not in a proof assistant such as Lean, Coq, or Isabelle. The paper argues that this setup fits language models better because the models are pretrained on natural language rather than on formal proof code. (arxiv.org)

That focus lands as researchers are putting more weight on proof-writing tests instead of short-answer math scores. IMProofBench, one active benchmark, says it measures research-level mathematical proofs with a mostly private set of PhD-level questions; its public leaderboard currently shows GPT-5.4 at 47.9% and Gemini 3.1 Pro Preview at 46.8%. (improofbench.math.ethz.ch)

Another recent benchmark paper, “Towards Robust Mathematical Reasoning,” says existing evaluations are often too easy or reward only final answers. It introduced IMO-Bench and reported 65.7% on its advanced proof benchmark for Gemini Deep Think, alongside 1,000 human gradings used to build an automatic proof grader. (arxiv.org)

The DeepInsightTheorem authors are not the only group trying to impose structure on proofs.
A 2025 Association for Computational Linguistics paper on formal theorem proving used a five-level hierarchy inside model attention and reported gains of 2.05% on miniF2F and 1.69% on ProofNet. (aclanthology.org)

The open question is whether scaffolds like “core technique → sketch → proof” hold up outside a paper’s own training setup and benchmarks. For now, the post has put one concrete claim on the table: if models can be taught to find the big idea first, their math may get easier to grade and harder to fake. (arxiv.org)
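
For readers who want a concrete picture of the “core techniques, proof sketch, full proof” layout the paper describes, here is a minimal Python sketch of how one such record might be stored and rendered insight-first. It is an illustration only: the class name `LayeredProof`, its field names, the helper `to_training_text`, and the text layout are all assumptions made here, not the paper's actual schema or training format.

```python
from dataclasses import dataclass


@dataclass
class LayeredProof:
    """Hypothetical record: a problem plus the three layers the article names."""
    problem: str
    core_techniques: list[str]
    proof_sketch: str
    full_proof: str


def to_training_text(record: LayeredProof) -> str:
    """Render the layers in insight-first order: techniques, sketch, full proof.

    The section labels and ordering are assumptions for illustration.
    """
    techniques = "\n".join(f"- {t}" for t in record.core_techniques)
    return (
        f"Problem:\n{record.problem}\n\n"
        f"Core techniques:\n{techniques}\n\n"
        f"Proof sketch:\n{record.proof_sketch}\n\n"
        f"Full proof:\n{record.full_proof}"
    )


# A toy example, not taken from the dataset.
example = LayeredProof(
    problem="Show that the sum of two even integers is even.",
    core_techniques=["definition of even", "factoring out 2"],
    proof_sketch="Write each integer as twice another integer and factor the sum.",
    full_proof=(
        "Let a = 2m and b = 2n with m, n integers. "
        "Then a + b = 2(m + n), which is even."
    ),
)

print(to_training_text(example))
```

Putting the techniques and sketch ahead of the full proof mirrors the “find the big idea first” framing; whether the paper orders or formats its training text this way is not stated in the article.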