AI agents are teaching themselves

Posts on X suggest that models are increasingly used to teach coding to other models and to convert math textbooks into runnable code, a shift toward agentic chains of learning and automation. (x.com) The same posts flagged that these advances are colliding with hardware and power bottlenecks, suggesting that software progress will be gated by infrastructure in the near term. (x.com)

The new thing in AI is not just bigger models. It is models building the lessons for other models. That shift has been coming for a while. In 2023, Microsoft’s phi-1 showed that a small code model could get surprisingly strong results by training on “textbook-quality” material and on synthetic textbooks and exercises written with GPT-3.5. The point was simple and radical at once: better teaching data could matter as much as brute scale. (arxiv.org)

Now that idea is turning into a workflow. Researchers increasingly use stronger models to generate prompts, answers, exercises, and critiques that become training material for weaker or cheaper models. Microsoft’s AgentInstruct gave that pattern a name, “generative teaching,” and built an agentic pipeline that starts from raw documents or code files, then produces large synthetic instruction datasets with prompts and responses already paired up. In its demonstration, the system created 25 million training pairs spanning coding, tool use, reading comprehension, and other skills. (arxiv.org)

That is why the social posts behind this story feel plausible. They are describing a real change in how capability is produced. Instead of waiting for humans to hand-label examples, labs now let one model draft the curriculum, another model solve it, and a verifier check whether the result actually works. In coding, this matters because code has a hard advantage over ordinary text: it can be run, tested, and rejected automatically. That makes it ideal fuel for self-improving loops. (arxiv.org)

The loop is already tightening. A 2025 paper called *A Self-Improving Coding Agent* showed an agent system that could edit its own codebase and raise its score on SWE-Bench Verified from 17% to 53% on a sampled subset, with gains on other coding benchmarks too. That is not artificial life. It is a very concrete mechanism. The agent writes changes to itself, runs evaluations, keeps what helps, and repeats.
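
To make that mechanism concrete, here is a minimal sketch of a "propose, evaluate, keep what helps" loop in Python. It is not the method from the paper: the toy benchmark `TESTS`, the function name `solve`, and the fixed list `PROPOSALS` (standing in for model-written edits) are all hypothetical, chosen only to show how execution can act as the verifier.

```python
from typing import List, Tuple

# Hypothetical toy benchmark: (input, expected output) pairs for a function
# named `solve`. A real system would use a full test suite or benchmark.
TESTS: List[Tuple[int, int]] = [(0, 0), (1, 1), (2, 4), (3, 9)]

# Hypothetical starting point and candidate edits. In a real agent these
# would be written by a model; here they are fixed strings.
INITIAL = "def solve(x):\n    return x\n"
PROPOSALS = [
    "def solve(x):\n    return x + x\n",  # no better than the start
    "def solve(x):\n    return x *\n",    # does not even parse
    "def solve(x):\n    return x * x\n",  # passes every test
]


def evaluate(source: str) -> float:
    """Execute candidate source and return the fraction of tests it passes."""
    namespace: dict = {}
    try:
        # Assumption: running untrusted code directly is acceptable for this
        # toy. A real pipeline would isolate execution in a sandbox.
        exec(source, namespace)
        solve = namespace["solve"]
    except Exception:
        return 0.0  # code that fails to load is rejected automatically
    passed = 0
    for value, expected in TESTS:
        try:
            if solve(value) == expected:
                passed += 1
        except Exception:
            pass  # a crash on one input counts as a failed test
    return passed / len(TESTS)


def self_improve(initial: str, proposals: List[str]) -> Tuple[str, List[float]]:
    """Greedy loop: keep a proposal only if it strictly beats the best score."""
    best_source, best_score = initial, evaluate(initial)
    history = [best_score]
    for candidate in proposals:
        score = evaluate(candidate)
        if score > best_score:
            best_source, best_score = candidate, score
        history.append(best_score)
    return best_source, history


if __name__ == "__main__":
    best, history = self_improve(INITIAL, PROPOSALS)
    print(history)
    print(best)
```

With these toy inputs the best score stays at 0.5 through the first two proposals and reaches 1.0 on the third. The strict comparison is a deliberate design choice in this sketch: a change that merely ties the current best is discarded, so the kept version only moves when the evaluation improves.
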
Once a model can code well enough to improve the scaffolding around its own reasoning, software starts compounding. (arxiv.org)

That same compounding is pushing models toward a harder target: turning technical writing into executable systems. ResearchCodeBench, introduced by Stanford researchers in 2025, tests whether models can implement ideas from recent machine learning papers that were likely unseen during training. Even the best model in that benchmark got only 37.3% of tasks right. That number is low, but it is the important kind of low. It shows the frontier has moved from autocomplete toward “read a fresh document and build the thing.” (arxiv.org)

Math is part of this story for the same reason code is. Textbooks, proofs, and worked examples are structured enough to be turned into synthetic curricula, and formal math systems make correctness easier to check than in ordinary prose. The dream is not just a chatbot that explains algebra. It is a chain in which one model reads a chapter, extracts concepts, writes problems, generates solution code or formal steps, and hands all of that to another model as training data. The claim in the social posts about math textbooks becoming runnable code is best read as the leading edge of this broader document-to-execution trend, not as a solved product category. Benchmarks still show plenty of failure when models try to implement novel ideas faithfully. (arxiv.org)

And that is where the bottleneck moves from software to steel, copper, and electrons. The International Energy Agency estimates data centers used about 415 terawatt-hours of electricity in 2024, roughly 1.5% of global electricity consumption. In modern facilities, servers account for around 60% of that load, with cooling and power infrastructure taking a large share of the rest. (iea.org)

AI hardware is making the squeeze worse.
NVIDIA said in late 2025 that moving from Hopper to Blackwell raised individual GPU power by 75%, while the jump to a 72-GPU NVLink domain drove a 3.4-fold increase in rack power density. NVIDIA’s own engineers described power infrastructure as the factor that now dictates the scale, location, and feasibility of new deployments, with megawatt-class racks on the horizon. (developer.nvidia.com)

So the strange picture in 2026 is this: models are getting better at manufacturing the very lessons that make future models better, especially in coding, where execution acts like a teacher. But every turn of that loop demands more compute, denser racks, and more electricity. The software keeps finding new ways to teach itself. The grid still has to keep up with a rack that is heading toward a megawatt.
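
For readers who want to sanity-check the energy figures quoted above, the back-of-envelope arithmetic can be sketched in a few lines of Python. The inputs are the rounded numbers from the article. The function names and the leap-year hour count are choices made for this sketch, and the derived values are rough implications of rounded inputs, not figures published by the IEA or NVIDIA.

```python
# Rounded figures as quoted in the article.
DATA_CENTER_TWH = 415.0        # IEA estimate for data centers, 2024
SHARE_OF_GLOBAL = 0.015        # "roughly 1.5%" of global electricity use
SERVER_SHARE = 0.60            # "around 60%" of data center load
GPU_POWER_INCREASE = 0.75      # Hopper -> Blackwell, per GPU
RACK_DENSITY_MULTIPLIER = 3.4  # rack power density, 72-GPU NVLink domain

HOURS_IN_2024 = 366 * 24       # 2024 was a leap year


def implied_global_twh(dc_twh: float, share: float) -> float:
    """Global consumption implied by a data center total and its share."""
    return dc_twh / share


def server_twh(dc_twh: float, server_share: float) -> float:
    """Portion of data center electricity attributed to servers."""
    return dc_twh * server_share


def average_power_gw(twh: float, hours: int = HOURS_IN_2024) -> float:
    """Convert annual energy (TWh) to average continuous power (GW)."""
    return twh * 1000.0 / hours  # TWh -> GWh, then divide by hours


def residual_rack_factor(rack_multiplier: float, gpu_increase: float) -> float:
    """Rack density growth not explained by per-GPU power alone.

    Assumption of this sketch: rack density is treated as a simple product
    of per-GPU power growth and everything else (such as more hardware per
    rack). The article does not state this decomposition.
    """
    return rack_multiplier / (1.0 + gpu_increase)


if __name__ == "__main__":
    print(f"Implied global use: {implied_global_twh(DATA_CENTER_TWH, SHARE_OF_GLOBAL):,.0f} TWh")
    print(f"Server share:       {server_twh(DATA_CENTER_TWH, SERVER_SHARE):,.0f} TWh")
    print(f"Average draw:       {average_power_gw(DATA_CENTER_TWH):.1f} GW")
    print(f"Residual factor:    {residual_rack_factor(RACK_DENSITY_MULTIPLIER, GPU_POWER_INCREASE):.2f}x")
```

Under these rounded inputs, 415 TWh a year works out to an average continuous draw of roughly 47 GW, with about 249 TWh going to servers. The 75% per-GPU increase accounts for only part of the 3.4-fold rack figure, leaving a residual of a little under 2x that, under the stated assumption, would have to come from how much hardware each rack holds.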

Shared from Scout - Be the smartest in the room.