Auto‑research infrastructure rising

Francesco Pappone from Paradigma laid out the idea of 'auto‑research infrastructure' — systems that automate literature ingestion, experiment generation, evaluation and result tracking so teams can run continuous experiments (youtube.com). If operationalized, that model shifts the bottleneck from hiring more researchers to building orchestration that reliably compares experiments and logs outcomes, letting smaller teams iterate as fast as bigger labs (youtube.com).

A new idea is moving through AI research circles, and it is less about building a smarter model than building a better factory. In a recent interview, Francesco Pappone of Paradigma described “auto-research infrastructure” as the stack that would let machines ingest papers, generate experiments, run evaluations, and log results in a continuous loop rather than a one-off sprint (youtube.com). Paradigma’s own pitch is blunt: it says it is building “infrastructure for autonomous research,” not a single model or chatbot (paradigma.inc).

That framing matters because it shifts the target. The hard part is no longer just getting an AI system to produce an idea. Plenty of systems can already do pieces of that job. A 2025 paper in *National Science Review* described an automated review-generation method that synthesized 343 papers across 35 topics and reported hallucination risk below 0.5% after layered quality control and expert verification (academic.oup.com). A 2024 overview of retrieval-augmented generation for systematic literature reviews broke the workflow into four stages that can be automated: search, screening, extraction, and synthesis (mdpi.com). Those are the front-end chores of research. They save time, but they do not yet create a real research engine.

The more ambitious push is to connect those chores to hypothesis generation and experimental execution. The recent *Nature* paper on The AI Scientist said its system can create research ideas, write code, run experiments, analyze results, draft a manuscript, and even perform its own peer review (nature.com). One AI-generated paper from that system passed the first round of peer review at a workshop attached to a major machine learning conference, though the workshop acceptance rate was 70%, which is not the same as clearing a flagship conference bar (nature.com; sakana.ai).
Another recent project, AI-Researcher, explicitly pitches “end-to-end” automation from literature exploration to experimental validation and publication-quality reporting, while admitting that the field still lacks standard ways to measure progress across domains (arxiv.org).

That missing measurement layer is exactly where the infrastructure story gets serious. If a system can spawn ten or a hundred candidate experiments, then the bottleneck becomes comparison, not generation. Modern ML platforms already hint at the pattern. Weights & Biases markets itself around experiment tracking, dataset versioning, and hyperparameter sweeps (docs.wandb.ai). MLflow emphasizes logging parameters, code versions, metrics, and artifacts so runs can be compared and reproduced (mlflow.org). Auto-research infrastructure extends that logic upward. Instead of tracking model-training runs, it tracks chains of literature claims, hypotheses, implementations, failures, replications, and revisions.

Once that becomes the center of the system, team size starts to matter less than orchestration quality. Pappone argues that progress has been bottlenecked by how much intelligence humans can accumulate and operationalize, and that AI could move that bottleneck toward compute if autonomous agents can produce and verify results in a shared loop (paradigma.inc).

The idea sounds grand, but there is a practical reason it keeps resurfacing: today’s research agents are already good enough that people are building benchmarks just to tell them apart. DeepResearch Bench, released in 2025, evaluates “deep research agents” on 100 PhD-level tasks across 22 fields and scores both report quality and citation accuracy (deepresearch-bench.github.io). LiveBench, a separate benchmark effort, now refreshes questions every six months to reduce contamination and keep model comparisons from going stale (livebench.ai).

That is the real signal in this story.
The field is no longer arguing only about whether AI can help with research. It is starting to build the plumbing for continuous research operations, where ideas are cheap, experiments are automatic, and the scarce thing is a trustworthy ledger of what actually worked.

Paradigma’s website currently points visitors to a product called Flywheel (paradigma.inc). The name is almost too neat.
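
For readers who want a concrete picture of the "ledger of what actually worked" idea, here is a minimal Python sketch. It is an illustration only, not Paradigma's design or the API of any tracking tool named above: the `Run` and `Ledger` classes, their fields, and the sample values are all hypothetical. It shows one way to tie each experiment back to the literature claim and hypothesis behind it, keep failures in the record, and compare runs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    """One executed experiment, linked to the hypothesis it tests (hypothetical schema)."""
    run_id: str
    hypothesis: str           # the claim under test
    source_claim: str         # the literature claim that motivated the hypothesis
    params: dict              # settings used for this run
    metric: Optional[float]   # None when the run produced no result
    status: str = "ok"        # "ok" or "failed"


@dataclass
class Ledger:
    """Append-only record of runs; failures are kept, not discarded."""
    runs: list = field(default_factory=list)

    def log(self, run: Run) -> None:
        self.runs.append(run)

    def best(self, hypothesis: str) -> Optional[Run]:
        """Return the highest-metric successful run for a hypothesis, if any."""
        candidates = [
            r for r in self.runs
            if r.hypothesis == hypothesis and r.status == "ok" and r.metric is not None
        ]
        return max(candidates, key=lambda r: r.metric, default=None)

    def failures(self) -> list:
        """Return every run that was logged as failed."""
        return [r for r in self.runs if r.status == "failed"]


# Invented sample values, for illustration only.
ledger = Ledger()
ledger.log(Run("r1", "warmup helps", "paper-A", {"warmup_steps": 0}, 0.71))
ledger.log(Run("r2", "warmup helps", "paper-A", {"warmup_steps": 500}, 0.78))
ledger.log(Run("r3", "warmup helps", "paper-A", {"warmup_steps": 5000}, None, status="failed"))

winner = ledger.best("warmup helps")
print(winner.run_id, winner.metric)
print(len(ledger.failures()))
```

The point of the sketch is the comparison step: generating the three runs is trivial, while deciding which one counts as the result, and keeping the failed one visible, is the part the ledger exists to handle.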
