Karpathy's 'Autoresearch' Automates ML Experiments
Andrej Karpathy introduced "autoresearch," a project that automates roughly 100 ML experiments overnight, managing data, model selection, and evaluation. For practitioners, framing their own projects as scalable "agentic pipelines" could strengthen a portfolio and give them something concrete to discuss in interviews.
Karpathy's autoresearch project, released on March 9, 2026, lets AI agents run machine learning experiments autonomously overnight using a lean Python tool of approximately 630 lines. The agent iterates by modifying a Python training script, guided by human-written instructions in a Markdown file. The system targets a single NVIDIA GPU to keep complexity down, and uses validation bits-per-byte (BPB) as its primary metric, where a lower score indicates a more accurate model.
The agent commits a code change only if the final BPB improves on the previous best. Each training run is capped at five minutes, which allows roughly 12 experiments per hour, or potentially 100 experiments overnight. In initial runs, Karpathy showed the agent reducing validation BPB from 1.0 to 0.97 through autonomous code iteration.
The project builds on Karpathy's earlier work with nanochat and on agentic engineering, in which humans orchestrate agents instead of writing every line of code. The human role shifts to reviewing: the AI generates, tests, and iterates without needing approval for each experiment. The approach depends on a clear, measurable success criterion, in this case a lower validation BPB. Autoresearch is not a replacement for data scientists, but it demonstrates the potential for autonomous, self-directed, metric-driven iteration in machine learning. It is well suited to ablation studies, architecture comparisons, and hyperparameter sensitivity analysis on small-to-medium models.
Karpathy, a former Director of AI at Tesla and a founding member of OpenAI, is now focused on modernizing education in the age of AI with his new company, Eureka Labs. He also created Stanford's first deep learning course, CS 231n.
Karpathy notes that the code base is already in its 10,205th iteration. Tobi Lütke, the CEO of Shopify, has already implemented the tool.
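
The keep-if-better rule and the fixed time budget described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the autoresearch repository: `run_search`, `toy_propose`, `toy_evaluate`, and the `lr` setting are hypothetical names, the scoring function is synthetic, and the in-memory `best_config` stands in for committing a code change. The `bits_per_byte` helper uses the standard conversion from summed cross-entropy in nats to bits per byte.

```python
import math
import random


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert cross-entropy summed over a text (in nats) to bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


def run_search(propose, evaluate, baseline_bpb, n_experiments,
               budget_s=300.0, seed=0):
    """Greedy loop: keep a candidate only if it lowers the best BPB so far."""
    rng = random.Random(seed)
    best_bpb = baseline_bpb
    best_config = None  # stand-in for the last committed version of the script
    history = []
    for i in range(n_experiments):
        candidate = propose(best_config, rng)
        bpb = evaluate(candidate, budget_s)  # assumed to respect the time cap
        accepted = bpb < best_bpb
        if accepted:
            # Analogous to committing the change; otherwise it is discarded.
            best_bpb, best_config = bpb, candidate
        history.append((i, bpb, accepted))
    return best_config, best_bpb, history


def toy_propose(best_config, rng):
    """Hypothetical proposal step: perturb a learning rate."""
    base = best_config if best_config is not None else {"lr": 1e-3}
    return {"lr": base["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])}


def toy_evaluate(config, budget_s):
    """Synthetic score, not a real training run; lowest near lr = 3e-3."""
    return 0.9 + 0.1 * abs(math.log10(config["lr"] / 3e-3))


if __name__ == "__main__":
    budget_s = 300.0  # five-minute cap per run
    runs_per_hour = int(3600 // budget_s)  # ignores setup overhead
    config, bpb, history = run_search(
        toy_propose, toy_evaluate, baseline_bpb=1.0,
        n_experiments=runs_per_hour, budget_s=budget_s,
    )
    kept = sum(1 for _, _, accepted in history if accepted)
    print(f"{runs_per_hour} runs, {kept} kept, best synthetic BPB {bpb:.4f}")
```

Because a candidate is kept only when it strictly beats the current best, the sequence of accepted scores can only go down. That is why a single unambiguous metric matters: with a noisy or multi-objective score, this greedy rule could lock in changes that only look better by chance.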