Agentic MLE tooling advances
UCSD’s AIBuildAI agent reportedly reached a 63.1% medal rate on MLE-bench, at or near the top of the leaderboard, autonomously assembling models across vision, text and time-series tasks in a way that matches expert human workflows. (x.com) The project suggests automated agent teams can handle end-to-end model building, which could shift bottlenecks from model design to deployment governance and testing. (x.com)
Machine learning engineering is the part of artificial intelligence where you do the unglamorous work: clean the data, pick a model, write training code, run experiments, and package the result so it can actually make predictions later. OpenAI’s MLE-bench turned that workflow into a test by pulling 75 real Kaggle competitions that require dataset prep, model training, and repeated trial and error. (arxiv.org)
A Kaggle competition is basically a timed leaderboard for prediction problems, and a “medal rate” means how often a system scores high enough to land in bronze, silver, or gold territory. In the original MLE-bench paper, the best setup reached bronze level in 16.9% of competitions, which shows how far these agents still were from strong human competitors in late 2024. (arxiv.org)
That baseline moved fast. The live MLE-bench leaderboard now shows Famou-Agent 2.0 at 64.44% overall and UC San Diego-linked AIBuildAI at 63.11%, both far above the original paper’s 16.9% result. (mlebench.com)
AIBuildAI is not just a chatbot that suggests a model name. Its GitHub page says it runs an agent loop that analyzes the task, designs models, writes code, trains them, tunes settings, evaluates the results, and then iterates again. (github.com) The important shift is that the agent is handling the whole kitchen, not just handing the cook a recipe. AIBuildAI’s own example run outputs model checkpoints, an inference script, a prediction file, and a progress report, which means it is trying to leave behind something a team could actually run after the experiment ends. (github.com)
The leaderboard also shows where the system is strong and where it still bends. AIBuildAI scores 77.27% on low or lite tasks, 61.40% on medium tasks, and 46.67% on high-difficulty tasks, so the drop-off with harder problems is real even when the overall number looks impressive. (github.com) This is why people in machine learning care about “agentic” tooling.
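As a quick sanity check, the three per-difficulty scores quoted above can be combined into an overall medal rate by weighting each one by the number of competitions in its split. The sketch below assumes a split of 22 low, 38 medium, and 15 high competitions (75 in total); those counts are an assumption, not something stated in this article. The variable names are illustrative only.

```python
# Per-difficulty medal rates quoted in the text (percent).
# The competition counts per split are an ASSUMPTION (22 / 38 / 15 = 75 total);
# the article itself only gives the total of 75 competitions.
splits = {
    "low": {"count": 22, "medal_rate": 77.27},
    "medium": {"count": 38, "medal_rate": 61.40},
    "high": {"count": 15, "medal_rate": 46.67},
}

total_competitions = sum(s["count"] for s in splits.values())

# Weighted average: each split contributes in proportion to its size.
overall = sum(s["count"] * s["medal_rate"] for s in splits.values()) / total_competitions

print(f"competitions: {total_competitions}")
print(f"overall medal rate: {overall:.2f}%")
```

Under those assumed counts the weighted average comes out at about 63.11%, which agrees with the overall figure cited for AIBuildAI. It also shows how much the headline number leans on the easier splits.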
If an agent can already do the first 80% of a Kaggle-style project in 24 hours, the scarce human work shifts away from writing boilerplate training loops and toward checking data leakage, validating metrics, and deciding whether the model should be deployed at all. (github.com 1) (github.com 2)
The leaderboard itself hints at another change: progress is now coming from the combination of model and workflow, not model alone. Famou-Agent 2.0, CAIR MARS+, MLEvolve, PiEvolve, and AIBuildAI all cluster between roughly 61% and 64%, which suggests the scaffolding around the language model is becoming as important as the language model inside it. (mlebench.com)
That makes deployment the next bottleneck. A system that can autonomously generate training code, checkpoints, and inference scripts is useful only if someone can verify the data sources, reproduce the run, test failure cases, and monitor the model after release. (github.com) (arxiv.org)
The headline number is 63.11%, but the deeper story is that end-to-end model building is turning into a software problem with benchmarks, leaderboards, and reusable agent loops. Once that happens, the hard part stops being “can we build a model” and starts being “can we trust what the builder just built.” (mlebench.com) (github.com)
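To make one of the human-side checks mentioned above concrete, here is a minimal sketch of a data leakage check: it flags test rows whose contents also appear verbatim in the training data. It is not taken from AIBuildAI or MLE-bench; the function names `row_fingerprint` and `leaked_test_indices` and the toy rows are hypothetical, and exact duplicates are only one narrow form of leakage.

```python
import hashlib


def row_fingerprint(row):
    """Hash a row's values, trimming whitespace so trivial formatting
    differences do not hide an otherwise identical record."""
    normalized = "\x1f".join(str(value).strip() for value in row)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_test_indices(train_rows, test_rows):
    """Return the indices of test rows that also occur in the training rows."""
    train_prints = {row_fingerprint(row) for row in train_rows}
    return [
        index
        for index, row in enumerate(test_rows)
        if row_fingerprint(row) in train_prints
    ]


# Toy data (hypothetical): the second test row duplicates a training row.
train = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
test = [("d", 4.0), ("b", 2.0)]

leaks = leaked_test_indices(train, test)
print(f"test rows also present in train: {leaks}")
```

A reviewer would typically run a check like this before trusting a reported validation score, alongside less mechanical checks such as looking for features that encode the target or time-ordered data split at random.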