Stargazer benchmark flunks physics recovery

- Researchers at the University of Toronto, Vector Institute, and MPI released Stargazer, a new benchmark showing AI agents can fit exoplanet data without recovering the right physics. (arxiv.org) - The benchmark has 120 radial-velocity tasks, including 20 real archival cases, and tests eight frontier agents on whether good fits also obey astrophysical constraints. (arxiv.org) - That matters because scientific AI is drifting toward agent workflows, but Stargazer suggests pattern-matching still breaks before causal discovery does real work. (arxiv.org)

Astronomy is a good stress test for scientific AI because the numbers can look right even when the story behind them is wrong. That is basically the whole point of Stargazer, a new benchmark fro(arxiv.org)ölkopf, Zhijing Jin, and Kristen Menou. The setup asks AI agents to infer planetary systems from radial-velocity data — the tiny back-and-forth wobble a planet induces in i(arxiv.org)d completely. The news is that they often found statistically decent fits while still missing the real physical system underneath. (arxiv.org)ting? Stargazer is a benchmark for iterative, tool-using AI agents, not a multiple-choice quiz. It contains 120 tasks across three difficulty tiers, with both synthetic cases and 20 real archival astronomy cases. Each task gives an agent radial-velocity time-series data and asks it to infer the planetary configuration that produced the signal. That means the agent has to search, fit models, revise hypotheses, and submit a structured answer under astrophysical constraints. (arxiv.org) ### What is radial-velocity data? A star does not sit perfectly still when p(arxiv.org)t tug shows up as periodic shifts in the star’s spectrum. Astronomers use those shifts to estimate things like orbital periods, eccentricities, and minimum planet masses. But the inverse problem is messy — multiple planet setups can mimic each other, noisy observations muddy the signal, and physically implausible solutions can still look numerically persuasive. (arxiv.org) ### Why is that a harder test than “did the curve fit”? Because a curve fit is not the same thing as(arxiv.org)ed the data. An agent can overfit noise, settle on the wrong number of planets, or choose orbital parameters that match the observations while violating the underlying dynamics. It is the difference between tracing the outline of a lock and actually cutting the right key. Stargazer is built to measure that gap directly. (arxiv.org) ### What did the researchers find? The headline result is a mismatch between optimization and understanding. Across ei(arxiv.org)n achieved good statistical fits but frequently failed to recover the correct physical system parameters. Even adding more test-time compute helped only a little. In many cases, the extra tokens were not productive exploration — they were recursive failure loops, with agents burning budget while going in circles. (arxiv.org) ### Why does extra compute not rescue this? Because the bottleneck does not seem to be raw persistence. If (arxiv.org)ts can just mean more wrong branches. The paper’s framing is useful here — these are dynamic, feedback-rich scientific tasks, so success depends on forming and pruning hypotheses under constraints, not just grinding longer. More compute helps only if the agent knows what signal to trust and what physical structure to preserve. (arxiv.org) ### Why astronomy? Turns out exoplanet inference is a near-perfect benchmark for this problem. It is rea(arxiv.org)and it is constrained enough that researchers can tell whether an answer is merely plausible or actually right. The authors also argue that the benchmark design could generalize beyond astronomy to other scientific model-fitting problems where the real target is not prediction alone but mechanism. (arxiv.org) ### So what does this change? It is a warning shot for the “AI scientist” story. If an agent can produce a nice-looking fit without recov(arxiv.org)scores may overstate its usefulness for discovery. That does not make these systems worthless — they may still help with search, coding, and exploratory analysis. But it does mean trust should rise only when benchmarks reward physically correct reasoning, not just numerical imitation. (arxiv.org) ### Bottom line? Stargazer says the hard part is still the hard part. AI agents are getting better at navigating scie(arxiv.org)ame as discovering the world. Until those two line up more reliably, “autonomous science” remains more demo than dependable instrument. (arxiv.org)

Stargazer benchmark flunks physics recovery

Get your own daily briefing