CompleteRXN released as an open benchmark for completing reaction records
- Gabriel Vogel, Minouk Noordsij, Evgeny Pidko, and Jana Weber released CompleteRXN, an open benchmark for filling in missing pieces of chemical reaction records. (arxiv.org)
- The dataset pairs incomplete reactions with atom-balanced versions drawn from USPTO-linked records. Their CRB baseline reached 99.20% on a random split and 91.12% on an out-of-distribution (OOD) split. (arxiv.org)
- It matters because open reaction data has existed for years, but missing reagents and byproducts still break many synthesis-planning and prediction workflows. (open-reaction-database.org)
Chemical reaction datasets are supposed to be the raw material for synthesis AI. But a lot of them are missing important pieces: byproducts, co-reactants, even the stoichiometry needed to make a record chemically complete. That sounds like a bookkeeping problem. It isn’t. If the training data is incomplete, models for reaction prediction and retrosynthesis learn from incomplete chemistry. CompleteRXN is a new open benchmark built to attack exactly that gap. (arxiv.org)

### What’s actually missing in reaction databases?

A surprising amount. Reaction databases, especially patent-derived datasets, often record the main transformation but drop the “supporting cast”: salts, oxidants, leaving groups, side products, balancing coefficients. For a human chemist, some of that can be inferred. For a model, those omissions turn a real reaction into an under-specified string. (arxiv.org)

### Why is that a serious problem?

Because many downstream tools assume the reaction record is complete enough to reason over. Retrosynthesis planning, reaction prediction, and data-cleaning pipelines all get worse when the input is missing matter. A model can look accurate on sanitized benchmarks and then fail on messier real-world reactions, because the chemistry was only half written down. (arxiv.org)

### So what did CompleteRXN add?

The team built a supervised benchmark of aligned incomplete reactions and atom-balanced versions of those same reactions. USPTO-linked records are mapped onto curated mechanistic reactions, which gives the model a before-and-after view: here is the incomplete patent-style entry, and here is the chemically completed version you’d want a machine to recover. The dataset is also openly posted on Hugging Face. (arxiv.org)

### What is the model trying to do?

Not invent a new synthesis route. Just complete the record. That means predicting the missing pieces of a reaction under realistic missing-data conditions. It’s closer to reconstructing a torn recipe than writing a new one from scratch, and it is useful precisely because so many chemistry pipelines start from imperfect archival data. (arxiv.org)

### How well does it work?
Their main baseline, the Constrained Reaction Balancer, or CRB, scored 99.20% equivalence accuracy on a random split and 91.12% on an extreme out-of-distribution split. The drop matters. It says the task gets much harder once the model sees reactions that are structurally farther from the training examples. SynRBL, an existing algorithmic rebalancing method, also produced many plausible balanced reactions, but with lower benchmark accuracy. (arxiv.org)

### Why does the out-of-distribution result matter so much?

Because results can look better than they are when the test set looks too much like the training set. The paper’s most useful warning is that performance falls further on full uncurated USPTO data outside the benchmark. In other words, even a very good completion model still struggles once it leaves the cleaned evaluation world and hits raw reaction records. (arxiv.org)

### How does this fit with older open chemistry efforts?

It complements them more than it replaces them. The Open Reaction Database already provides an open infrastructure for structured reaction data. CompleteRXN goes after a narrower but painful problem inside that ecosystem, incomplete historical records, and turns it into a benchmarkable machine-learning task. (open-reaction-database.org)

### What’s the bottom line?

CompleteRXN is useful because it treats data quality as the bottleneck, not just model architecture. If chemistry AI is going to be reliable, it needs reaction records that actually describe what happened. This release gives researchers an open way to measure that gap and start closing it. (arxiv.org)
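
### What does “atom-balanced” look like in practice?

To make the idea concrete, here is a minimal Python sketch of an atom-balance check. It is an illustration only, not the CRB or SynRBL method and not code from the paper. It assumes each species is given as a simple molecular formula (no parentheses, charges, or SMILES), and the function names `atom_counts` and `missing_atoms` are hypothetical. The esterification example is a textbook reaction chosen for illustration, not a record from the dataset.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula: str) -> Counter:
    """Count atoms in a simple molecular formula such as 'C2H6O'."""
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def side_counts(formulas) -> Counter:
    """Sum atom counts over every species on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total.update(atom_counts(formula))
    return total


def missing_atoms(reactants, products) -> dict:
    """Report which atoms are unaccounted for on each side.

    Counter subtraction keeps only positive differences, so each entry
    lists the atoms present on one side but absent from the other.
    """
    left = side_counts(reactants)
    right = side_counts(products)
    return {
        "missing_from_products": dict(left - right),
        "missing_from_reactants": dict(right - left),
    }


# Acetic acid + ethanol -> ethyl acetate, recorded without the water byproduct.
incomplete = missing_atoms(["C2H4O2", "C2H6O"], ["C4H8O2"])
print(incomplete)

# The same reaction with water written down is balanced.
complete = missing_atoms(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"])
print(complete)
```

The incomplete record leaves two hydrogens and one oxygen unaccounted for on the product side, which is exactly the dropped water molecule. A check like this only detects that something is missing. Deciding which species actually fills the gap is the harder completion task the benchmark is built to measure.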