AI cxt traces evolution in minutes
- University of Oregon researchers published cxt, a DNA language model that reconstructs genetic ancestry fast enough to turn days-long inference into minutes, in PNAS. (pmc.ncbi.nlm.nih.gov)
- The model makes millions of most-recent-common-ancestor estimates in minutes, matches top coalescent methods closely, and can output calibrated uncertainty instead of one guess. (pmc.ncbi.nlm.nih.gov)
- That matters because genealogy inference is a bottleneck across population genetics, and cxt points toward faster, more general-purpose genomics models. (pmc.ncbi.nlm.nih.gov)

DNA ancestry is one of those problems that sounds simple until you try to compute it. A genome is just a long string of letters, but hidden in that string is a branching history of who shared ancestors with whom, and when. Pulling that history back out usually takes heavy statistical machinery and a lot of compute. (pmc.ncbi.nlm.nih.gov) The news is that a University of Oregon-led team says a model called cxt can do much of that work in minutes, and the paper is now out in PNAS.

### What is cxt actually doing?

cxt is a language model for population genetics. Not “language” in the chatbot sense; more like translation. The model reads local mutation patterns in DNA and predicts coalescence times, the population-genetics term for how far back two genetic lineages meet at a common ancestor. (pmc.ncbi.nlm.nih.gov) The paper frames this as “next-coalescence prediction,” basically borrowing the logic of next-token prediction and aiming it at ancestry instead of text.

### Why is that hard the normal way?

The hard part is that the thing scientists want, the ancestral history, is not directly observable. What they actually see are mutations scattered along genomes. (pmc.ncbi.nlm.nih.gov) Standard methods infer the hidden genealogy from those mutations using explicit probabilistic models, often built on coalescent theory and sequential Markov assumptions. Those methods can be very strong, but they are specialized, assumption-heavy, and not always easy to scale.

### So what changed here?

Instead of hand-specifying every inference step, the team trained a decoder-only transformer on evolutionary simulations from stdpopsim, a widely used simulation catalog in population genetics. (pmc.ncbi.nlm.nih.gov) That lets cxt learn statistical regularities linking mutation patterns to ancestry patterns across many demographic scenarios. The result is a model that the authors say generalizes across known and novel demographies rather than being locked to one narrow setup.
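
To make the link between visible mutations and hidden coalescence times concrete, here is a minimal Python sketch. It is not cxt and not the authors' code; it is a textbook-style toy under loudly simplified assumptions: a constant-size diploid population (`pop_size`), a fixed per-base mutation rate (`mu`), fixed-length windows (`window_bp`), and windows treated as independent, which real linked genomes are not. All names and values are illustrative. It draws a true pairwise coalescence time per window, scatters mutations on the two lineages, and then recovers a crude per-window estimate from the mutation count alone.

```python
import random


def poisson_draw(rng, lam):
    """Poisson(lam) draw: count unit-rate arrivals that land before time lam."""
    count, total = 0, rng.expovariate(1.0)
    while total < lam:
        count += 1
        total += rng.expovariate(1.0)
    return count


def simulate_pairwise_windows(n_windows, pop_size=10_000, mu=1e-8,
                              window_bp=10_000, seed=0):
    """Toy data: true pairwise coalescence times and observed differences.

    Assumes a constant diploid population, so the pairwise coalescence
    time (in generations) is exponential with mean 2 * pop_size.
    """
    rng = random.Random(seed)
    true_times, diffs = [], []
    for _ in range(n_windows):
        t = rng.expovariate(1.0 / (2 * pop_size))
        # Mutations accumulate on both lineages back to the common ancestor.
        expected = 2 * mu * window_bp * t
        true_times.append(t)
        diffs.append(poisson_draw(rng, expected))
    return true_times, diffs


def naive_tmrca(diffs, mu=1e-8, window_bp=10_000):
    """Moment estimate of coalescence time from the difference count alone."""
    return [d / (2 * mu * window_bp) for d in diffs]


if __name__ == "__main__":
    true_times, diffs = simulate_pairwise_windows(5)
    for t, d, est in zip(true_times, diffs, naive_tmrca(diffs)):
        print(f"true={t:9.0f}  differences={d:2d}  naive_estimate={est:9.0f}")
```

With only a handful of mutations per window, the naive estimate is right on average but noisy window by window. That noise is the gap that both classical coalescent methods and a learned model like cxt try to close by using context from neighboring sites and samples.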

### How fast is “fast”?

Fast enough to matter operationally. The paper says cxt can generate millions of most-recent-common-ancestor estimates in minutes. That is the real headline: not that AI touched genomics, but that a workflow that can bog down into long runs becomes something you can iterate on quickly. (pmc.ncbi.nlm.nih.gov) In practice, that changes how often researchers can rerun analyses, test assumptions, or work with bigger datasets.

### Does it actually hold up?

Mostly, yes, with the usual caveat that speed is useless if accuracy collapses. The PNAS paper says cxt performs competitively with state-of-the-art MCMC-based likelihood models, matching their accuracy on in-distribution scenarios and coming close on out-of-distribution ones. (pmc.ncbi.nlm.nih.gov) It also produces calibrated posterior uncertainty, which matters because ancestry inference is probabilistic by nature. You do not want a model that is merely fast and overconfident.

### Why do mosquitoes and humans show up?

Because toy benchmarks are easy, but real genomes are messy. The team applied cxt to empirical data from humans and mosquitoes to show it can survive contact with actual sequencing data. (pmc.ncbi.nlm.nih.gov) One especially practical detail is that the approach can be adapted with lightweight fine-tuning to handle missingness patterns in mosquito datasets, a chronic issue in field-collected samples.

### Is this building full evolutionary trees?

Not exactly in the simple “press button, get the tree of life” sense. cxt predicts pairwise coalescence times across genomic windows, which are ingredients for reconstructing genealogical history rather than a single finished species tree. (pmc.ncbi.nlm.nih.gov) But those ingredients are central. If you can estimate them quickly and at scale, you can speed up downstream work in demographic inference, comparative genomics, and studies of how populations split, mixed, and adapted.

### What’s the catch?

The catch is generalization.
Evolutionary history in the wild can violate the assumptions baked into simulations, and no model trained on synthetic data gets a free pass there. (biorxiv.org) The authors try to address that with broad training data, post hoc correction for species mutation rates, and fine-tuning on empirical quirks. But this is still best read as a powerful inference engine, not an oracle.

### Bottom line

Basically, cxt matters because it turns a slow, expert-heavy inference step into something much more iterative. That does not replace classical population genetics; it makes it more usable at modern data scale. And that is usually how real progress lands: not magic, just a bottleneck suddenly getting a lot less painful. (cxt.readthedocs.io) (pmc.ncbi.nlm.nih.gov)
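
For readers who want a feel for what “calibrated” uncertainty means in the accuracy discussion above, here is a small Python sketch of a coverage check. It does not use cxt or its API; the "predictions" are synthetic stand-ins, and every name here (`central_interval`, `coverage`, `synthetic_predictions`, `spread_factor`) is hypothetical. The idea is generic: if a model's 90% intervals are honest, roughly 90% of true values should fall inside them, while an overconfident model's intervals will catch far fewer.

```python
import random


def central_interval(draws, level=0.9):
    """Central interval read off the sorted posterior draws."""
    ordered = sorted(draws)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]


def coverage(truths, posterior_draws, level=0.9):
    """Fraction of true values that land inside their predicted interval."""
    hits = 0
    for truth, draws in zip(truths, posterior_draws):
        lower, upper = central_interval(draws, level)
        hits += lower <= truth <= upper
    return hits / len(truths)


def synthetic_predictions(n_items, n_draws, spread_factor, seed=0):
    """Stand-in data: spread_factor=1 is calibrated, <1 is overconfident."""
    rng = random.Random(seed)
    truths, draws = [], []
    for _ in range(n_items):
        center = rng.uniform(8.0, 11.0)  # assumed log-time scale, illustrative
        truths.append(rng.gauss(center, 0.5))
        draws.append([rng.gauss(center, 0.5 * spread_factor)
                      for _ in range(n_draws)])
    return truths, draws


if __name__ == "__main__":
    for label, factor in [("calibrated", 1.0), ("overconfident", 1.0 / 3.0)]:
        truths, draws = synthetic_predictions(2_000, 200, factor)
        print(label, round(coverage(truths, draws), 3))
```

The same check could in principle be run on any model that returns posterior samples alongside simulated ground truth, which is why calibration is usually assessed on simulations rather than on empirical genomes.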