UOregon AI traces DNA evolution
- University of Oregon researchers unveiled cxt, a DNA-reading AI model that reconstructs shared gene ancestry fast enough to turn a bottleneck into routine analysis. (eurekalert.org)
- The key number is scale: cxt can generate more than 1 million coalescence predictions in minutes while matching leading statistical methods in-distribution. (pnas.org)
- That matters because genealogy inference underpins population genetics, and older methods can bog down on large or incomplete genomic datasets. (eurekalert.org)


DNA ancestry is one of those hidden layers of biology that everything else sits on top of. If you want to know when a useful mutation appeared, how populations split, or how a pathogen lineage spread, you need a decent reconstruction of who inherited what from whom. (eurekalert.org) The problem is that this work is mathematically heavy and often slow. What changed this week is that a University of Oregon team pushed out a new AI model, called cxt, that treats DNA more like language and can infer shared ancestry at a scale that used to be a real bottleneck. (pnas.org)

### What did they actually build?

They built a language model for population genetics: basically a transformer inspired by GPT-2, but trained on simulated evolutionary data instead of human text. (eurekalert.org) Its job is to look at mutation patterns along genomes and predict coalescence times, which is the technical term for how far back you have to go before two gene copies meet at a common ancestor. The paper calls this “translation” from observed mutations to hidden ancestry.

### Why is coalescence such a big deal?

Because this is the backbone of evolutionary inference. Population geneticists use coalescence to reconstruct demographic history, estimate ancestral relationships, and reason about how genomes changed over time. (eurekalert.org) The catch is that the full family tree behind a genome, the ancestral recombination graph, is mostly invisible. Mutations are the breadcrumb trail, but reading that trail with classical probabilistic methods takes serious computation and usually some strong assumptions.

### What was broken before?

The old methods were not bad; in many cases they are still the gold standard. (eurekalert.org) But they are specialized, slower, and can struggle when datasets get large or messy. That matters now because genomics has changed shape.
Researchers are dealing with bigger datasets, more species, more field samples, and more incomplete data than the older workflows were really designed to handle gracefully.

### So what’s the new trick?

Instead of hand-specifying every statistical relationship, cxt learns mutation patterns from simulations of evolution across species including bacteria, rodents, mosquitoes, and primates. (pnas.org) That gives it a broad training base. Once trained, it can infer ancestry quickly at runtime: the paper says it can produce more than 1 million coalescence predictions in minutes. That is the part that makes people pay attention.

### Is it actually good, or just fast?

Turns out the pitch is not “fast but sloppy.” In the PNAS paper, cxt performs competitively with state-of-the-art Markov Chain Monte Carlo likelihood models. (eurekalert.org) It matches their accuracy on in-distribution tests and comes close even when the data are outside the training distribution. It also produces calibrated approximate posteriors, which means researchers get uncertainty estimates instead of just a black-box guess.

### Did they test it on real genomes?

Yes, not just simulations. The team applied cxt to population genomic data from humans and mosquitoes. That matters because mosquitoes are a useful stress test for messy real-world evolutionary questions, and human data are the obvious benchmark for ancestry-related inference. (eurekalert.org) The model’s ability to handle both is a sign that this is meant as a practical research tool, not just a neat demo.

### What kinds of questions could this speed up?

Andrew Kern’s lab frames it in very concrete terms: when did a disease-resistance gene emerge, and when did a species evolve a key trait? Those are classic population-genetics questions, but faster inference changes how often you can ask them. (pnas.org) A workflow that used to be expensive and slow becomes something you can run, compare, and iterate on much more freely.

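
For readers who want a feel for the quantity cxt predicts, here is a minimal sketch of a pairwise coalescence time under the textbook Wright-Fisher model. It is not cxt's method or code. It is a toy forward-in-reverse simulation, and the function names, population size, and replicate count are hypothetical choices made for illustration. The one assumption doing the work is standard: in a constant-size population of N diploid individuals, two gene copies pick the same parental copy with probability 1/(2N) in each generation.

```python
import random
import statistics


def simulate_pairwise_coalescence(n_diploid, rng):
    """Generations back until two gene copies share a parental copy.

    Assumes a constant-size Wright-Fisher population of n_diploid
    individuals, i.e. 2 * n_diploid gene copies per generation.
    """
    # Chance that both copies descend from the same parental copy
    # in any single generation.
    p_same_parent = 1.0 / (2 * n_diploid)
    generations = 1
    # Step back one generation at a time until the lineages merge.
    while rng.random() >= p_same_parent:
        generations += 1
    return generations


def mean_coalescence_time(n_diploid, replicates, seed=0):
    """Average coalescence time over independent simulated pairs."""
    rng = random.Random(seed)
    times = [
        simulate_pairwise_coalescence(n_diploid, rng)
        for _ in range(replicates)
    ]
    return statistics.mean(times)


if __name__ == "__main__":
    # Hypothetical toy values; real populations are far larger.
    print(mean_coalescence_time(n_diploid=100, replicates=5000))
```

Under these assumptions the average should come out close to 2N generations (about 200 for the toy values above), with wide variation from pair to pair. That spread is why uncertainty estimates matter, and real inference is harder still because the times have to be read indirectly from mutation patterns rather than observed.
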
### What’s the bottom line?

This is not “AI solved evolution.” It is more useful than that. University of Oregon’s cxt looks like a serious attempt to make genealogical inference faster without throwing away rigor, and if it holds up broadly, it could make ancestry reconstruction feel less like a special project and more like standard lab infrastructure. (pnas.org) (eurekalert.org)