Open-Source AI Tackles Genomics
A new open-source AI model named Evo was trained on trillions of DNA bases drawn from genomes across the tree of life. The model can predict the effects of genetic mutations and even generate plausible new gene sequences, an advance that could accelerate research in synthetic biology and drug discovery.
The project is a collaboration between the Arc Institute, NVIDIA, and researchers from Stanford University, UC Berkeley, and UC San Francisco. Their goal is to apply large-scale AI to biology in the same way large language models are applied to human text.
The latest version, Evo 2, was trained on 9.3 trillion nucleotides from more than 128,000 genomes spanning all three domains of life: bacteria, archaea, and eukaryotes (including humans and plants). Its predecessor, Evo 1, was trained exclusively on the genomes of single-celled organisms.
The model's StripedHyena 2 architecture allows it to process DNA sequences of up to 1 million base pairs at once. This long context window lets the model capture relationships between distant parts of a genome. Training took several months and used more than 2,000 NVIDIA H100 GPUs.
In a demonstration of its predictive power, Evo 2 achieved over 90% accuracy in distinguishing benign from potentially pathogenic mutations in BRCA1, a gene associated with breast cancer. The model can identify these variants without having been trained specifically on the gene itself.
Beyond prediction, Arc researchers have used the model to generate functional synthetic bacteriophages, which could have applications in treating antibiotic-resistant bacterial infections. An earlier version of Evo generated a completely novel, functional CRISPR-Cas system as a proof of concept.
The entire project, including the model weights, training data, and code, has been made fully open source. It is also integrated into NVIDIA's BioNeMo framework, a platform designed to accelerate AI-driven biological research for scientists worldwide.
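For readers curious how a sequence model can flag a mutation without being trained on that gene, one common approach with genomic language models is to compare how likely the model finds the reference sequence versus the mutated one. The article does not say exactly how Evo 2 was evaluated on BRCA1, so the Python sketch below is only an illustration of that general idea. It does not call Evo 2: `ToyKmerModel`, `apply_snv`, and `variant_score` are hypothetical names, and the tiny Markov model is a stand-in for a real genomic model.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class ToyKmerModel:
    """Hypothetical stand-in for a genomic language model.

    An order-k Markov chain over DNA bases with add-one style smoothing.
    It is NOT Evo 2; it only illustrates likelihood-based scoring.
    """

    def __init__(self, k=3, pseudocount=1.0):
        self.k = k
        self.pseudocount = pseudocount
        # counts[context][next_base] = times next_base followed context
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1.0
        return self

    def log_likelihood(self, seq):
        """Sum of log P(base | previous k bases) over the sequence."""
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx_counts = self.counts.get(seq[i - self.k:i], {})
            num = ctx_counts.get(seq[i], 0.0) + self.pseudocount
            den = sum(ctx_counts.values()) + self.pseudocount * len(BASES)
            total += math.log(num / den)
        return total


def apply_snv(reference, position, alt_base):
    """Return the reference with a single base substituted at position."""
    if reference[position] == alt_base:
        raise ValueError("alt_base must differ from the reference base")
    return reference[:position] + alt_base + reference[position + 1:]


def variant_score(model, reference, position, alt_base):
    """log P(mutated) - log P(reference) under the model.

    A strongly negative score means the model finds the mutated sequence
    much less plausible than the reference.
    """
    mutated = apply_snv(reference, position, alt_base)
    return model.log_likelihood(mutated) - model.log_likelihood(reference)


# Toy usage: train on a repetitive pattern, then break the pattern.
training = ["ACGTACGTACGTACGTACGT", "CGTACGTACGTACGTACGTA"]
model = ToyKmerModel(k=3).fit(training)
reference = "ACGTACGTACGT"
score = variant_score(model, reference, 6, "A")  # G -> A at index 6
print(f"variant score: {score:.3f}")
```

In this toy setup the substitution breaks a pattern the model has seen many times, so the score comes out negative. With a real model, such scores would still need to be calibrated against variants of known clinical significance before being read as benign or pathogenic.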