Largest Open DNA Model Published
The Evo 2 DNA foundation model has been published in *Nature*, marking a milestone for open-source AI in biology. The Arc Institute credits its success to collaborative, open science and presents the result as evidence that public models can compete with proprietary ones in driving biotech research.
Evo 2 was trained on 9.3 trillion nucleotides from more than 128,000 genomes spanning bacteria, archaea, plants, and animals. The training dataset, OpenGenome2, is also open-source, so researchers can build on the same curated genetic information. The flagship version of the model has 40 billion parameters, with smaller 7B and 20B variants available for labs with fewer computational resources.
Under the hood, Evo 2 departs from the standard Transformer architecture. It uses a hybrid architecture called StripedHyena 2, which combines convolution-based Hyena operators with attention mechanisms. The design, which received contributions from OpenAI co-founder Greg Brockman, scales near-linearly with sequence length and can process a context window of one million nucleotides at once, a significant leap over its predecessor.
This long context window is key to capturing long-range interactions within the genome, a major limitation of previous models. It lets Evo 2 analyze entire genes together with their regulatory regions, leading to more accurate predictions of the functional impact of genetic variants, including pathogenic noncoding mutations and clinically significant variants such as those in BRCA1, without task-specific fine-tuning.
Training required a significant hardware investment: more than 2,000 NVIDIA H100 GPUs on Amazon Web Services for several months. This collaboration between the Arc Institute and NVIDIA highlights the growing convergence of big-tech infrastructure and academic life sciences research. The model and its tools are accessible through platforms such as NVIDIA BioNeMo.
Practical applications have already emerged from the model's pre-release. Researchers have used Evo 2 to predict genetic risk factors for Alzheimer's, design synthetic bacteriophages to combat antibiotic-resistant bacteria, and generate entire functional phage genomes. Its generative capabilities extend to designing specific DNA elements that could improve the precision of gene therapies by activating therapeutic genes only in target cell types.
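
For readers curious how a sequence model can judge a variant without task-specific fine-tuning, one common approach is to compare the likelihood the model assigns to the reference sequence with the likelihood it assigns to the mutated sequence. The Python sketch below illustrates that idea only. It does not call Evo 2: `make_toy_scorer` is a hypothetical stand-in (a smoothed k-mer model), and `apply_snv` and `delta_log_likelihood` are illustrative names rather than part of any Evo 2 tooling.

```python
import math
from collections import Counter
from typing import Callable


def apply_snv(ref: str, pos: int, alt: str) -> str:
    """Return ref with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(ref):
        raise ValueError("position outside sequence")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be one of A, C, G, T")
    return ref[:pos] + alt + ref[pos + 1:]


def delta_log_likelihood(
    ref: str, pos: int, alt: str, log_likelihood: Callable[[str], float]
) -> float:
    """Zero-shot variant score: log-likelihood of the variant minus the reference.

    Negative values mean the scorer finds the variant less plausible than
    the reference sequence.
    """
    return log_likelihood(apply_snv(ref, pos, alt)) - log_likelihood(ref)


def make_toy_scorer(corpus: list[str], k: int = 3) -> Callable[[str], float]:
    """Hypothetical stand-in for a genomic language model.

    Builds an add-one-smoothed k-mer model from corpus. A real workflow
    would plug in a trained model's sequence log-likelihood here instead.
    """
    kmer_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for seq in corpus:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            kmer_counts[kmer] += 1
            context_counts[kmer[:-1]] += 1

    def log_likelihood(seq: str) -> float:
        total = 0.0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Add-one smoothing over the four possible next bases.
            p = (kmer_counts[kmer] + 1) / (context_counts[kmer[:-1]] + 4)
            total += math.log(p)
        return total

    return log_likelihood


# Toy usage: the "genome" is a repeated motif, so breaking the motif is penalized.
scorer = make_toy_scorer(["ATGGCCATGGCCATGGCC"] * 5)
ref = "ATGGCCATGGCC"
same = delta_log_likelihood(ref, 3, "G", scorer)        # alt equals the reference base
disruptive = delta_log_likelihood(ref, 3, "T", scorer)  # alt breaks the repeated motif
print(f"no change: {same:.3f}, motif-breaking SNV: {disruptive:.3f}")
```

In this toy setup the unchanged sequence scores exactly zero and the motif-breaking substitution scores below zero. With a real genomic model in place of the stand-in, the same difference is what lets variants be ranked without training a separate classifier.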