New AI Models Genetic Code Across Life

Researchers, including professors from UCSF, have developed Evo 2, an AI system that can model and design genetic code across all domains of life. By learning the complex relationships within DNA sequences, it can generate synthetic proteins. The work represents a significant advance in synthetic biology and could accelerate the creation of novel therapeutics and biomaterials.

Evo 2 was developed by a multi-institutional team from the Arc Institute, NVIDIA, Stanford University, UC Berkeley, and UC San Francisco. This biological foundation model was trained on a massive dataset of 9.3 trillion nucleotides from over 128,000 genomes, spanning bacteria, plants, humans, and even extinct species. The model is fully open-source, with its parameters, training code, and inference code publicly available.

The model's architecture, a StripedHyena variant, allows it to process DNA sequences up to 1 million nucleotides long, a significant increase over its predecessor, Evo 1. With 40 billion parameters, Evo 2 is the largest AI model for biology to date. This scale enables the model to identify long-distance relationships between different parts of a genome.

Evo 2 can predict the function of genes zero-shot, that is, without task-specific training, and can generate novel genetic sequences for synthetic proteins and even entire mitochondrial genomes. In early tests, it demonstrated over 90% accuracy in classifying the cancer risk of BRCA1 gene variants. The model has also been used to design completely new antitoxin proteins with no resemblance to known structures.

Evo 2 is integrated into the NVIDIA BioNeMo framework, which aims to accelerate scientific discovery. This integration facilitates the design of novel biomaterials and therapeutics by allowing researchers to generate and test new biological sequences in a virtual environment, which could significantly reduce the time and cost of R&D in the pharmaceutical and biotech industries.

The development of models like Evo 2 signals a shift towards the integration of multi-omics data, combining genomics with proteomics, transcriptomics, and other data types for a more holistic understanding of disease biology. Such integration requires a robust data architecture capable of handling petabyte-scale datasets. As these technologies mature, they will increasingly rely on multi-omic clinical platforms (MCPs) to translate complex data into actionable clinical insights.
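
Zero-shot variant scoring of the kind mentioned for the BRCA1 tests is commonly framed as a likelihood comparison: a sequence model scores the reference sequence and the mutated sequence, and a large drop in likelihood hints at a disruptive change. The sketch below illustrates only that idea. It swaps in a toy k-mer Markov model in place of Evo 2, it does not use Evo 2's or BioNeMo's actual API, and every function name and the training data are hypothetical.

```python
import math
from collections import Counter


def train_kmer_model(sequences, k=3):
    """Build a toy order-k Markov model by counting which base follows each k-mer.

    This is a stand-in for a genomic language model, NOT Evo 2 itself.
    """
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            context_counts[context] += 1
            transition_counts[(context, seq[i])] += 1
    return context_counts, transition_counts


def log_likelihood(seq, model, k=3, alphabet="ACGT"):
    """Sum the log-probabilities of each base given the k bases before it."""
    context_counts, transition_counts = model
    total = 0.0
    for i in range(k, len(seq)):
        context = seq[i - k:i]
        # Add-one smoothing so unseen transitions get a small nonzero probability.
        numerator = transition_counts[(context, seq[i])] + 1
        denominator = context_counts[context] + len(alphabet)
        total += math.log(numerator / denominator)
    return total


def variant_score(reference, position, alt_base, model, k=3):
    """Return log P(variant) - log P(reference) for a single-base substitution.

    Negative scores mean the model finds the variant less plausible than
    the reference sequence.
    """
    variant = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(variant, model, k) - log_likelihood(reference, model, k)


# Hypothetical training data: a simple repeating sequence.
model = train_kmer_model(["ACGT" * 10])
reference = "ACGTACGTACGT"
score = variant_score(reference, 5, "T", model)  # substitute C -> T at index 5
print(f"variant score: {score:.3f}")
```

With a real genomic model, the same difference in log-likelihood would be computed from the model's own sequence scores, and any threshold for calling a variant harmful would need to be calibrated against labeled data.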
