Genomics Alliance Expands with Regeneron
The Alliance for Genomic Discovery is expanding its dataset as the Regeneron Genetics Center joins as a member. The collaboration builds on the Alliance's 312,000 whole genomes and will soon include proteomic data for 50,000 samples, creating a richer resource for ML engineers in biotech.
The Alliance for Genomic Discovery was launched in 2022 by DNA sequencing leader Illumina and Nashville Biosciences, a subsidiary of the Vanderbilt University Medical Center (VUMC). The project's foundation is VUMC's BioVU, a massive biobank containing more than 250,000 de-identified DNA samples linked to extensive, longitudinal clinical data. Regeneron Genetics Center (RGC) joins a roster of pharmaceutical giants including founding members AbbVie, Amgen, AstraZeneca, Bayer, and Merck. A primary goal of the consortium is to improve the diversity of genomic data to create more equitable research, with one cohort prioritizing samples from participants of African ancestry.
The addition of RGC is significant given its own large-scale efforts; the center has a database of nearly 3 million sequenced exomes linked to de-identified health records. RGC's stated goal is to leverage this data to find protective genetic factors that can point to the next generation of high-confidence drug targets.
The inclusion of proteomics data marks a critical evolution for the dataset. While genomics provides the genetic blueprint, proteomics studies the function and abundance of proteins. For ML engineers, this multi-omics approach enables the development of models that can connect genetic variations directly to their functional impact on biological processes. Specifically, the Alliance is creating a new dataset of 50,000 whole genomes with paired proteomic data. Members like GSK and Amgen are among the first to participate in this expansion, which aims to better understand the molecular mechanisms of diseases.
This enriched dataset, combining whole-genome sequences, deep clinical data, and now protein-level information, creates a powerful resource for building sophisticated AI/ML pipelines. The data is structured for complex tasks like identifying novel drug targets, predicting disease risk, and designing more targeted therapies.
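To make the multi-omics idea concrete for ML engineers, here is a minimal sketch of how paired genomic and proteomic records might be linked and modeled. It is not the Alliance's pipeline or data format: the sample IDs, genotype dosages, protein values, and the function names join_by_sample and fit_additive_model are all invented for illustration. The sketch assumes each sample has an allele dosage (0, 1, or 2 copies of a variant) and a measured protein abundance, and fits a simple additive linear model relating the two.

```python
from statistics import mean


def join_by_sample(genotypes, proteins):
    """Keep only samples present in both tables and return paired lists."""
    shared = sorted(genotypes.keys() & proteins.keys())
    return [genotypes[s] for s in shared], [proteins[s] for s in shared]


def fit_additive_model(dosages, abundances):
    """Least-squares fit of: abundance = intercept + slope * dosage."""
    x_bar, y_bar = mean(dosages), mean(abundances)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    if sxx == 0:
        # Every paired sample carries the same dosage, so no slope is defined.
        raise ValueError("variant does not vary across the paired samples")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, abundances))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical toy data: allele dosage per sample and protein abundance per sample.
genotypes = {"S1": 0, "S2": 1, "S3": 2, "S4": 1, "S5": 0}
proteins = {"S1": 10.0, "S2": 12.0, "S3": 14.0, "S4": 12.0, "S6": 9.0}

# Only S1-S4 appear in both tables; S5 and S6 are dropped as unpaired.
dosages, abundances = join_by_sample(genotypes, proteins)
intercept, slope = fit_additive_model(dosages, abundances)
print(f"per-allele effect: {slope:.2f}, baseline: {intercept:.2f}")
```

In this toy data each additional copy of the variant is associated with a higher protein level, which is the kind of variant-to-protein signal that paired data makes measurable. A real analysis would also need to account for covariates such as ancestry, age, and batch effects, and would correct for testing many variants and proteins at once.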
The project is notable for its speed, having sequenced 250,000 whole genomes since January 2023, a pace significantly faster than previous large-scale genomics initiatives. This rapid scaling highlights the advanced data processing and analytics infrastructure, such as Illumina's DRAGEN platform, that underpins the Alliance's operations.