Open-Source 'Living Datasets' Emerge

New platforms are enabling the creation of 'living datasets': continuously updated, extensible data streams generated by autonomous labs. For example, open-source microbial genotype-to-phenotype datasets are being built from automated experiments. These resources can be used to benchmark process performance, validate AI/ML models, and identify new optimization strategies in biomanufacturing.

- The "living dataset" concept parallels the development of "digital twins" in biomanufacturing, where a dynamic virtual model mirrors a physical process to simulate, predict, and optimize operations without disrupting production. This approach is being adopted in GMP environments to enhance process control and accelerate validation timelines by 40-70%.
- A primary challenge these open datasets address is the issue of data silos and lack of standardization in biopharma, which hampers the ability to gain holistic insights from fragmented information across different departments and systems. Integrating data from varied sources like LIMS and MES is a significant hurdle in achieving the goals of Biopharma 4.0.
- AI and machine learning models trained on these continuously updated datasets can reduce the number of required experimental runs by 30-50% and significantly decrease batch failures. These models optimize critical process parameters such as temperature, pH, and media composition to maximize yield and product quality.
- For viral vector manufacturing, companies like AGC Biologics are creating standardized "plug and play" process platforms (BravoAAV™ and ProntoLVV™) that streamline the transition to GMP manufacturing by using common equipment, reagents, and process steps. This templated approach accelerates timelines for cell and gene therapy development.
- The U.S. Food and Drug Administration (FDA) requires up to 15 years of follow-up data for gene therapy products, creating a massive need for robust, long-term data collection infrastructure that shared platforms could help sustain.
- Standardization of data formats and processes is a critical bottleneck in the cell and gene therapy field; initiatives are underway to harmonize standards for everything from manufacturing processes to the entire "vein-to-vein" supply chain to ensure product consistency and patient access.
- Major technology players and research consortiums are releasing open-source AI models and benchmark datasets to accelerate biological research, such as the Chan Zuckerberg Initiative's CELLxGENE and various models from Biohub that can interpret microscopy images or predict gene expression from DNA sequences.
- The implementation of automation is a key enabler for generating these datasets, with robotics and automated systems for cell culture, single-cell seeding, and quality assays running 24/7 to eliminate manual variability and increase data throughput.

Shared from Scout - Be the smartest in the room.