Study Explores Use of Synthetic Data for Cancer Research
An article in Nature Reviews Cancer explores the use of AI-generated synthetic data for cancer research and clinical trials as a way to preserve patient privacy. While synthetic data enables model development and insight sharing without exposing patient details, the analysis notes that regulatory frameworks are still evolving. The article stresses that such data must be validated for clinical utility and bias before use in decision-making.
- Under the EU's GDPR, the creation of synthetic data from personal health information is considered a form of data processing, meaning it is not automatically exempt from privacy regulations and requires safeguards similar to anonymized data. The upcoming EU AI Act also identifies synthetic data as a privacy-preserving technique for high-risk AI systems but does not yet define specific privacy thresholds.
- Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are common deep learning models used to create synthetic medical data that mimics the statistical properties of real patient information without revealing individual identities. These models learn the patterns from actual data to generate new, artificial data points.
- A key challenge is that synthetic data can inherit and even amplify biases present in the original datasets. If a source dataset underrepresents certain demographic groups, the synthetic version will likely magnify these disparities, potentially leading to AI models that perform poorly for those populations.
- To ensure quality, synthetic data undergoes rigorous validation, including statistical comparisons to the original data, reviews by clinical experts to ensure plausibility, and testing AI models trained on synthetic data against real-world data to measure performance.
- Companies like Medidata are developing platforms, such as its "Simulants" technology, to create synthetic data from historical clinical trials sponsored by multiple organizations. This approach aims to create balanced and representative datasets for designing new trials and predicting patient outcomes.
- In the United States, there is no single comprehensive policy governing synthetic data; its use is addressed under existing frameworks like HIPAA for health data and FDA guidance for medical devices and clinical trials. The FDA has expressed cautious optimism, particularly for preclinical use to reduce animal testing, but emphasizes the need for rigorous validation.
- Synthetic data can significantly accelerate drug development by creating virtual patient populations for preclinical research and by simulating clinical trials to optimize protocols before enrolling real subjects. This can help identify safety signals and efficacy early on, potentially reducing development timelines.
- One major risk in using synthetic data is "synthetic trust," which is an unwarranted confidence in models trained on artificial datasets that may not preserve clinical validity or reflect real-world demographic realities. This can lead to the development of unreliable healthcare decision-making tools.
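
Two of the checks mentioned in the points above, a statistical comparison of synthetic data against the original and a look at whether demographic groups keep their share, can be illustrated with a minimal Python sketch. This is not the method used in the article or by any named platform. The function names `ks_statistic` and `group_share_shift` are hypothetical, and the ages and group labels are made-up toy values chosen only to show the mechanics.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


def group_share_shift(real_groups, synthetic_groups):
    """Per-group change in share (synthetic minus real). A negative value
    means the group is less represented in the synthetic data."""
    real_counts = Counter(real_groups)
    synth_counts = Counter(synthetic_groups)
    return {
        group: synth_counts[group] / len(synthetic_groups)
        - real_counts[group] / len(real_groups)
        for group in set(real_counts) | set(synth_counts)
    }


# Toy values for illustration only (assumed, not taken from any real dataset).
real_ages = [50, 55, 60, 65, 70]
synthetic_ages = [52, 57, 60, 66, 90]
real_groups = ["A"] * 8 + ["B"] * 2
synthetic_groups = ["A"] * 9 + ["B"] * 1

print("KS statistic for age:", round(ks_statistic(real_ages, synthetic_ages), 3))
print("Share shift by group:", group_share_shift(real_groups, synthetic_groups))
```

A small KS statistic suggests the synthetic distribution of a variable tracks the real one, while a negative share shift for an already small group is the kind of signal that would prompt a closer bias review. Neither check replaces clinical expert review or testing a model trained on synthetic data against real-world data.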