Training data problems found

A survey reported that 124 published disease-prediction models were trained on two poorly documented public datasets, raising concerns about their real-world reliability. The analysis covered models for stroke and diabetes whose source data could not be verified and may have been simulated or fabricated, with implications for external validation and clinical claims (nature.com).

A disease-prediction model is only as trustworthy as the patient data used to train it, and a new survey found 124 published models built on two datasets with unverifiable origins. (nature.com)

The underlying idea is simple: researchers feed a model rows of patient information, such as age, blood sugar and smoking history, so it can estimate the chance of a condition like stroke or diabetes. For clinical prediction work, reporting rules called Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis plus Artificial Intelligence, or TRIPOD+AI, say authors should explain where the data came from and how they were collected. (bmj.com)

In the new medRxiv preprint, statisticians Alexander D. Gibson, Nicole M. White, Gary S. Collins and Adrian G. Barnett examined two public Kaggle datasets, one on stroke and one on diabetes, and said both lacked basic provenance details, including when, where, why and how the records were gathered. The authors wrote that the datasets’ authenticity “could not be verified” and showed signs they were “likely to be simulated or fabricated.” (medrxiv.org)

The preprint says those two datasets were used in 124 peer-reviewed clinical prediction model studies. It also found evidence that three prediction models reached clinical practice, one model was cited in a medical-device patent, and the papers were cited in 86 review articles. (medrxiv.org)

The scale matters because clinical prediction models are no longer a niche exercise: the preprint cites an estimate of nearly 250,000 such models published through 2024. These tools are used to support diagnosis and prognosis, so weak training data can carry bad assumptions into later studies, reviews and product claims. (medrxiv.org)

Nature reported on April 15, 2026, that some of the flagged models were designed to predict stroke or diabetes risk and that a few might already have been used on patients. The same report said at least two journals were investigating studies that used the datasets. (nature.com)

The researchers said the stroke dataset had 5,110 records and the diabetes dataset had 100,000. They also reported anomalies that do not fit ordinary real-world health data, including unusually complete records and diabetes blood-glucose values collapsing into just 18 distinct numbers. (nature.com)

The preprint says practical recommendations appeared in 68% of the stroke papers and 80% of the diabetes papers, even though the source data were poorly documented. The authors recommended that journals and repositories require data-provenance reporting and said models based only on simulated or fabricated datasets should not be used to guide patient care. (medrxiv.org)

The immediate question is not whether machine learning can help medicine, but whether published models can show their work. In this case, the warning came before peer review was complete, and it landed after the suspect datasets had already spread through papers, reviews and at least some real-world use. (medrxiv.org)
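For readers who work with tabular health data, the two anomalies the researchers described (records with no gaps and a measurement that takes only a handful of distinct values) can be screened for in a few lines. The Python sketch below is illustrative only and is not the preprint's method: the function names, the toy rows and the `min_distinct` threshold are all assumptions invented for this example, and a flag from such a check is a reason to ask about provenance, not proof that data were fabricated.

```python
from typing import Any, Dict, List, Optional

# A row maps column names to values; None stands for a missing entry.
Row = Dict[str, Optional[Any]]


def missing_fraction(rows: List[Row], column: str) -> float:
    """Share of rows in which `column` is absent or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def distinct_count(rows: List[Row], column: str) -> int:
    """Number of different non-missing values seen in `column`."""
    return len({row[column] for row in rows if row.get(column) is not None})


def screen_column(rows: List[Row], column: str, min_distinct: int = 20) -> Dict[str, Any]:
    """Summarise one column and raise two simple warning flags.

    `min_distinct` is a hypothetical threshold chosen for this sketch;
    a sensible value depends on the measurement and the sample size.
    """
    miss = missing_fraction(rows, column)
    distinct = distinct_count(rows, column)
    return {
        "missing_fraction": miss,
        "distinct_values": distinct,
        "no_missing_values": miss == 0.0,
        "few_distinct_values": distinct < min_distinct,
    }


# Toy rows invented for illustration; they are not taken from either dataset.
toy_rows: List[Row] = [
    {"age": 40 + i % 30, "blood_glucose": [80, 100, 126, 140][i % 4]}
    for i in range(200)
]

glucose_report = screen_column(toy_rows, "blood_glucose")
age_report = screen_column(toy_rows, "age")
print(glucose_report)
print(age_report)
```

In the toy data, the glucose column trips both flags because it has no gaps and only four distinct values across 200 rows, while the age column has enough variety to pass the distinct-value check. Real measurements collected in clinics usually show some missing entries and a wider spread of values, which is why a pattern like this prompts questions about where the data came from.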
