Study on unreliable data in clinical prediction research

- Alexander D. Gibson and co-authors posted a February 26, 2026 medRxiv preprint documenting unreliable Kaggle health datasets used in clinical prediction research.
- The paper says two stroke and diabetes datasets fed peer-reviewed models, with three showing evidence of clinical use and 86 review articles citing them.
- The preprint remains on medRxiv for scrutiny, and related papers have already drawn retractions from journals including Scientific Reports.

Alexander D. Gibson and three co-authors posted a medRxiv preprint on February 26, 2026, arguing that some clinical prediction research has been built on datasets with unclear or unreliable origins. The paper examined two widely used public Kaggle datasets, one on stroke and one on diabetes, and said both lacked clear data provenance despite appearing in peer-reviewed model-development studies. The authors said those models were not confined to academic exercises: some showed signs of use in clinical practice, one was cited in a medical-device patent, and many were pulled into review articles. medRxiv labels the paper as a preprint that has not been peer reviewed and says it should not be used to guide clinical practice.

### Which datasets are at the center of the paper?

The preprint identifies two large public Kaggle datasets, covering stroke and diabetes prediction, as the core examples in the study. Gibson, Nicole M. White, Gary S. Collins and Adrian G. Barnett wrote that the datasets lacked verifiable provenance but were still used to develop and validate clinical prediction models in published papers. (medrxiv.org)

The authors said the problem was not only that the datasets were public, but that their origin and construction could not be clearly established. In the paper’s abstract, they said prediction models based solely on simulated or fabricated datasets should never be used to directly inform patient-care decisions.

### How far did those datasets spread into the literature?

The paper says models built on the two datasets spread widely through the research record. (medrxiv.org) The authors reported that three prediction models had evidence of use in clinical practice, one model was cited in a medical-device patent, and the models were cited in 86 review articles.

Nature reported on April 15, 2026, that dozens of AI disease-prediction models had been trained on dubious data designed to predict diabetes or stroke risk. (medrxiv.org) That report said the affected work had moved beyond isolated papers and into a broader set of published models.

### Why does data provenance matter in clinical prediction work?

Clinical prediction models are used to estimate diagnosis or prognosis, and the preprint says unreliable source data can distort those estimates before a model ever reaches a clinic. (medrxiv.org) The authors wrote that prediction models should be developed with appropriate data, robust methods and transparent reporting so decisions are based on reliable predictions. (nature.com)

The BMJ published updated TRIPOD+AI reporting guidance in 2024 for studies developing or evaluating clinical prediction models using regression or machine-learning methods. That guidance sets minimum reporting recommendations, underscoring the field’s focus on transparency in how models are built and described.

### Have journals or publishers acted on papers tied to the datasets?

Scientific Reports published a retraction note last month for a paper on stroke prediction, and Retraction Watch reported on May 18 that the preprint had already contributed to several retractions involving the questionable datasets. (medrxiv.org) Retraction Watch also said 11 of the papers using the datasets were published in Springer Nature journals.

The Retraction Watch report quoted Gibson saying one Scientific Reports paper was easy to locate after the team had reviewed many questionable datasets. (bmj.com) That report described the underlying datasets as “comically bad,” attributing the phrase to researchers involved in the review.

### What did the authors say should happen next?

The authors recommended that journals and data repositories require data-provenance reporting as a condition of publication or sharing. (nature.com) Their GitHub repository for the project is public and describes the work as a research effort examining data provenance in published clinical prediction models.

The preprint remains available on medRxiv under DOI 10.64898/2026.02.24.26347028, and the University of Birmingham research page lists it as published on February 26, 2026. (retractionwatch.com) The social-media post that renewed attention to the paper was published on May 18, 2026, pointing readers to the study for review and scrutiny. (research.birmingham.ac.uk) (medrxiv.org)
