Data harmonization framework

Harvard DBMI described a framework that harmonizes multi‑hospital data using statistics, knowledge graphs and large language models to create shared semantics across systems. The approach is presented as a technical pathway for analysts to align disparate EHR and sensor data for multi‑site work. (x.com)

Hospitals often record the same clinical fact in different ways, and a Harvard Medical School team said it has built a framework to translate those local codes into a shared meaning without moving patient-level records across sites. (nature.com)

The paper, published April 3, 2026 in *Nature Communications*, describes a graph-based system that combines institution-level summary statistics from electronic health records, curated biomedical knowledge graphs, and semantic signals from large language models. The authors said they tested it across seven institutions in the United States and France and across two languages. (nature.com)

Electronic health records are digital charts, but each hospital often uses its own local vocabulary for diagnoses, lab tests, and procedures. A 2022 paper from many of the same researchers said those coding differences block “semantic interoperability,” meaning data can move between systems but lose their meaning in translation. (pubmed.ncbi.nlm.nih.gov)

The new framework treats harmonization as a representation-learning problem, which means it places related medical concepts near each other in a shared mathematical map. Instead of relying only on fixed standards or manual code matching, it learns from how codes co-occur inside each health system and from text-based descriptions of those codes. (nature.com)

Knowledge graphs are structured maps of facts and relationships, like linking a drug to a disease or a lab test to a body system. Large language models add another layer by reading the text attached to codes and helping infer when two sites are describing the same thing with different labels. (nature.com; arxiv.org)

The privacy piece is central to the pitch. The authors said the system uses institution-specific summary data rather than patient-level records, a design aimed at letting hospitals collaborate without pooling raw charts in one place. (nature.com)

That approach extends earlier work from the same research line. In 2022, the group introduced a Multiview Incomplete Knowledge Graph Integration method that combined code co-occurrence patterns and text embeddings from Self-Aligning Pretrained BERT, or SAPBERT, to translate between partially overlapping hospital code systems. (pubmed.ncbi.nlm.nih.gov)

Harvard’s Department of Biomedical Informatics has framed this kind of work as part of a broader effort to integrate and interpret complex biomedical data for research and care. The department says its mission includes turning large biomedical data sets into computational tools for precision medicine. (dbmi.hms.harvard.edu)

Other groups are also trying to automate harmonization. A 2025 study led by Harvard researchers described SONAR, a method that matched variables across National Institutes of Health cohorts by combining the wording of variable descriptions with the statistical distribution of participant data. (pmc.ncbi.nlm.nih.gov)

The practical promise is narrower than a universal medical record. The April 2026 paper presents a way for analysts to align site-specific vocabularies well enough to train and deploy models across different health systems while keeping the local data where it already sits. (nature.com)
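
For readers who want a concrete picture of matching codes from co-occurrence patterns and text descriptions, the toy sketch below scores candidate pairs across two sites. It is not the authors' method. It assumes each site shares only summary counts of how often a local code appears alongside three already-mapped anchor concepts, and it uses simple word overlap as a stand-in for language-model embeddings. All code names, counts, descriptions, and the `alpha` weight are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(text_a, text_b):
    """Word-overlap similarity; a crude stand-in for text embeddings."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def match_codes(site_a, site_b, alpha=0.5):
    """Map each site A code to the site B code with the highest blended score.

    alpha weights the co-occurrence signal; (1 - alpha) weights the text signal.
    """
    matches = {}
    for code_a, (vec_a, desc_a) in site_a.items():
        scores = {
            code_b: alpha * cosine(vec_a, vec_b)
            + (1 - alpha) * jaccard(desc_a, desc_b)
            for code_b, (vec_b, desc_b) in site_b.items()
        }
        matches[code_a] = max(scores, key=scores.get)
    return matches


# Hypothetical summary-level inputs, no patient-level rows.
# Each vector holds co-occurrence counts with three assumed anchor
# concepts: (diabetes, heart failure, pneumonia).
SITE_A = {
    "LAB_0417": ([120, 5, 3], "hemoglobin a1c blood"),
    "LAB_0932": ([4, 98, 10], "bnp natriuretic peptide"),
}
SITE_B = {
    "X-77": ([3, 80, 12], "b-type natriuretic peptide"),
    "X-12": ([95, 7, 2], "glycated hemoglobin hba1c"),
}

if __name__ == "__main__":
    for code_a, code_b in match_codes(SITE_A, SITE_B).items():
        print(f"{code_a} -> {code_b}")
```

In this toy data the two signals agree, so each site A lab code pairs with the site B code that has a similar co-occurrence profile and overlapping description words. A real system would have to handle cases where the signals conflict or where a code has no counterpart at the other site.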
