Data Pipelines Called 'Sewer Pipes'

Clinical Architecture CEO Charlie Harp says AI is exposing a 37-year-old problem: healthcare data quality is not good enough. On the *Practical AI in Healthcare* podcast, he said that until AI arrived, organizations didn't realize their data pipelines were "sewer pipes, not water pipes." He is advocating for the PIQI framework, which scores data on Availability, Accuracy, Conformance, and Plausibility, to systematically assess data *before* it reaches an AI model.

Harp's "sewer pipes" analogy highlights a long-standing issue: data quality problems in healthcare were identified as early as 1977, and past studies have put the accuracy or completeness of healthcare data anywhere from 60% to 90%. This history underscores that AI is not creating a new problem but magnifying the consequences of existing poor data infrastructure.

The PIQI (Patient Information Quality Improvement) framework is an open-source initiative designed to create a standardized method for evaluating the quality of patient-centric healthcare information. Currently being balloted with HL7 to become a standard, PIQI assesses data against models such as the United States Core Data for Interoperability (USCDI) v3. It generates scorecards that surface data quality issues so that data sources can make the necessary improvements.

For biotech SaaS firms, the "garbage-in, garbage-out" principle is a major obstacle to deploying effective AI and LLM applications. Data fragmented across lab instruments, databases, and external sources creates silos that impede comprehensive analysis. To prepare for AI, organizations must consolidate data into unified repositories and implement robust data pipelines (ETL/ELT) to minimize manual preprocessing.

A modern data architecture ready for AI and LLM integration requires more than a traditional data warehouse. It needs a robust metadata layer and a semantic layer that give LLMs context, enabling them to understand business-specific terms. The architecture must also natively support unstructured data and incorporate vector databases or embedding indexes to enable Retrieval-Augmented Generation (RAG), a technique that grounds LLM outputs in up-to-date, specific information.

Emerging technologies like the Model Context Protocol (MCP) are becoming crucial for creating interoperable and scalable AI systems. MCP, an open standard, allows AI models to connect with disparate external data sources and tools through a common protocol, reducing the need for custom integrations. For biotech, MCP can connect LLM-based assistants to specialized databases for literature, genomics, and clinical trials, transforming R&D workflows. This standardized approach can also enhance data security by centralizing access control.

The business case for improving data quality extends beyond risk mitigation to significant efficiency gains and innovation opportunities. High-quality, interoperable data allows clinical researchers and data analysts to shift their focus from data cleaning to higher-value activities like study design and meaningful analytics. This foundational investment enables greater vendor flexibility, simplifies partnership integrations, and accelerates the adoption of new technologies.
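
To make the scorecard idea concrete, here is a minimal Python sketch of checking records *before* they reach a model. It is not the PIQI specification: the record layout, the `Check` class, the rule functions, and the heart-rate range are all assumptions chosen for illustration. Accuracy is left out because it usually requires comparison against a trusted reference source.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Check:
    """One hypothetical rule: a dimension, a field, and a pass/fail test."""
    dimension: str
    field: str
    passes: Callable[[Optional[str]], bool]

def is_present(value):
    # Availability: the field exists and is not blank.
    return value is not None and value.strip() != ""

def is_loinc_like(value):
    # Conformance: LOINC-style codes are digits, a hyphen, and one check digit.
    if not is_present(value):
        return False
    head, sep, tail = value.partition("-")
    return sep == "-" and head.isdigit() and tail.isdigit() and len(tail) == 1

def is_plausible_heart_rate(value):
    # Plausibility: assumed range of 20-300 bpm; missing values count as failures.
    try:
        return 20 <= float(value) <= 300
    except (TypeError, ValueError):
        return False

def scorecard(records, checks):
    """Return the pass rate per dimension across all records."""
    totals = {}
    for check in checks:
        passed = sum(1 for r in records if check.passes(r.get(check.field)))
        p, n = totals.get(check.dimension, (0, 0))
        totals[check.dimension] = (p + passed, n + len(records))
    return {dim: p / n for dim, (p, n) in totals.items() if n}

# Made-up observations; "8867-4" is the LOINC code for heart rate.
records = [
    {"code": "8867-4", "value": "72"},
    {"code": "heart rate", "value": "720"},
    {"code": "8867-4", "value": None},
    {"code": None, "value": "65"},
]
checks = [
    Check("availability", "code", is_present),
    Check("availability", "value", is_present),
    Check("conformance", "code", is_loinc_like),
    Check("plausibility", "value", is_plausible_heart_rate),
]
card = scorecard(records, checks)
print(card)
```

A pipeline could use such a scorecard as a gate, for example by refusing to pass a feed to a model when any dimension falls below an agreed threshold.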

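The retrieval step behind RAG can be sketched in a few lines of Python. This is a toy under stated assumptions: word counts stand in for learned embeddings, a plain list stands in for a vector database, and the documents and function names are hypothetical. A production system would swap in an embedding model and a real index.

```python
import math
import re
from collections import Counter

def embed(text):
    # Stand-in "embedding": lowercase word counts (an assumption for illustration).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    # Ground the model by placing only the retrieved passages in the prompt.
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up knowledge base entries.
docs = [
    "Assay A-17 measures protein concentration in plasma samples.",
    "The freezer in lab 3 is serviced every quarter.",
    "Plasma samples must be stored at minus 80 degrees Celsius.",
]
query = "How should plasma samples be stored?"
top = retrieve(query, docs, k=2)
prompt = build_prompt(query, docs, k=2)
print(prompt)
```

The point for data quality is that retrieval only surfaces what is in the index: if the stored passages are stale or wrong, the grounded answer will be too.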
