Databricks Adds AI Document Parsing to SQL
Databricks is rolling out `ai_parse_document`, a new SQL function that uses LLMs to extract structured data from unstructured documents like PDFs and clinical notes. The feature is in public preview and is HIPAA compliant, targeting a major bottleneck in healthcare analytics pipelines by automating ingestion and processing of sensitive data.
The `ai_parse_document` function is a key component of Databricks' Agent Bricks framework, designed to unlock the estimated 80% of enterprise data trapped in unstructured formats like PDFs and reports. The aim is to make complex documents a native, queryable data type within the lakehouse, removing the need for separate, brittle processing pipelines.
Unlike traditional OCR, which only extracts text, the function captures a document's full structure: it preserves tables as HTML, identifies figures and attaches AI-generated descriptions, and retains spatial metadata such as bounding boxes. The output is stored in Unity Catalog, so documents can be governed and searched with the same tools as structured data.
On the platform side, the function integrates directly with Lakeflow's Spark Declarative Pipelines, which enables scalable, incremental processing at production volumes. As new documents land in cloud storage from sources like SharePoint or S3, they are parsed automatically, with no custom orchestration code and no reprocessing of existing files.
The release intensifies the rivalry with Snowflake, whose Cortex platform offers a similar document parsing function. Databricks is competing on cost and performance, claiming that its agentic system for multimodal understanding delivers comparable or better quality at a 3-5x lower price than leading alternatives.
For analytics engineers, this single SQL function replaces complex ETL jobs that historically relied on a patchwork of OCR libraries, regular expressions, and custom scripts. That eases the data ingestion bottleneck, particularly for challenging formats in regulated industries, such as FDA Complete Response Letters in life sciences.
The function is also foundational for building more effective AI agents and Retrieval-Augmented Generation (RAG) applications.
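One way structure-aware parsing can help retrieval is by chunking along document elements instead of fixed character windows. The Python snippet below is a minimal sketch of that idea, not Databricks code: the `elements` list, its field names (`type`, `content`, `page`), and the sample text are all made up for illustration and are not taken from the function's documented output schema.

```python
from typing import Dict, List

# Hypothetical parsed elements; field names and text are illustrative only.
elements: List[Dict] = [
    {"type": "title", "content": "Complete Response Letter", "page": 0},
    {"type": "text", "content": "The application cannot be approved in its present form.", "page": 0},
    {"type": "table", "content": "<table><tr><td>Batch</td><td>Result</td></tr></table>", "page": 1},
    {"type": "figure", "content": "Bar chart of dissolution rates by batch.", "page": 1},
]


def to_chunks(elements: List[Dict], max_chars: int = 200) -> List[Dict]:
    """Merge consecutive text-like elements; keep each table and figure whole."""
    chunks: List[Dict] = []
    buffer: List[str] = []
    buffer_len = 0

    def flush() -> None:
        # Emit the accumulated text as one chunk, if there is any.
        nonlocal buffer, buffer_len
        if buffer:
            chunks.append({"kind": "text", "text": " ".join(buffer)})
            buffer, buffer_len = [], 0

    for el in elements:
        if el["type"] in ("table", "figure"):
            flush()
            chunks.append({"kind": el["type"], "text": el["content"]})
        else:
            # Start a new text chunk when the size budget would be exceeded.
            if buffer and buffer_len + len(el["content"]) > max_chars:
                flush()
            buffer.append(el["content"])
            buffer_len += len(el["content"])
    flush()
    return chunks


chunks = to_chunks(elements)
for chunk in chunks:
    print(chunk["kind"], "->", chunk["text"][:40])
```

Keeping a table or figure description as its own chunk means a vector index never splits a table in half, which is the kind of context that plain text extraction tends to lose.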
By providing structured, high-fidelity context from business documents, the function improves the accuracy of vector search and gives LLMs a richer understanding of enterprise knowledge.
Processing occurs within the Databricks security perimeter, which is central to the feature's HIPAA compliance. Healthcare organizations can build analytics pipelines on sensitive data like clinical notes, knowing that electronic Protected Health Information (ePHI) is handled within a secure and compliant environment.
The function supports a range of formats, including PDF, DOCX, PPTX, and image files (JPG, PNG). It can be combined with other AI functions such as `ai_query` to run further extraction on the parsed content, allowing teams to build multi-step document intelligence workflows entirely within SQL.
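As a rough sketch of such a two-step workflow, the Python snippet below only assembles the SQL text and prints it; it does not connect to a workspace. The volume path, endpoint name, and instruction are placeholders, and casting the parsed result to a string before handing it to `ai_query` is an assumption made for illustration rather than a documented recipe. `read_files` with `format => 'binaryFile'` is used to load raw file bytes, and the helper names are hypothetical.

```python
def sql_literal(value: str) -> str:
    """Wrap a value in single quotes, refusing values that would need escaping."""
    if "'" in value or "\\" in value:
        raise ValueError("quotes and backslashes are not supported in this sketch")
    return f"'{value}'"


def build_pipeline_sql(volume_path: str, endpoint: str, instruction: str) -> str:
    """Return SQL that parses each file, then sends the parsed content to a model."""
    parts = [
        # Step 1: parse every file under the volume into structured output.
        "WITH parsed AS (",
        "  SELECT path, ai_parse_document(content) AS doc",
        f"  FROM read_files({sql_literal(volume_path)}, format => 'binaryFile')",
        ")",
        # Step 2: prepend an instruction and ask a model endpoint to extract fields.
        "SELECT",
        "  path,",
        f"  ai_query({sql_literal(endpoint)}, "
        f"concat({sql_literal(instruction)}, CAST(doc AS STRING))) AS extracted",
        "FROM parsed",
    ]
    return "\n".join(parts)


sql = build_pipeline_sql(
    "/Volumes/main/default/clinical_docs/",  # hypothetical Unity Catalog volume
    "my-llm-endpoint",                       # hypothetical model serving endpoint
    "List the medications mentioned: ",      # hypothetical instruction
)
print(sql)
```

In a notebook, the resulting string could be passed to `spark.sql(sql)`; in practice the parsed output would likely be narrowed to the relevant fields before prompting, to keep requests small.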