IBM’s Docling turns docs into structured data
IBM released Docling, a free Python library that converts documents—PDFs, images and similar—into structured data to simplify preprocessing for data projects. That removes a common early bottleneck in ML pipelines and makes it easier to build reproducible ingestion steps for analytics or models. For projects that rely on messy sources, Docling can speed up dataset creation and let students focus on feature engineering and evaluation. (x.com)
# IBM’s Docling turns documents into structured data

Most data projects do not begin with modeling. They begin with cleanup. A team might have a folder full of annual reports, scanned forms, slide decks, manuals, or research papers, but those files are built for people to read, not for software to use. A portable document format file can show a table perfectly on screen while still hiding the table’s structure from a machine. A scanned image can contain a paragraph that looks obvious to a person and still be unreadable to a script unless optical character recognition is added first. That gap between “human-readable” and “machine-readable” is where a large share of analytics and machine learning time disappears.

IBM’s Docling is aimed at that exact problem. Docling is a free, open-source Python library that converts documents into structured, machine-readable representations, with support for formats including portable document format files and images. IBM Research introduced it as a toolkit for converting things like PDFs, manuals, and slide decks into data that developers can use in downstream artificial intelligence workflows. (research.ibm.com, github.com, docling-project.github.io)

## What Docling actually does

At a basic level, document conversion sounds simple: open a file and extract the text. In practice, that usually fails. Documents contain headings, page furniture, figures, captions, tables, formulas, reading order, and nested sections. A plain text dump often mixes these elements together, strips away hierarchy, and destroys the relationships that made the original document useful.

Docling’s pitch is that it preserves more of that structure. Its documentation says the library parses diverse formats, includes advanced portable document format understanding, and produces a unified representation called `DoclingDocument`.
That representation can express text, tables, pictures, document hierarchy, page layout information, and distinctions between the main body and headers or footers. (docling-project.github.io, docling-project.github.io)

The project also emphasizes details that matter in real preprocessing pipelines. The public site says Docling can detect tables, formulas, reading order, and optical character recognition output. IBM’s Granite Docling model page describes a related model that parses PDFs, slides, and scanned pages directly into structured, machine-readable formats rather than relying on a loose chain of disconnected extraction steps. (docling.ai, ibm.com)

In other words, Docling is not just trying to “get the words out.” It is trying to reconstruct enough of the document’s logic that the result can feed search systems, analytics pipelines, retrieval systems, or model training workflows without a large manual cleanup pass.

## Why this matters in machine learning pipelines

For many teams, document preprocessing is a hidden bottleneck. A machine learning pipeline often looks clean in a diagram: ingest data, transform it, train a model, evaluate results. But if the source material is a stack of messy documents, the ingestion step becomes the hardest part. Someone has to decide how to extract text, how to handle scanned pages, how to preserve tables, how to split sections, and how to make the process reproducible so the same raw files produce the same outputs next week.

That reproducibility point matters. If a team hand-cleans documents in notebooks or one-off scripts, the dataset becomes hard to regenerate. A library that standardizes ingestion makes it easier to rerun the same process across new files, compare versions, and debug errors when model performance changes.

Docling is designed for that kind of repeatable workflow. Its quickstart documentation shows a Python interface centered on a `DocumentConverter`, and the project supports command-line use as well.
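As a rough illustration of what a repeatable ingestion step could look like, here is a minimal sketch. The `DocumentConverter` call mirrors Docling's published quickstart; everything else (the `ingest` helper, the manifest layout, and the output file names) is a hypothetical convention invented for this example, not part of Docling. The Docling import is deferred into its own function so the rest of the sketch can be exercised with a stand-in converter.

```python
import hashlib
import json
from pathlib import Path


def docling_to_markdown(source: str) -> str:
    """Convert one document to Markdown, following Docling's quickstart.

    The import is lazy so the rest of this sketch runs without Docling installed.
    """
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    return result.document.export_to_markdown()


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def ingest(paths, out_dir, convert=docling_to_markdown):
    """Convert each file and record input/output hashes in a manifest.

    Hypothetical helper: files are processed in sorted order, and output
    names are derived from the source stem, so two sources that share a
    stem would overwrite each other in this simplified version.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for path in sorted(Path(p) for p in paths):
        markdown = convert(str(path))
        out_path = out_dir / (path.stem + ".md")
        out_path.write_text(markdown, encoding="utf-8")
        manifest.append(
            {
                "source": path.name,
                "source_sha256": sha256_hex(path.read_bytes()),
                "output": out_path.name,
                "output_sha256": sha256_hex(markdown.encode("utf-8")),
            }
        )
    # Assumed convention: one manifest.json per output directory.
    manifest_path = out_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    return manifest
```

A call such as `ingest(Path("raw_docs").glob("*.pdf"), "converted")` (both paths are placeholders) would write one Markdown file per source plus a `manifest.json`. Comparing manifests between runs shows which outputs changed, for example after a library upgrade or a replaced source file, which is the kind of debugging aid the reproducibility argument above calls for.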
The package on the Python Package Index also shows straightforward conversion into exports such as Markdown from either local files or web-hosted documents. (docling-project.github.io, pypi.org)

That makes the tool relevant beyond large enterprise teams. In a classroom, a capstone project, or a small research lab, students often spend disproportionate time wrestling with raw files before they can test any actual modeling ideas. A tool that automates more of the extraction and structuring work can shift effort toward feature engineering, error analysis, and evaluation.

## What is open-source here, and how active is it?

Docling is not a small demo project. The main repository is public on GitHub under the MIT License, and the Python Package Index listing also identifies the codebase as MIT-licensed. The project is hosted under the Linux Foundation AI & Data umbrella, according to both the project site and repository pages. (github.com, pypi.org, github.com)

The repository appears highly active. As of April 9, 2026, the GitHub page shows more than 57,000 stars and thousands of forks, with recent commits landing within hours or days. That does not guarantee quality by itself, but it does suggest the library has moved well past the stage of an obscure internal release. (github.com)

The surrounding ecosystem is also expanding. The project organization includes related repositories for serving Docling as an application programming interface, model packages, Java integration, and synthetic data generation workflows. (github.com, github.com)

## What kinds of output and workflows does it support?

Structured data can mean different things depending on the job. For one team, it means preserving section headings and paragraphs so a retrieval system can cite the right passage. For another, it means extracting tables into a format that can be loaded into a dataframe. For another, it means producing a consistent schema from invoices or reports.
Docling appears built to cover several of those paths. Its examples and documentation include conversion workflows, retrieval-augmented generation integrations, and information extraction against a user-defined schema. The extraction docs say a user can provide a schema template as either a dictionary or a Pydantic model, and Docling will return standardized output organized by page. (docling-project.github.io, docling-project.github.io)

That is a useful distinction. Some document tools stop at conversion. Docling is also positioning itself as a bridge from unstructured source files to typed outputs that fit directly into software systems.

## The technical angle behind the release

IBM Research framed Docling as a way to unlock enterprise document data for generative artificial intelligence. The company’s announcement described it as a toolkit for creating specialized data from PDFs and similar business documents so developers can customize models and ground them on trusted information. (research.ibm.com)

A later technical report on arXiv describes Docling as an efficient open-source toolkit for artificial-intelligence-driven document conversion that can parse multiple popular formats into a unified and richly structured representation. (arxiv.org)

That framing matches a broader shift in machine learning infrastructure. As large language models spread through enterprise software, the limiting factor is often not the model itself. It is the quality of the source material feeding the system. If contracts, manuals, financial reports, and scanned forms remain trapped in awkward formats, even strong models have less useful context to work with.

## What to watch next

The practical test for tools like Docling is not whether they can parse a clean sample file. It is whether they hold up on the ugly stuff: low-quality scans, inconsistent layouts, tables split across pages, mixed text and images, and documents created by many different software systems over many years.
Docling’s recent development activity suggests the project is still improving quickly, with ongoing additions to model support and serving infrastructure. (github.com, github.com)

If it keeps gaining adoption, its appeal will be simple: fewer custom parsers, fewer brittle preprocessing scripts, and a faster path from messy documents to usable datasets. For data teams, that can remove one of the least glamorous but most time-consuming parts of the pipeline. For students and smaller builders, it can mean spending less time fighting file formats and more time working on the parts of machine learning that actually change results.