Microsoft’s File→Markdown Tool

Microsoft released an open-source tool that converts files like PDFs, Excel sheets and audio into clean Markdown suitable for language models, removing a common preprocessing headache for LLM pipelines. That utility can simplify building retrieval-augmented systems by standardising inputs for vectorization and prompting. (x.com)

Most language models do not read a PDF the way you do. They see a jumble of text blocks, broken tables, page headers, and footnotes unless someone first turns the file into a cleaner format. (github.com) Markdown is that cleaner format. It is basically plain text with light labels for things like headings, bullet points, links, and tables, so a model can tell what is a title and what is a cell in a spreadsheet. (github.com)

Microsoft’s tool is called MarkItDown, and it is open source on GitHub under the MIT License. The repository describes it as a Python utility built to convert files into Markdown for large language model and text-analysis pipelines. (github.com) The file list is wider than just PDFs. MarkItDown says it can convert PowerPoint presentations, Word documents, Excel spreadsheets, images, audio, HTML pages, CSV, JSON, and XML files, ZIP archives, YouTube URLs, and EPUB books. (github.com)

The detail that makes this useful is structure preservation. The project says it tries to keep headings, lists, tables, and links intact, which is the difference between a model seeing an annual report as a document and seeing it as a bag of disconnected words. (github.com) That solves a boring problem that keeps showing up in retrieval systems. Before you can store a document in a vector database or feed it into a prompt, you usually need one ingestion step that turns many file types into one predictable text shape. (github.com)

MarkItDown also reaches beyond text documents. Its README says images can be processed with optical character recognition, and audio can be converted with speech transcription, so a scanned receipt or a meeting recording can land in the same Markdown pipeline as a spreadsheet. (github.com)

Microsoft’s own explanation for choosing Markdown is unusually direct. The README says mainstream models such as OpenAI’s GPT-4o already “speak” Markdown natively, and it adds that Markdown is token-efficient, which means less formatting overhead inside a model’s context window. (github.com)

The project is not a tiny side repo sitting untouched. On April 10, 2026, the GitHub page showed about 94,300 stars, about 5,700 forks, 305 commits, and a latest release tagged v0.1.5 on February 20, 2026. (github.com, github.com) Microsoft has also started wiring it into the newer agent stack. The repository says MarkItDown now has a Model Context Protocol server, which lets an assistant application call one conversion tool and hand it a file or web address without custom glue code for every format. (github.com, github.com)

The catch is in the fine print. The README says the output is meant for text-analysis tools rather than pixel-perfect reproduction, so MarkItDown is better thought of as a document shredder that sorts the pieces into neat labeled piles, not as a publishing tool that recreates the original page exactly. (github.com)
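The ingestion step described above can be sketched in Python. This is a minimal illustration, not the project's recommended pipeline. `file_to_markdown` follows the basic usage shown in the MarkItDown README (`MarkItDown().convert(...)` and `.text_content`) and assumes the `markitdown` package is installed. `split_by_headings` is a hypothetical helper, not part of MarkItDown, that shows why preserved headings matter: they give a retrieval system natural chunk boundaries before embedding.

```python
import re
from typing import Dict, List


def file_to_markdown(path: str) -> str:
    """Convert one file to Markdown text with MarkItDown.

    Assumes the third-party `markitdown` package is installed. The import is
    deferred so the rest of this sketch runs without it.
    """
    from markitdown import MarkItDown

    result = MarkItDown().convert(path)
    return result.text_content


# Matches ATX-style headings such as "# Title" or "### Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Dict[str, str]]:
    """Split Markdown into one chunk per heading (hypothetical helper).

    Simplification: a line starting with '#' inside a fenced code block is
    also treated as a heading.
    """
    chunks: List[Dict[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Keep a chunk only if it has a title or some non-blank text.
        if title or any(line.strip() for line in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # a new heading closes the previous chunk
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Report\n\nIntro text.\n\n## Revenue\n\n| Q | USD |\n|---|---|\n| Q1 | 10 |\n"
    for chunk in split_by_headings(sample):
        print(chunk["title"], "->", len(chunk["text"]), "chars")
```

In a real pipeline, the string passed to `split_by_headings` would come from `file_to_markdown("some_file.pdf")`, where the file name is a placeholder, and each chunk would then be embedded and stored alongside its title.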
