LlamaIndex open‑sources ParseBench
LlamaIndex released ParseBench, the first OCR benchmark aimed at agentic document parsing across tables, charts and enterprise documents, and reported LlamaParse leading at 84.9% on the benchmark. The benchmark is positioned as a way to measure real‑world parsing accuracy for agents that need to read complex documents. (x.com)
LlamaIndex has released ParseBench, an open-source benchmark for testing whether artificial intelligence systems can read complex business documents well enough for software agents to act on them. (llamaindex.ai)

The company published the benchmark on April 13, 2026, alongside a paper posted to arXiv on April 9, 2026. LlamaIndex said ParseBench covers about 2,000 human-verified pages and more than 167,000 test rules across 14 parsing methods. (llamaindex.ai) (arxiv.org)

Document parsing is the step that turns a Portable Document Format file, scan, chart, or spreadsheet into text or structured data a model can use. LlamaIndex said its top LlamaParse configuration, called Agentic, scored 84.9% overall on ParseBench. (developers.llamaindex.ai) (arxiv.org)

The benchmark tests five failure points that can break automation: tables, charts, content faithfulness, semantic formatting, and visual grounding. In plain terms, it checks whether a parser keeps the right rows, numbers, formatting cues, and on-page references instead of just producing text that looks similar. (llamaindex.ai) (arxiv.org)

That is the gap LlamaIndex says older benchmarks often miss. The paper argues that common text-similarity measures can overlook errors like a shifted table header, a dropped footnote, or a chart reduced to raw text, even when those mistakes would cause an agent to make the wrong decision. (llamaindex.ai) (arxiv.org)

ParseBench is built from enterprise documents in insurance, finance, and government rather than mostly academic papers or web pages. The GitHub repository says the goal is to measure whether parsed output preserves the structure and meaning needed for autonomous decisions. (github.com)

LlamaIndex also made the benchmark artifacts public. The company said the dataset is hosted on Hugging Face, the evaluation code is on GitHub, and users can run quick tests locally with prebuilt pipelines or evaluate their own parser. (llamaindex.ai) (github.com) (huggingface.co)

The paper does not claim the field is solved. It says no single method was consistently strong across all five dimensions, even though LlamaParse Agentic posted the highest overall score in the reported results. (arxiv.org)

For companies building agents that read contracts, claims, filings, and reports, ParseBench gives them a public way to test one basic question: whether the system read the document correctly before it starts making decisions from it. (arxiv.org)
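
The gap the paper describes between text similarity and rule-style checks can be illustrated with a small sketch. It is not ParseBench's code, metric, or rule format. The table, the `text_similarity` and `cell_rule_passes` helpers, and the use of Python's standard `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. The example shifts a table header by one column and compares what each kind of check reports.

```python
from difflib import SequenceMatcher

# Hypothetical ground-truth table (not taken from ParseBench).
truth = [
    ["Region", "Q1", "Q2"],
    ["North", "120", "135"],
    ["South", "98", "110"],
]

# Hypothetical parser output: same cells, but the header row is shifted,
# so every number now sits under the wrong column name.
parsed = [
    ["Q1", "Q2", "Region"],
    ["North", "120", "135"],
    ["South", "98", "110"],
]


def flatten(table):
    """Render a table as plain text, one row per line."""
    return "\n".join(" | ".join(row) for row in table)


def text_similarity(a, b):
    """Character-level similarity in [0, 1], used here as a stand-in
    for a generic text-similarity score."""
    return SequenceMatcher(None, flatten(a), flatten(b)).ratio()


def cell_rule_passes(table, row_label, column, expected):
    """Illustrative rule: the cell at (row_label, column) equals expected."""
    header = table[0]
    if column not in header:
        return False
    col = header.index(column)
    for row in table[1:]:
        if row[0] == row_label:
            return row[col] == expected
    return False


if __name__ == "__main__":
    print("text similarity:", round(text_similarity(truth, parsed), 2))
    print("rule on truth:  ", cell_rule_passes(truth, "North", "Q1", "120"))
    print("rule on parsed: ", cell_rule_passes(parsed, "North", "Q1", "120"))
```

In this sketch the two renderings share most of their characters, so the similarity score stays high, while the rule that looks up the North row's Q1 value passes on the ground truth and fails on the shifted output. An agent answering "what was North's Q1 figure?" from the shifted table would read the wrong cell, which is the kind of error a similarity score alone can hide.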