LlamaIndex Releases LiteParse
LlamaIndex has open-sourced LiteParse, a model-free PDF/Office parser that it claims processes 500 pages in 2 seconds (roughly 250 pages per second) and handles tables well. It is designed to plug into Claude and other agents, and is pitched as a fast, lightweight ingestion layer for agent workflows and retrieval systems. (x.com)
The run-llama/liteparse GitHub repository currently shows 149 commits and is published under the Apache-2.0 license. (github.com) The README documents a PDF.js-based spatial text parser, a built-in Tesseract.js OCR fallback, support for external HTTP OCR servers (EasyOCR, PaddleOCR), JSON and text output with precise bounding boxes, and page screenshot generation for agent workflows. (github.com) The project ships a CLI named lit and an npm package, @llamaindex/liteparse, installed with npm i -g @llamaindex/liteparse. A Homebrew tap (run-llama/homebrew-liteparse) provides the llamaindex-liteparse formula for macOS and Linux installs. (github.com 1) (github.com 2) A dataset_eval_utils subpackage in the repo runs LLM-based QA evaluation to compare text-extraction quality across multiple PDF parsers; those evaluation workflows require Python 3.12+ and an ANTHROPIC_API_KEY. (github.com) The repository layout also includes a packages/python directory and a CHANGELOG.md alongside docs and CONTRIBUTING files, which suggests maintained Python bindings and an active, changelog-tracked development process. (github.com)
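To illustrate why bounding boxes matter to a downstream agent, the sketch below groups extracted text fragments into reading-order lines using only their coordinates. It is a minimal illustration, not LiteParse's actual algorithm or output schema: the TextItem fields, the group_into_lines helper, and the y_tolerance default are all hypothetical names chosen for this example, so consult the README for the real JSON shape.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """One extracted text fragment with its bounding box.

    Hypothetical shape for illustration; not LiteParse's real schema.
    """
    text: str
    x: float       # left edge, in page units
    y: float       # top edge, in page units (y grows downward)
    width: float
    height: float


def group_into_lines(items: list[TextItem], y_tolerance: float = 2.0) -> list[str]:
    """Group fragments with nearly equal top edges into lines, left to right."""
    lines: list[list[TextItem]] = []
    for item in sorted(items, key=lambda it: (it.y, it.x)):
        # Compare against the first fragment of the current line so the
        # tolerance does not drift as more fragments are appended.
        if lines and abs(item.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])
    # Within each line, order fragments by their left edge.
    return [
        " ".join(it.text for it in sorted(line, key=lambda it: it.x))
        for line in lines
    ]


# Four fragments forming a two-row, two-column layout with slight y jitter.
sample = [
    TextItem("Revenue", 50.0, 100.0, 60.0, 12.0),
    TextItem("Q1", 200.0, 100.5, 20.0, 12.0),
    TextItem("Total", 50.0, 120.0, 40.0, 12.0),
    TextItem("42", 200.0, 119.2, 15.0, 12.0),
]

for line in group_into_lines(sample):
    print(line)
# prints:
# Revenue Q1
# Total 42
```

A fixed vertical tolerance is the simplest possible grouping rule; a real spatial parser would likely also account for font size, column gaps, and rotated text.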