OCR at scale: 27K papers on 16 GPUs
vLLM reported an OCR run on 27,000 arXiv papers using the 5B Chandra‑OCR‑2 model across 16 L40S GPUs, achieving roughly 60 papers per GPU per hour and an $850 total cost for the job. The post frames this as a concrete example of scaling inference for large document-processing pipelines using moderate‑sized models and GPU fleets. (x.com)
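As a rough sanity check, the reported figures can be combined into a few derived numbers. The sketch below uses only the values quoted in the post. The wall-clock time, cost per paper, and per-GPU-hour price it computes are back-of-the-envelope estimates that assume every GPU stayed busy for the whole run. They are not figures from vLLM or Hugging Face.

```python
# Figures as reported in the post.
PAPERS = 27_000
GPUS = 16
PAPERS_PER_GPU_HOUR = 60
TOTAL_COST_USD = 850

# Derived estimates (assumption: all GPUs are fully utilised for the whole run).
fleet_rate = GPUS * PAPERS_PER_GPU_HOUR        # papers per hour across the fleet
wall_clock_hours = PAPERS / fleet_rate         # elapsed hours for the job
gpu_hours = wall_clock_hours * GPUS            # total GPU-hours consumed
cost_per_paper = TOTAL_COST_USD / PAPERS       # dollars per paper
cost_per_gpu_hour = TOTAL_COST_USD / gpu_hours # implied price, not a quoted rate

print(f"fleet throughput:  {fleet_rate} papers/hour")
print(f"wall-clock time:   {wall_clock_hours:.1f} hours")
print(f"GPU-hours:         {gpu_hours:.0f}")
print(f"cost per paper:    ${cost_per_paper:.3f}")
print(f"implied GPU price: ${cost_per_gpu_hour:.2f}/GPU-hour")
```

Under these assumptions the job takes about 28 hours of wall-clock time and costs roughly three cents per paper.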
Optical character recognition (OCR) is the step that turns a paper’s pages into machine-readable text, and vLLM said it ran that job on 27,000 arXiv papers across 16 NVIDIA L40S GPUs. (x.com) The post said the run used Datalab’s Chandra-OCR-2, a 5 billion-parameter model, and averaged about 60 papers per GPU per hour at a total cost of about $850. (x.com) Hugging Face described the same project in an April 7, 2026 blog post, saying it ran OCR on about 30,000 papers in roughly 29 to 30 hours on 16 parallel L40S instances using vLLM on Hugging Face Jobs. (huggingface.co)

The bottleneck was not finding papers but getting their text. Hugging Face said about 27,000 papers indexed on its site did not have an arXiv HTML page, which blocked its “chat with paper” feature because that system uses paper text converted into Markdown as context for HuggingChat. (huggingface.co)

OCR used to mean reading plain scanned text. Newer document models try to keep the page structure too, so tables, formulas, checkboxes, images, and multi-column layouts survive the conversion instead of collapsing into a text dump. (huggingface.co 1) (huggingface.co 2) That is the pitch behind Chandra-OCR-2. Its model card says it outputs Markdown, HTML, and JSON, preserves layout information, supports more than 90 languages, and improves on math, tables, handwriting, and complex layouts. (huggingface.co)

Hugging Face said it picked Chandra-OCR-2 because it led the OlmOCRBench leaderboard at the time of writing, and that the model’s OpenRAIL license allowed commercial use through frameworks including Transformers and vLLM. (huggingface.co 1) (huggingface.co 2)

vLLM’s role was the plumbing for batch inference: loading the model once, feeding it many inputs, and keeping the GPUs busy.
vLLM’s documentation says it supports offline batched inference, and its Ray Data example describes automatic sharding, load balancing, retries, and continuous batching across clusters. (docs.vllm.ai 1) (docs.vllm.ai 2)

The hardware choice also matters. NVIDIA markets the L40S as a data-center GPU for generative AI inference and other compute-heavy workloads, which puts this run in the category of rented production infrastructure rather than a lab demo on a single card. (nvidia.com)

The thread from vLLM lands as open-source OCR is getting more specialized and more deployable: smaller document models, serverless GPU fleets, and batch inference software are now being used to process tens of thousands of papers rather than to power a handful of demos. (x.com) (huggingface.co)
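To make the terms "sharding" and "retries" concrete, here is a minimal plain-Python sketch of splitting a paper list across 16 workers and retrying a failed item. It is an illustration only: it does not use vLLM or Ray Data, it is not the pipeline Hugging Face ran, and the names `shard_round_robin`, `run_with_retries`, and `fake_ocr` are hypothetical helpers invented for this example.

```python
from typing import Callable, Dict, List


def shard_round_robin(items: List[str], num_workers: int) -> List[List[str]]:
    """Split items into num_workers shards whose sizes differ by at most one."""
    return [items[i::num_workers] for i in range(num_workers)]


def run_with_retries(task: Callable[[str], str], item: str, max_attempts: int = 3) -> str:
    """Call task(item), retrying on any exception up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(item)
        except Exception as exc:  # a real pipeline would catch narrower errors
            last_error = exc
    raise RuntimeError(f"{item} failed after {max_attempts} attempts") from last_error


def fake_ocr(paper_id: str) -> str:
    # Hypothetical stand-in for a model call; returns placeholder Markdown.
    return f"# {paper_id}\n"


# 27,000 placeholder paper IDs split across 16 workers, matching the run's scale.
paper_ids = [f"paper-{n:05d}" for n in range(27_000)]
shards = shard_round_robin(paper_ids, num_workers=16)

# Process one worker's shard; each worker would do the same for its own shard.
results: Dict[str, str] = {pid: run_with_retries(fake_ocr, pid) for pid in shards[0]}

print(len(shards), "shards;", len(shards[0]), "papers in the first shard")
```

Static round-robin sharding is the simplest option. The load balancing that the Ray Data example describes goes further by handing out work dynamically, which matters when papers vary widely in page count.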