vLLM: cheap, large‑scale OCR
vLLM reported an OCR run over 27,000 arXiv papers using a 5B model on NVIDIA L40S GPUs that cost about $850 and completed in roughly 29 hours without crashes. (x.com) The demonstration highlights a low-cost batch inference pattern for large document collections on GPU instances. (x.com)
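The reported totals can be turned into rough unit figures with simple arithmetic. The sketch below uses only the approximate numbers quoted above ($850, 27,000 papers, 29 hours); the variable names are illustrative, and because the post as quoted does not give a GPU count, no per-GPU rate is derived.

```python
# Approximate figures as reported; treat every derived value as a rough estimate.
total_cost_usd = 850.0   # "about $850"
num_papers = 27_000      # arXiv papers processed
wall_clock_hours = 29.0  # "roughly 29 hours"

# Average spend per paper across the whole run.
cost_per_paper = total_cost_usd / num_papers

# Average throughput over the full wall-clock time.
papers_per_hour = num_papers / wall_clock_hours

# Average spend rate for the job as a whole (not per GPU).
cost_per_hour = total_cost_usd / wall_clock_hours

print(f"cost per paper:  ${cost_per_paper:.4f}")
print(f"papers per hour: {papers_per_hour:.1f}")
print(f"spend per hour:  ${cost_per_hour:.2f}")
```

On these numbers the run works out to a little over three cents per paper, at a pace of more than 900 papers per hour.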
Optical character recognition is the step that turns page images into machine-readable text, and vLLM said it ran that process across 27,000 arXiv papers for about $850. (x.com) The project said the job used a 5-billion-parameter model on NVIDIA L40S graphics processors, finished in about 29 hours, and did not crash during the run. (x.com)
vLLM is an open-source inference engine for large language models, and its documentation includes offline batch inference workflows built around Ray Data, which shards datasets and load-balances work across a cluster. (docs.vllm.ai) The software was built to keep more requests in memory at once. The University of California, Berkeley team behind vLLM said its PagedAttention design cuts waste in the key-value cache, the working memory large language models use while generating tokens. (sky.cs.berkeley.edu) That matters for document jobs because optical character recognition on research papers is usually a batch problem, not a chatbot problem: it involves thousands of files, long inputs, and a need to keep graphics processors busy instead of waiting between requests. (docs.vllm.ai)
The hardware choice also fits that pattern. NVIDIA says the L40S is a data-center graphics processor aimed at inference and lists it as part of its lineup for generative artificial intelligence and large language model workloads. (nvidia.com) A 5-billion-parameter model is small enough to be cheaper to run than the largest vision-language systems, but still large enough to parse page layouts, equations, and multi-column text that often trip up older optical character recognition tools. The Allen Institute for Artificial Intelligence's olmOCR project, another document-parsing toolkit, says modern vision models can convert PDFs into clean Markdown and handle equations, tables, handwriting, and reading order. (github.com)
vLLM has been pushing beyond online chat serving into offline jobs. PyTorch's project page says the system is designed for both OpenAI-compatible servers and offline batch inference, and can scale to multi-node runs. (pytorch.org) The immediate result is less about one arXiv scrape than about a template: rent inference hardware, queue a large document set, and process it in one pass without building a custom serving stack. (x.com)
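The queue-and-process template can be illustrated in plain Python. This is a minimal sketch of the batching idea only, not vLLM's or Ray Data's API and not the setup used in the reported run: `iter_batches`, `run_ocr_batch`, `process_all`, the file names, and the batch size are all hypothetical, and the OCR step is a stub standing in for a real model call.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches holding at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def run_ocr_batch(paths: List[str]) -> List[str]:
    """Hypothetical stand-in for a model call; returns placeholder text."""
    return [f"<text of {p}>" for p in paths]


def process_all(
    paths: Iterable[str],
    batch_size: int,
    ocr: Callable[[List[str]], List[str]] = run_ocr_batch,
) -> Dict[str, str]:
    """Walk the whole queue once, batch by batch, collecting outputs by path."""
    results: Dict[str, str] = {}
    for batch in iter_batches(paths, batch_size):
        for path, text in zip(batch, ocr(batch)):
            results[path] = text
    return results


# Hypothetical document queue; a real job would list files from storage.
papers = [f"paper_{i:05d}.pdf" for i in range(10)]
outputs = process_all(papers, batch_size=4)
print(len(outputs), "documents processed")
```

In a real deployment the stub would be replaced by calls into an inference engine, and a framework such as Ray Data would take over the sharding and load balancing that this single-process loop leaves out.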