Indexing beats pure vectors

Benchmarks and new repos suggest that hybrid and tree‑index approaches can outperform straight vector search for real RAG tasks: Weaviate’s multimodal hybrid approach led PDF RAG metrics, and PageIndex’s tree approach reports 98.7% accuracy on a finance benchmark. (x.com) (x.com) A Baseten case study also says Hebbia improved time to first token 4× and cut inference costs more than 10× for finance search, suggesting that engineering choices around the indexing and retrieval layers can beat raw model tweaks for product outcomes. (x.com)

Indexing is having a quiet comeback in artificial intelligence search. For the last two years, the default recipe for retrieval-augmented generation has been simple: split documents into chunks, turn each chunk into a vector, and use nearest-neighbor search to fetch the most similar text for a large language model. That approach worked well enough to become the industry baseline. But a new set of benchmarks, repositories, and case studies suggests the baseline is not the finish line. In real document-heavy systems, especially those built on portable document format (PDF) files and finance workflows, the retrieval layer itself is becoming the main performance lever. (arxiv.org)

The core problem is that vector search is a blunt instrument for structured documents. A vector is a compressed numerical fingerprint of meaning. That is useful when a user asks for a broad concept like “revenue guidance” or “termination clause.” But long business documents are not just bags of meanings. They have page layouts, tables, headings, footnotes, charts, and section hierarchies. When those structures get flattened into generic chunks, the system often retrieves text that is semantically nearby but operationally wrong. A paragraph can sound related while missing the exact row in a table or the figure on a page that actually answers the question. (github.com)

That gap is why retrieval-augmented generation, often shortened to RAG, has become less about “which model” and more about “which index.” In a RAG system, the model usually does not know the company’s private documents, filings, manuals, or reports by itself. The system first retrieves supporting material, then asks the model to answer using that material. If retrieval misses the right evidence, a stronger model often just produces a more fluent mistake. Better indexing can therefore beat better generation, because it changes what evidence the model ever sees. (arxiv.org)

One branch of that shift is hybrid retrieval.
Hybrid systems combine multiple signals instead of relying only on vector similarity. A query can be matched through semantic embeddings, keyword overlap, metadata filters, page structure, or modality-specific features for images and tables. The point is not to replace vectors entirely. The point is to stop asking one signal to do every job. In documents where exact terms, layout clues, and visual elements all matter, hybrid retrieval can recover evidence that pure vector search misses. (weaviate.io)

Weaviate is one of the clearest examples of that trend. Weaviate has been pushing multimodal retrieval, where text and images can live in the same search workflow rather than being forced into text-only preprocessing. Its recent materials on multimodal retrieval-augmented generation and multi-vector portable document format search describe systems that index both textual and visual content, including figures inside documents. In Weaviate’s framing, the gain comes from preserving more of the original document instead of throwing away structure during chunking. (weaviate.io)

That matters because portable document format files are where many retrieval systems break. A portable document format file is not a clean database row. It is a visual container. Tables may not read left-to-right after extraction. Captions can drift from images. Important numbers can sit in a chart, not in the body text. The Open RAG Benchmark project is built around exactly this problem, describing a multimodal portable document format evaluation set with text, tables, and images drawn from arXiv documents. In that setting, multimodal and multi-vector approaches have a natural advantage because they are designed to search more than plain text chunks. (github.com)

Another branch of the shift goes further and questions vector databases altogether.
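
As a concrete illustration of the hybrid retrieval described above, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion. It is an assumption chosen for illustration, not Weaviate's implementation. The function name, the document ids, and the two input rankings are hypothetical, and k=60 is only a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked well by more than one signal rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two independent retrievers.
keyword_hits = ["table_p12", "footnote_p40", "intro_p1"]  # exact-term match
vector_hits = ["summary_p3", "table_p12", "intro_p1"]     # embedding similarity

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

In this toy case the table on page 12 ranks first because both signals found it, while chunks seen by only one retriever fall behind. A real system would add further signals such as metadata filters or image features in the same way, as extra ranked lists.
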
PageIndex is a document indexing approach that organizes material into a tree-like structure and retrieves by navigating that hierarchy instead of doing straight embedding similarity search. The idea is closer to how a person reads a long filing: start with the table of contents, choose the relevant section, then drill down to the exact page or subsection. That is slower to explain than “nearest vectors,” but it maps better to documents that already have strong internal structure. (github.com)

The headline number attached to PageIndex is hard to ignore. Several public repositories describing PageIndex say a reasoning-based system powered by the method reached 98.7 percent accuracy on FinanceBench, a benchmark for financial document question answering. Those repositories also explicitly position the result as outperforming traditional vector-based retrieval-augmented generation on that task. Because the claim currently appears in repositories rather than a polished peer-reviewed paper, it should be read as a strong directional signal rather than a settled industry standard. But even as a directional signal, it is notable. (github.com)

Finance is exactly where this kind of retrieval change should show up first. Financial research is unusually sensitive to precision. A missed footnote in a quarterly filing can matter more than a beautifully summarized paragraph. Analysts often need one exact debt covenant, one table cell, one guidance number, or one accounting exception. That makes finance a harsh test for chunk-and-embed pipelines, because “similar enough” is often not enough. Benchmarks focused on financial texts and tables have been emerging for that reason, including text-and-table retrieval work and multimodal finance retrieval benchmarks that emphasize visual evidence and traceability. (huggingface.co)

The commercial evidence points in the same direction.
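
The table-of-contents style of navigation described above for tree indexes can be sketched in a few lines. This is not PageIndex's code. The Node class, the sample filing, and the word-overlap score are all assumptions made for illustration, and the overlap score is only a stand-in for whatever reasoning step a real system uses to choose a section.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a document's section hierarchy (hypothetical structure)."""
    title: str
    children: list[Node] = field(default_factory=list)


def overlap(query: str, node: Node) -> int:
    """Stand-in relevance score: query words that also appear in the title."""
    query_words = set(query.lower().split())
    return len(query_words & set(node.title.lower().split()))


def navigate(root: Node, query: str) -> list[str]:
    """Walk from the root to a leaf, taking the best-matching child each step."""
    path, node = [root.title], root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child))
        path.append(node.title)
    return path


# Hypothetical filing outline; a real index would be built from the document.
filing = Node("Annual report", children=[
    Node("Business overview", children=[Node("Products"), Node("Markets")]),
    Node("Financial statements", children=[
        Node("Income statement"),
        Node("Notes on debt covenants"),
    ]),
])

path = navigate(filing, "debt covenants in the financial statements")
print(" > ".join(path))
```

The point of the sketch is the shape of the search: the query never gets compared against every chunk, only against the handful of sections at each level of the outline.
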
Baseten’s customer case study on Hebbia does not claim that indexing alone caused the gains, but it does show how much product performance can move when the retrieval and inference stack are engineered together. Baseten says Hebbia achieved a 2.5 times increase in tokens per second, a 4 times improvement in time to first token, and more than 10 times lower inference cost after shifting deployment architecture. For buyers of enterprise search tools, those are not lab curiosities. They are the difference between a tool that feels instant and one that feels unusable. (baseten.co)

That is the larger lesson in this week’s story. The popular narrative in artificial intelligence still centers on model upgrades: bigger context windows, better reasoning, new multimodal releases. But retrieval systems are increasingly showing that architecture choices underneath the model can produce larger practical gains than swapping one frontier model for another. If the index finds the right page, table, and figure, a smaller or cheaper model can suddenly look much smarter. If the index misses, even the best model is guessing from bad evidence. (baseten.co)

That does not mean pure vector search is obsolete. Vectors remain fast, flexible, and very good for many broad semantic retrieval tasks. They are still the easiest starting point for a new retrieval-augmented generation system, and hybrid systems often keep vectors as one component rather than removing them. The change is that teams are becoming less willing to treat vector search as the whole retrieval strategy. In document-heavy products, the winning setup increasingly looks like vectors plus structure, vectors plus metadata, or vectors replaced by a navigation scheme tailored to the document itself. (weaviate.io)

So the phrase “indexing beats pure vectors” is less a slogan than a design update. The retrieval-augmented generation stack is maturing.
Early systems asked, “How do we bolt private data onto a language model?” Newer systems ask, “What is the right representation of this document before retrieval even starts?” The answers now include multimodal indexes for portable document format files, hybrid retrieval for mixed signals, and tree-based navigation for long structured reports. The common theme is simple: the closer the index matches the shape of the source material, the better the system performs. (github.com)

If this trend holds, the next wave of retrieval products will not be won by whoever
