Researchers Boost Vector Search on Disaggregated Memory

A new algorithm called d-HNSW demonstrates efficient, large-scale vector search on disaggregated memory architectures. The research addresses a key bottleneck in AI systems by improving a core operation for LLM retrieval and recommendation engines, particularly where HBM and DRAM allocation is constrained.

- Memory is a primary bottleneck and cost driver in scaling AI, with High-Bandwidth Memory (HBM) and DRAM sometimes accounting for 30-40% of total AI system costs in a datacenter. This has created an industry-wide push for architectural solutions that can ease the demand for expensive memory directly coupled to GPUs.
- Disaggregated memory is an architectural shift in data centers that separates compute and memory into independent pools connected by a high-speed network, allowing resources to be scaled more flexibly and efficiently. This approach contrasts with monolithic designs where memory capacity is fixed to a specific server or compute instance.
- The d-HNSW algorithm was developed by researchers from the University of California, Santa Cruz, and is the first vector search engine designed specifically for disaggregated memory systems that use Remote Direct Memory Access (RDMA). RDMA allows compute nodes to access remote memory directly, bypassing the CPU on the memory node, which is crucial for low-latency operations.
- Standard HNSW, a widely-used algorithm for vector search, is inefficient on disaggregated architectures because its greedy graph traversal requires many round trips over the network, creating high latency and bandwidth consumption. A naive implementation would be impractical due to the limited cache on compute nodes and the sheer size of the graph index stored in remote memory.
- To solve this, d-HNSW introduces three key techniques: caching a lightweight "representative" index on the compute node to minimize remote access, using an RDMA-friendly data layout to reduce network round trips, and employing batched query-aware data loading to optimize bandwidth usage.
- The research demonstrates a significant performance leap, with d-HNSW reducing query latency by up to 117x compared to a naive implementation on disaggregated memory, while maintaining a recall of 0.87 on the SIFT1M benchmark dataset.
- Algorithmic breakthroughs like this directly impact the build vs. buy decisions for hyperscalers and AI companies by potentially lowering the total cost of ownership (TCO). By optimizing data movement, such solutions can reduce the reliance on more expensive, power-hungry hardware configurations for large-scale recommendation engines and Retrieval-Augmented Generation (RAG) systems.
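
To make the three techniques above more concrete, here is a minimal toy sketch in Python. It is not the researchers' implementation: it replaces the HNSW graphs with brute-force distance checks, simulates the remote memory node with an in-process object, and simply counts "round trips". All names (`RemotePool`, `batched_search`, `reps`) are hypothetical, and picking representatives by random sampling is an assumption made only for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, candidates):
    """Index of the candidate closest to point (brute force, not HNSW)."""
    return min(range(len(candidates)), key=lambda i: sq_dist(point, candidates[i]))

class RemotePool:
    """Stand-in for a remote memory node; counts simulated round trips."""
    def __init__(self, vectors, reps):
        # Group vectors by nearest representative so each partition could be
        # stored contiguously and fetched with a single read.
        self.partitions = [[] for _ in reps]
        for vid, vec in enumerate(vectors):
            self.partitions[nearest(vec, reps)].append((vid, vec))
        self.round_trips = 0

    def read_partition(self, p):
        self.round_trips += 1  # one simulated remote read per partition
        return self.partitions[p]

def batched_search(queries, reps, pool):
    # Step 1: route every query using only the locally cached representatives.
    by_partition = {}
    for qi, q in enumerate(queries):
        by_partition.setdefault(nearest(q, reps), []).append(qi)
    # Step 2: fetch each needed partition once and answer all queries routed to it.
    results = [None] * len(queries)
    for p, query_ids in by_partition.items():
        members = pool.read_partition(p)
        if not members:
            continue  # empty partition: leave those results as None
        vecs = [vec for _, vec in members]
        for qi in query_ids:
            results[qi] = members[nearest(queries[qi], vecs)][0]
    return results

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
reps = rng.sample(vectors, 8)   # hypothetical choice of representatives
pool = RemotePool(vectors, reps)
queries = vectors[:64]
answers = batched_search(queries, reps, pool)
print(pool.round_trips, "remote reads for", len(queries), "queries")
```

In this sketch, a batch of 64 queries triggers at most 8 simulated remote reads (one per partition touched) rather than one or more per query. Searching only the routed partition is approximate, which mirrors the recall trade-off mentioned above, though the numbers here say nothing about the real system's performance.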
