Multimodal search case study

A technical deep dive outlines how Netflix builds multimodal video search that answers queries at the ‘speed of thought’ across billions of records, a data-engineering problem at scale that spans indexing, retrieval and ranking layers. The writeup is presented as a leadership case study in building low-latency, high-scale search over diverse media types. (X/Twitter)

Searching video with plain English means turning film into data first, and Netflix says it built that system to answer complex queries across billions of records at low latency. (medium.com)

A video search engine does not index only titles or captions. Netflix’s April 4, 2026 engineering post says its system combines signals from characters, scenes, dialogue and machine-generated embeddings, which are numerical summaries used to find similar content. (techlist.io)

The company framed the problem around raw footage volumes that can reach hundreds or thousands of hours for a season or franchise. A newsletter summary of the post said the goal was to help editorial teams surface key moments faster than keyword search could. (tldr.tech)

Netflix said the system does not rely on one model doing everything. A mirrored copy of the post says it orchestrates specialized models for character recognition, scene understanding and dialogue parsing, then aligns their outputs on a shared timeline. (tool.lu)

That timeline is broken into one-second buckets, which act like uniform index cards for video. Another summary of the post said Netflix stores model outputs in those buckets so a query can combine multiple clues tied to the same moment. (bool.dev)

The architecture is split into layers so writes and reads do different jobs. A reproduced excerpt says raw annotations first land in an annotation service backed by Apache Cassandra for high write throughput, then move through asynchronous processing before reaching Elasticsearch for querying. (entertainer.news)

That design matches a broader shift inside Netflix’s media data stack. In August 2025, LanceDB said Netflix had built a Media Data Lake to unify metadata, raw media and embeddings for machine learning workloads over petabytes of assets. (lancedb.com) Netflix described that 2025 effort as a new Media Machine Learning Data Engineering specialization. InfoQ’s August 25, 2025 report said the team was created to handle video, audio, text and image assets that do not fit neatly into traditional analytics tables. (infoq.com)

The search writeup also fits a wider push to make video searchable the way text and images already are. LanceDB’s November 2025 newsletter, citing a joint talk with Netflix, said the system supported text-to-text, text-to-image, image-to-image and image-to-text retrieval across hundreds of terabytes with very low latency. (lancedb.com)

Netflix’s own research and engineering pages have recently added more media-understanding work, including MediaFM and the April 2026 search post. Taken together, those posts show a company building search around the timeline inside the video, not just the title on top of it. (research.netflix.com, medium.com)
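
For readers who want a concrete picture of the one-second buckets described above, the idea can be sketched in a few lines of Python. This is an illustration only, not Netflix’s implementation: the `Annotation` shape, the `bucketize` and `find_moments` helpers, and the sample labels are all assumptions made up for this example, and an in-memory dictionary stands in for the Cassandra and Elasticsearch layers the post describes.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One model output covering a time span (hypothetical shape)."""
    source: str     # which model produced it, e.g. "character" or "scene"
    label: str      # what the model detected
    start_s: float  # inclusive start, in seconds
    end_s: float    # exclusive end, in seconds


def bucketize(annotations):
    """Map each one-second bucket index to the (source, label) pairs active in it."""
    buckets = defaultdict(set)
    for ann in annotations:
        first = math.floor(ann.start_s)
        last = math.ceil(ann.end_s)  # exclusive upper bound on the bucket index
        for second in range(first, last):
            buckets[second].add((ann.source, ann.label))
    return buckets


def find_moments(buckets, clues):
    """Return the sorted bucket indices in which every requested clue is present."""
    wanted = set(clues)
    return sorted(sec for sec, tags in buckets.items() if wanted <= tags)


# Made-up outputs from three separate models, aligned on one timeline.
annotations = [
    Annotation("character", "alice", 12.0, 15.5),
    Annotation("scene", "kitchen", 10.0, 14.0),
    Annotation("dialogue", "coffee", 13.2, 13.9),
]
buckets = bucketize(annotations)

# Seconds in which the character and the scene overlap.
both = find_moments(buckets, [("character", "alice"), ("scene", "kitchen")])

# Adding a dialogue clue narrows the match to a single second.
all_three = find_moments(
    buckets,
    [("character", "alice"), ("scene", "kitchen"), ("dialogue", "coffee")],
)
```

Because every model writes into the same fixed-width slots, combining clues becomes a set intersection per bucket, with no need to compare arbitrary time ranges at query time. In this sketch the character and scene overlap in seconds 12 and 13, and only second 13 also contains the dialogue clue.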
