Netflix multimodal search pipeline

Netflix's multimodal video search reportedly processes more than 216 million frames from about 2,000 hours of content using a decoupled three-stage pipeline that targets sub-second latency across billions of records. The case study describes a design that separates ingestion, fusion, and querying so that latency stays low as the video corpus grows. (x.com)
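As a rough sanity check on those headline figures, the short calculation below shows the frame rate that 216 million frames over 2,000 hours would imply. It is a back-of-the-envelope sketch that assumes a uniform frame rate across the archive, which the source does not state, and the variable names are illustrative only.

```python
# Reported figures from the case study (both described as approximate).
HOURS_OF_CONTENT = 2_000
REPORTED_FRAMES = 216_000_000

# Assumption: every hour of footage is sampled at the same frame rate.
total_seconds = HOURS_OF_CONTENT * 3600
implied_fps = REPORTED_FRAMES / total_seconds

print(f"seconds of footage: {total_seconds:,}")
print(f"implied frames per second: {implied_fps:.1f}")
```

Under that assumption the figures are consistent with roughly 30 frames per second, and a 2,000-hour archive spans 7.2 million seconds of footage.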

Searching video with artificial intelligence means turning frames, dialogue, and scene labels into data that a computer can query. Netflix said on April 3 that its video-search system is built to answer those queries in under a second across billions of records. (medium.com) (tool.lu)

The company said a 2,000-hour production archive can contain more than 216 million frames before any machine-learning processing begins. It said those frames become billions of data points after multiple models analyze characters, scenes, dialogue, labels, and embeddings, which are numerical fingerprints used to compare meaning. (tool.lu)

Netflix said the system uses a decoupled three-stage pipeline instead of doing every step in one pass. Raw model output is first written into an annotation service backed by Apache Cassandra, a distributed database built for heavy write traffic. (entertainer.news)

A second stage runs offline after Apache Kafka publishes an event, so the expensive work happens away from the live ingestion path. Netflix said that job aligns overlapping model outputs into one-second buckets and fuses matches like a character detection and a kitchen scene into a single record for that second. (entertainer.news)

That design addresses a basic video problem: different models look at different slices of time and produce different kinds of output. Netflix said critical moments can be missed at scene boundaries unless those disjointed timelines are synchronized into one chronological map. (tool.lu)

The company placed this work inside a broader push to organize media as machine-learning data rather than just files in storage. In February, Netflix described “MediaFM,” its multimodal foundation for media understanding, and in August 2025 it described a Media Data Lake built to store metadata, embeddings, and raw assets together. (medium.com) (lancedb.com)

That broader system reflects the scale Netflix says its studio operation already handles. In a 2024 engineering post, the company said hundreds of Netflix studios generate about 2 petabytes of data per week across text, images, image sequences, and large media files. (medium.com)

Netflix’s latest case study focuses on internal video search for creative and editorial work, not the consumer search box subscribers use to find shows. The company said the goal is to surface moments buried inside raw footage quickly enough that editors and filmmakers do not lose time scanning material by hand. (tool.lu)

The thread running through all of it is separation: write fast, fuse later, query fast again. Netflix’s account argues that splitting those jobs is how a search system keeps sub-second responses even when every useful moment starts as one frame in a very large pile of video. (entertainer.news) (tool.lu)
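
The one-second bucketing and fusion step described above can be illustrated with a minimal sketch. This is not Netflix's implementation, which the source does not detail: the `Annotation` type, the `fuse` function, the model names, and the labels are all hypothetical, and plain in-memory Python stands in for the Cassandra-backed annotation service and the Kafka-triggered offline job.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One raw model output covering a time span (hypothetical schema)."""
    model: str    # which model produced it, e.g. "character" or "scene"
    label: str    # what the model detected
    start: float  # span start in seconds, inclusive
    end: float    # span end in seconds, exclusive


def bucket_range(start: float, end: float) -> range:
    """Return the one-second buckets that a [start, end) span touches."""
    first = math.floor(start)
    # A zero-length span still lands in the bucket where it starts.
    last = max(first, math.ceil(end) - 1)
    return range(first, last + 1)


def fuse(annotations):
    """Group labels from all models into one record per one-second bucket."""
    buckets = defaultdict(dict)
    for ann in annotations:
        for second in bucket_range(ann.start, ann.end):
            buckets[second].setdefault(ann.model, set()).add(ann.label)
    # Sort for stable, readable output.
    return {
        second: {model: sorted(labels) for model, labels in models.items()}
        for second, models in sorted(buckets.items())
    }


# Two models with different, overlapping timelines (made-up data).
annotations = [
    Annotation("character", "character_A", 2.0, 5.0),
    Annotation("scene", "kitchen", 4.0, 8.0),
]
fused = fuse(annotations)

for second, record in fused.items():
    print(second, record)
```

In this toy input only second 4 carries both the character detection and the kitchen scene, so a query for that character in a kitchen can be answered from a single fused record instead of intersecting two model timelines at query time.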
