AI Research Pushes Long-Form Video Understanding

Researchers are scaling video AI models to understand context over 10+ minutes, aided by a new dataset and benchmark suite called "Very Big Video Reasoning." Related work uses hierarchical compression to shift the model's focus from simple perception to memory and abstraction, a crucial step for analyzing long-form video content in news archives.

The new Very Big Video Reasoning (VBVR) suite is a significant leap in training data, containing over one million video clips and two million images across 200 different reasoning tasks. This is roughly 1,000 times larger than previous video reasoning datasets, which have been a major bottleneck in developing AI that can genuinely comprehend video content beyond simple object recognition. The dataset was a collaborative effort by over 50 researchers from institutions like Berkeley, Stanford, and Oxford. On the accompanying VBVR-Bench, an open-source Wan2.2 model fine-tuned on this new dataset saw its reasoning performance improve by 84.6% relative to its baseline. This fine-tuned model achieved an overall score of 0.685, outperforming proprietary models like Sora 2 (0.546) and Veo 3.1 (0.480). However, there is still a significant gap to close, as human performance on the same benchmark is 0.974.

For newsrooms, this leap in reasoning could transform video archives from passive storage into queryable databases. Instead of simple keyword searches, a journalist could ask, "Show me every instance in our archive where a local official appeared at a construction site and mentioned 'budget overruns,' and create a transcript." Another query could be, "Analyze all our B-roll footage from the last five years of downtown protests and identify any recurring symbols or banners."

Architectures are also evolving beyond brute-force processing. The VideoAgent model, for example, uses a Large Language Model as a central 'agent' to decide what information is needed from a video, then uses vision tools to retrieve only those specific frames. On the challenging EgoSchema benchmark, VideoAgent achieved 54.1% accuracy using an average of only 8.4 frames, demonstrating significant efficiency gains over processing entire video streams. This approach reduces computational load and mirrors human-like selective attention.
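The selective-retrieval idea described above can be illustrated with a toy Python sketch. This is not VideoAgent's implementation. The function names, the keyword-based stopping rule, and the gap-midpoint probing strategy are all assumptions made for illustration, with plain caption strings standing in for video frames, a captioning model, and the language model's judgment.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentState:
    question: str
    seen: dict[int, str] = field(default_factory=dict)  # frame index -> caption


def caption_frame(video: list[str], idx: int) -> str:
    """Hypothetical stand-in for a vision captioner: here a frame IS its caption."""
    return video[idx]


def is_confident(state: AgentState, keywords: set[str]) -> bool:
    """Hypothetical stand-in for the LLM's self-check: stop once all keywords appear."""
    text = " ".join(state.seen.values()).lower()
    return all(k in text for k in keywords)


def pick_next_frames(state: AgentState, k: int = 2) -> list[int]:
    """Hypothetical stand-in for guided retrieval: probe midpoints of the widest unseen gaps."""
    seen = sorted(state.seen)
    gaps = [(b - a, a, b) for a, b in zip(seen, seen[1:]) if b - a > 1]
    gaps.sort(reverse=True)
    return [(a + b) // 2 for _, a, b in gaps[:k]]


def answer_with_few_frames(video: list[str], question: str,
                           keywords: set[str], max_rounds: int = 5) -> AgentState:
    state = AgentState(question)
    n = len(video)
    # Round 0: a coarse uniform sample (first, middle, last frame).
    for idx in (0, n // 2, n - 1):
        state.seen[idx] = caption_frame(video, idx)
    # Later rounds: fetch more frames only while the evidence is insufficient.
    for _ in range(max_rounds):
        if is_confident(state, keywords):
            break
        nxt = pick_next_frames(state)
        if not nxt:
            break
        for idx in nxt:
            state.seen[idx] = caption_frame(video, idx)
    return state


# Toy 33-frame "video" with two relevant moments.
video = ["empty street"] * 33
video[8] = "official at podium"
video[24] = "crane at construction site"

state = answer_with_few_frames(video, "Did an official visit the site?",
                               {"official", "crane"})
print(sorted(state.seen))  # [0, 8, 16, 24, 32]
print(f"{len(state.seen)} of {len(video)} frames examined")
```

In this toy run the loop stops after inspecting 5 of 33 frames. A real system would replace the keyword check with a language model's assessment and the midpoint probe with a learned or similarity-based retriever.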

From an infrastructure perspective, these models demand robust server configurations. A typical setup for training involves multi-GPU servers with enterprise-grade CPUs like Intel Xeon or AMD EPYC to prevent data pipeline bottlenecks. Each server would ideally house four to eight high-performance GPUs, such as NVIDIA's H100, each with at least 80GB of VRAM. System RAM should be at least double the total GPU VRAM, meaning a server with four 80GB GPUs (320GB of VRAM in total) would require a minimum of 640GB of system RAM.

The underlying technology is also shifting. State-space models (SSMs) like Mamba are being adapted for video (VideoMamba) to handle long-range dependencies more efficiently: their cost grows linearly with sequence length, a significant advantage over the quadratic complexity of traditional transformer models. Hierarchical compression techniques are also being used to create multi-scale representations of video frames, allowing for faster processing and lower memory usage without the need for traditional, computationally expensive motion estimation.

However, significant limitations in AI reasoning persist. A recent Apple study highlighted that even the most advanced models struggle with true causal understanding and can be easily misled by irrelevant information, often relying on pattern matching rather than genuine comprehension. Research from 2025 also shows that as task difficulty increases, the ability of Large Reasoning Models to follow specific instructions within their reasoning process declines significantly, with some models failing more than 75% of the time.

Looking ahead, the next frontier is unifying video understanding and generation within a single multimodal model. This would enable interactive video analysis where a news organization could not only query its archives but also generate new content based on the findings. For instance, a user could ask the system to "find all clips of the mayor's press conferences on the new stadium and create a 2-minute summary video with a voiceover highlighting key promises."
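As a back-of-the-envelope check on the server sizing rule mentioned earlier (system RAM of at least double the total GPU VRAM), the arithmetic can be written out in a few lines of Python. The function name and its parameters are hypothetical. Only the 2x ratio and the 80GB-per-GPU figure come from the text above, and real requirements will vary with the workload.

```python
def min_system_ram_gb(num_gpus: int, vram_per_gpu_gb: int, ratio: float = 2.0) -> float:
    """Rule-of-thumb minimum system RAM: ratio x total GPU VRAM (ratio=2 per the article)."""
    if num_gpus <= 0 or vram_per_gpu_gb <= 0:
        raise ValueError("GPU count and VRAM per GPU must be positive")
    total_vram_gb = num_gpus * vram_per_gpu_gb
    return total_vram_gb * ratio


# The article's range of four to eight 80GB GPUs per server.
for gpus in (4, 8):
    print(f"{gpus} x 80GB GPUs -> at least {min_system_ram_gb(gpus, 80):.0f}GB system RAM")
# 4 x 80GB GPUs -> at least 640GB system RAM
# 8 x 80GB GPUs -> at least 1280GB system RAM
```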
