CVPR Paper Boosts Stereo Depth Estimation Speed by 10x
A paper accepted to CVPR 2026, Fast-FoundationStereo, presents a method for zero-shot stereo depth estimation that is more than ten times faster than previous approaches. The research aims to make foundational stereo models more practical for real-time applications. Other accepted CVPR papers include a method for sparse multi-view scene editing with text and a benchmark exposing LLM limitations in generating vector graphics.
Fast-FoundationStereo builds on its predecessor, FoundationStereo, an NVIDIA Research model from CVPR 2025 that received a Best Paper Nomination. While the original model set a new standard for zero-shot generalization, its computational demands made it impractical for real-time use.
The 10x speedup is achieved largely through knowledge distillation. The large, computationally heavy "teacher" model, which uses a hybrid backbone including the ViT-based DepthAnything V2, is compressed into a lightweight "student" model with an efficient CNN backbone such as MobileNetV2 or EdgeNeXt.
Beyond distillation, the researchers employed a "divide-and-conquer" optimization strategy. It combines a blockwise neural architecture search, which looks for the most efficient cost-filtering designs under a strict latency budget, with structured pruning, which removes redundant steps in the model's iterative refinement modules.
To improve the distillation process and bridge the sim-to-real gap, the team developed an automatic pipeline that creates pseudo-labels for a new dataset of 1.4 million in-the-wild stereo image pairs, supplementing the synthetic data used to train the original model.
A speedup of this size matters for production environments at companies like Meta or in Google's AR and robotics divisions. Real-time stereo depth is a long-standing bottleneck for applications in autonomous driving, dense 3D mapping, and AR/VR, all of which require both high accuracy and low latency.
The other accepted CVPR papers also highlight key industry challenges. The benchmark for vector graphics generation, for instance, probes a known weakness in LLMs: while they can handle the text-based SVG format, they often fail at the spatial reasoning required to generate complex images accurately.
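For readers who want a concrete picture of the teacher-student distillation described above, here is a minimal sketch of the general idea, not the paper's actual training code. It assumes the teacher's disparity map is used as a pseudo-label and that the student is penalized by the mean absolute difference over pixels marked valid. The function names, the validity mask, and the toy numbers are all hypothetical. The second function is the standard rectified-stereo relation that turns disparity into metric depth.

```python
def distillation_loss(student_disp, teacher_disp, valid_mask):
    """Mean absolute difference between student and teacher disparities.

    Hypothetical helper: the teacher's output acts as a pseudo-label, and
    only pixels flagged in valid_mask contribute to the loss.
    """
    total, count = 0.0, 0
    for s_row, t_row, m_row in zip(student_disp, teacher_disp, valid_mask):
        for s, t, m in zip(s_row, t_row, m_row):
            if m:
                total += abs(s - t)
                count += 1
    # Return 0.0 when no pixel is valid to avoid dividing by zero.
    return total / count if count else 0.0


def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Rectified stereo geometry: depth = focal length * baseline / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


if __name__ == "__main__":
    # Toy 2x2 disparity maps in pixels (illustrative values only).
    teacher = [[10.0, 20.0], [30.0, 40.0]]
    student = [[11.0, 18.0], [30.0, 44.0]]
    mask = [[True, True], [True, False]]  # last pixel treated as unreliable

    print("distillation loss:", distillation_loss(student, teacher, mask))
    print("depth in meters:", disparity_to_depth(100.0, 800.0, 0.5))
```

In a real system the loops would be replaced by tensor operations in a deep learning framework, and the loss would typically be applied at several refinement iterations; this version only shows the shape of the computation.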