INSID3 accepted as CVPR Oral

INSID3, a training-free in-context segmentation method that uses frozen DINOv3 features, was accepted as an Oral at CVPR 2026, and the author shared the paper and GitHub links. The authors claim the method enables segmentation from in-context cues without additional training. (x.com/gabTrivv/status/2043690470761542113)

Image segmentation is the task of coloring in the exact pixels that belong to an object, like tracing every pixel of a dog instead of just drawing a box around it. INSID3, a new paper accepted as an oral presentation at the 2026 Conference on Computer Vision and Pattern Recognition (CVPR), says it can do that from examples without any extra training. (arxiv.org) (github.com)

The setup is close to a visual prompt: mark one or a few examples of a category in a reference image, then ask the system to find the matching region in a new image. The authors say INSID3 can segment objects, object parts, and personalized instances using a single frozen DINOv3 backbone. (visinf.github.io) (arxiv.org)

A frozen backbone is a pre-trained vision model whose weights are left unchanged, like using a camera lens as-is instead of rebuilding it for each task. The paper argues that scaled-up dense features from DINOv3 already contain enough spatial structure and semantic correspondence to support segmentation directly. (arxiv.org) That is the main claim in INSID3: no fine-tuning, no segmentation decoder, and no auxiliary models. The project page says the method works directly from DINOv3 features and clusters them into coherent object and part regions. (visinf.github.io) (github.com)

The authors also say they found a positional bias in DINOv3 features, meaning the model can react not just to what an object is but to where it sits in the image. Their fix is a training-free projection that removes a low-dimensional positional component before matching reference and target regions. (visinf.github.io) (github.com)

On benchmarks, the paper reports state-of-the-art results across one-shot semantic, part, and personalized segmentation, with a gain of 7.5 points in mean Intersection over Union over prior work. It also says the method uses three times fewer parameters than earlier systems it compares against. (arxiv.org) (catalyzex.com) The project page adds a speed comparison: 3.31 frames per second for INSID3, versus 0.97 for GF-SAM and 0.11 for Matcher in the authors’ tests. Those figures come from the authors’ own evaluation materials, not an independent benchmark. (visinf.github.io)

The paper lists Claudia Cuttano, Gabriele Trivigno, Christoph Reich, Daniel Cremers, Carlo Masone, and Stefan Roth as authors, and arXiv shows it was submitted on March 30, 2026. The public code repository is live on GitHub under visinf/INSID3, and the README labels it the official repository for a “CVPR 2026 Oral” paper. (arxiv.org) (github.com)

Oral slots at CVPR go to a small share of accepted papers, which makes the format itself a signal that program chairs ranked the work near the top of the conference. Public CVPR 2026 listings were available, but the oral designation for INSID3 was easier to verify from the authors’ repository than from a searchable conference page. (github.com) (cvpr.thecvf.com)
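
For readers who want a feel for the matching step described above, the idea of removing a positional component and then comparing reference and target features can be sketched in plain Python. This is a toy illustration, not the authors' implementation. It makes several assumptions that do not come from the paper: the positional directions are simply handed in as input, the hand-made 3-number vectors stand in for real DINOv3 patch features, a cosine-similarity threshold against a single averaged prototype replaces the clustering the project page describes, and every function name here is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(directions):
    """Gram-Schmidt: turn direction vectors into an orthonormal basis."""
    basis = []
    for v in directions:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > 1e-12:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def remove_subspace(feature, basis):
    """Subtract the feature's projection onto each orthonormal direction."""
    f = list(feature)
    for u in basis:
        c = dot(f, u)
        f = [fi - c * ui for fi, ui in zip(f, u)]
    return f

def masked_mean(features, mask):
    """Average the features of patches marked True (needs at least one)."""
    picked = [f for f, m in zip(features, mask) if m]
    return [sum(col) / len(picked) for col in zip(*picked)]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0

def segment(ref_feats, ref_mask, tgt_feats, positional_dirs, threshold=0.5):
    """Label target patches whose cleaned feature matches the reference prototype.

    positional_dirs is an assumed input; how such directions would be
    estimated for a real backbone is outside this sketch.
    """
    basis = orthonormalize(positional_dirs)
    ref = [remove_subspace(f, basis) for f in ref_feats]
    tgt = [remove_subspace(f, basis) for f in tgt_feats]
    prototype = masked_mean(ref, ref_mask)
    return [cosine(f, prototype) > threshold for f in tgt]

def iou(pred, truth):
    """Intersection over Union for two boolean patch masks."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return inter / union if union else 1.0

# Toy patches: [object-ness, background-ness, position]. The third number
# plays the role of the positional component.
ref_feats = [[1.0, 0.0, 0.1], [1.0, 0.0, 0.2], [0.0, 1.0, 0.8], [0.0, 1.0, 0.9]]
ref_mask = [True, True, False, False]
tgt_feats = [[0.0, 1.0, 0.1], [0.0, 1.0, 0.2], [1.0, 0.0, 3.0], [1.0, 0.0, 3.5]]
truth = [False, False, True, True]

with_fix = segment(ref_feats, ref_mask, tgt_feats, [[0.0, 0.0, 1.0]])
without_fix = segment(ref_feats, ref_mask, tgt_feats, [])
print(with_fix, iou(with_fix, truth))        # [False, False, True, True] 1.0
print(without_fix, iou(without_fix, truth))  # [False, False, False, False] 0.0
```

In this toy case the object sits at a very different position in the target image, so the raw features fail to match until the positional direction is projected out. Mean Intersection over Union, the metric quoted in the benchmark figures, averages a score like `iou` across classes or test images.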
