RobotSeg: first robot segmentation model

Researchers presented RobotSeg, a foundation model and dataset for segmenting robot arms and grippers in images and video, aimed at enabling better visual servoing and safety checks. It was accepted as a CVPR 2026 oral paper, which highlights growing attention to general datasets that directly support manipulation and low‑level perception. Having a standard segmentation model simplifies data augmentation, sim‑to‑real transfer, and visual safety tooling for manipulation stacks. (x.com/MikeShou1/status/2042224186974982229)

Most robot vision systems can tell you where the coffee mug is, but they still struggle with a simpler question: which pixels belong to the robot’s own arm. RobotSeg is a new model built specifically to draw that outline in both images and video, down to the arm, gripper, and whole robot. (github.com, arxiv.org)

That job is called segmentation. It means coloring in every pixel that belongs to one thing, the same way a child might fill in only the fire truck on a coloring page and leave the street blank. (arxiv.org) Robots need that pixel map because a camera-guided controller works from what the camera sees, not from a clean blueprint of the scene. If the system loses track of the robot’s wrist or gripper by even a small patch of pixels, a grasp can drift or a safety check can miss a near-collision. (arxiv.org)

General-purpose segmentation models already work well on people, dogs, and cars, but RobotSeg’s authors say robots are a special failure case. A robot arm can change shape quickly, hide parts behind itself, and blend into factory backgrounds with the same gray metal colors. (github.com, arxiv.org)

The team’s answer was to build both a model and a dataset together. Their video robot segmentation dataset contains 2,812 videos and 138,707 frames with fine-grained masks for robot arm, gripper, and whole robot. (github.com, arxiv.org) Those videos cover 10 robot embodiments, which is a research term for different robot body designs. The list includes Franka, Universal Robots UR5, Kuka iiwa, Sawyer, Hello Robot Stretch, Google Everyday Robot, and several others. (github.com)

RobotSeg is built on Segment Anything Model 2, which is Meta’s video-capable segmentation system, but it adds robot-specific parts. One part tries to remember a robot’s structure across frames, so a gripper stays a gripper even when it swings fast or passes in front of clutter. (github.com, arxiv.org) Another part removes some of the human clicking usually needed to start segmentation. The paper says its prompt generator can produce robot prompts automatically instead of waiting for a person to draw a box or tap points on the image. (github.com, arxiv.org)

The training setup also cuts labeling work. Instead of requiring a hand-drawn mask on every frame of a video, the authors say RobotSeg can learn with only the first frame labeled and then enforce consistency across the rest. (github.com, arxiv.org)

On the benchmark they report state-of-the-art results against robot-focused baselines such as RoVi-Aug and RoboEngine, and they say the model also beats several language-conditioned segmentation systems and Segment Anything Model 2.1 under multiple prompt settings. The GitHub page lists the model at 41.3 million parameters and over 10 frames per second at inference. (github.com)

The paper was accepted as an oral at the 2026 Conference on Computer Vision and Pattern Recognition, a slot usually reserved for a small share of submissions. That says something about where robot perception research is moving: not just bigger action models, but cleaner low-level tools that let robots see their own hands reliably before they try to use them. (x.com, openreview.net)
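
For readers who want the pixel-map idea in concrete terms, here is a minimal Python sketch. It is not RobotSeg's code, data format, or evaluation script: the toy 4x6 masks and the helper name `iou` are made up for illustration, and intersection-over-union is used only as a common way to score mask overlap. The last lines work through the dataset figures quoted above, assuming exactly one labeled frame per video in the first-frame-only setup.

```python
# Illustrative sketch only: toy masks, not RobotSeg's data format or API.

def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks (lists of rows of 0/1)."""
    pairs = [(a, b) for row_a, row_b in zip(mask_a, mask_b)
             for a, b in zip(row_a, row_b)]
    inter = sum(a & b for a, b in pairs)  # pixels marked in both masks
    union = sum(a | b for a, b in pairs)  # pixels marked in either mask
    return inter / union if union else 1.0

# A 4x6 toy frame: 1 marks pixels that belong to the gripper.
true_gripper = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# A hypothetical prediction that misses the gripper's right-hand column.
pred_gripper = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# 6 shared pixels out of 8 in the union.
score = iou(true_gripper, pred_gripper)

# Labeling arithmetic from the dataset figures quoted in the article.
videos, frames = 2812, 138707
avg_frames_per_video = frames / videos  # roughly 49 frames per video
# Assumption: first-frame-only training labels one frame per video.
first_frame_share = videos / frames     # roughly 2% of all frames

print(f"IoU: {score:.2f}")
print(f"Average frames per video: {avg_frames_per_video:.1f}")
print(f"Share of frames labeled: {first_frame_share:.1%}")
```

Losing two pixels out of eight is enough to pull the overlap score down to 0.75 in this toy case, which is the kind of small miss the article describes. On the labeling side, one hand-drawn mask per video would mean labeling about 2% of the dataset's frames instead of all of them.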
