CaP‑X: Embodied AI Benchmark

Researchers from NVIDIA, Berkeley, CMU and Stanford unveiled CaP‑X, an open framework and benchmark that stress‑tests coding agents as they write and deploy perception and control code in simulation and on real hardware, with the aim of standardizing embodied AI evaluation. The effort signals growing demand for reproducible sim‑to‑real benchmarks. (x.com)

The arXiv listing shows the paper, "CaP‑X", as submitted on March 23, 2026, with lead authors Max Fu and Justin Yu and coauthors including Fei‑Fei Li, Jiajun Wu, Shankar Sastry, Ken Goldberg, Yuke Zhu and Linxi "Jim" Fan. (arxiv.org) The released codebase and docs break CaP‑X into four parts: CaP‑Gym (interactive environments), CaP‑Bench (an 8‑tier benchmark, S1–S4 and M1–M4), CaP‑Agent0 (a training‑free agent framework), and CaP‑RL (a reinforcement‑learning extension). The repo enumerates 39 benchmark tasks drawn from Robosuite, LIBERO‑PRO and BEHAVIOR. (github.com)

The authors evaluated 12 frontier models on a focused set of 7 core manipulation tasks. They report that the best model achieves an average success rate of over 30% across tasks, and that models consistently lose performance as the available human‑crafted abstractions are removed. (arxiv.org) CaP‑X identifies three mitigation levers: multi‑turn interaction, structured visual differencing (VDM), and automatic skill synthesis. It uses those levers to build CaP‑Agent0, which the paper says recovers human‑level reliability on several simulated and real robotic manipulation tasks. (arxiv.org) The team also introduces CaP‑RL, a post‑training RL pipeline that uses verifiable environment rewards to improve coding‑agent success rates, and reports sim‑to‑real transfer with a minimal performance gap when it is combined with the CaP‑X tooling. (arxiv.org)

The project is open on GitHub (capgym/cap‑x). Its setup instructions require Python 3.10 and a CUDA‑capable GPU and note simulator‑specific installs (Robosuite, a separate LIBERO virtualenv, and NVIDIA Isaac Sim for BEHAVIOR tasks). The repo also provides scripts for running the 39 tasks and the web UI. (github.com)
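For readers who want to try the repo, the stated requirements (Python 3.10 and a CUDA‑capable GPU) can be checked before installing anything. The snippet below is a minimal sketch and not part of CaP‑X: the function name preflight is hypothetical, and treating the presence of the nvidia-smi command as a sign of a usable NVIDIA GPU is an assumption that gives only a rough signal.

```python
import shutil
import sys


def preflight(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper, not from the CaP-X repository. The arguments exist
    only so the checks can be exercised without changing the real environment.
    """
    problems = []

    # The setup notes call for Python 3.10, so compare major and minor version.
    major, minor = tuple(version_info[:2])
    if (major, minor) != (3, 10):
        problems.append(f"Python 3.10 expected, found {major}.{minor}")

    # Assumption: if the nvidia-smi tool is on PATH, an NVIDIA driver is
    # installed. This does not prove the GPU is CUDA-capable or usable.
    if which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH; no NVIDIA driver detected")

    return problems


if __name__ == "__main__":
    issues = preflight()
    if issues:
        for issue in issues:
            print("WARNING:", issue)
    else:
        print("Basic checks passed.")
```

The check is deliberately conservative: it only reports what it can observe, and the simulator‑specific installs still need to follow the repo's own instructions.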
