MIKASA releases VLA memory benchmark

- Researchers released MIKASA-Robo-VLA v1.0.0 as a benchmark for testing memory-heavy vision-language-action policies on tabletop manipulation tasks and recovery scenarios. (github.com)
- The benchmark defines 90 tasks, 22,500 trajectories and more than 6 million timesteps, with instruction binding, temporal reasoning and recovery as core tests. (github.com)
- The code, evaluation protocol and released datasets are available through the MIKASA-Robo repository and linked Hugging Face collections. (github.com)

MIKASA-Robo-VLA is a new benchmark aimed at a specific weakness in robot foundation models: remembering what happened earlier in an episode and using that memory to act correctly later. The release extends the earlier MIKASA-Robo work, which framed memory as a missing evaluation axis in tabletop manipulation, into a VLA-specific setup with standardized tasks, wrappers, datasets and scoring rules. (github.com)

The immediate point of the release is not to show a new robot policy. It is to make memory failures measurable across models that take images, language and proprioception as input and output actions in simulated manipulation tasks. (github.com)

The repository documentation says the VLA benchmark contains 90 tasks and groups them into canonical horizon splits because evaluating all tasks together is difficult when episode lengths vary widely. (github.com)

### Why does a VLA benchmark need a separate memory test?

Tabletop manipulation often looks simple in demos because the robot can solve the next action from the current frame. MIKASA’s authors argue that many real tasks are partially observable instead: the agent may need to remember an earlier instruction, track an object after occlusion, or recover after the scene changes. (arxiv.org)

The original MIKASA paper said standardized memory benchmarks were missing in robotic manipulation even though memory is essential for partial observability and long-horizon behavior. The older MIKASA-Robo suite organized memory challenges into four categories (object memory, spatial memory, sequential memory and memory capacity) and packaged 32 tasks across 12 groups for RL-style evaluation. (github.com)

The VLA release builds on that base but shifts the setup toward image-and-language-conditioned policies and benchmark protocols used by current robot foundation models.

### What is actually being tested?

The repository documentation says MIKASA-Robo-VLA targets memory-intensive manipulation episodes with horizons ranging from 25 to 2,160 simulation steps. That matters because short-horizon tasks can often be solved reactively, while long-horizon tasks force the policy to preserve information over time. (arxiv.org)

The benchmark materials and release notes point to three recurring stress points: binding the right instruction to the right object, updating beliefs about object state as the scene changes, and using temporal context for delayed actions or recovery. In practice, that means tests where the robot must remember a color, shape, order, location or earlier scene configuration rather than just parse the latest camera image. (sites.google.com)

### How is the benchmark packaged for model builders?

The GitHub documentation says every VLA environment must be wrapped with a benchmark-specific helper immediately after environment creation so that inputs and outputs match the released datasets. The benchmark exposes both privileged state observations for oracle-style debugging and RGB observations from two cameras plus proprioception for VLA evaluation. (github.com)

The dataset release is also large enough to matter for pretraining and finetuning. The docs say MIKASA-Robo-VLA ships 22,500 trajectories (250 per task across 90 tasks) and more than 6 million timesteps in NPZ, RLDS and LeRobot-compatible formats. (sites.google.com) Hugging Face pages for the public collection and dataset mirrors show those artifacts were updated in the days around the release.

### What counts as a result on this benchmark?

The evaluation protocol defines `success_once` as the main metric and says users should report results by explicit horizon split (short, medium or long) rather than rely on automatic checkpoint detection. The docs also specify per-task JSON outputs and a summary file, which are meant to make comparisons reproducible across papers and codebases. (github.com)

That structure is part of the benchmark’s main contribution. Instead of a one-off environment list, MIKASA-Robo-VLA provides a fixed protocol, released data, wrapper stack and reporting format that other labs can reuse when they claim a model has memory or temporal reasoning ability. (github.com)

### Where does this sit in the broader MIKASA project?

The MIKASA project began as a broader “Memory-Intensive Skills Assessment Suite for Agents,” with MIKASA-Base for general memory RL and MIKASA-Robo for tabletop manipulation. The arXiv record shows the main paper was first submitted on Feb. 14, 2025 and revised on March 4, 2026, while the repository now labels MIKASA-Robo as an ICLR 2026 benchmark and notes that the default codebase is moving toward the VLA benchmark. (github.com)

The next step for users is concrete: install the `mikasa-robo-suite` package, create one of the VLA environments, apply the required wrappers and evaluate by split using the published protocol and released datasets. The codebase, environment list and dataset links are already public in the MIKASA-Robo repository and associated Hugging Face collection. (github.com 1) (github.com 2) (arxiv.org)
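
To illustrate what reporting `success_once` by horizon split could look like, here is a minimal Python sketch. It does not use the benchmark's own code. The record fields (`task`, `split`, `episodes`), the task names and the choice to average per-task rates within each split are all assumptions made for the example, not the published JSON schema or aggregation rule. The sketch also assumes `success_once` means an episode counts as solved if the success condition held at any timestep.

```python
import json
from collections import defaultdict
from statistics import mean


def success_once(step_success):
    """Return True if the success flag was set at any timestep of an episode.

    Assumed reading of the metric name; check the benchmark docs for the
    authoritative definition.
    """
    return any(step_success)


def summarize_by_split(task_results):
    """Average per-task success_once rates within each horizon split.

    The input layout is hypothetical: one dict per task, carrying a split
    label and a list of episodes, each a list of per-step success flags.
    """
    rates_by_split = defaultdict(list)
    for record in task_results:
        task_rate = mean(
            1.0 if success_once(episode) else 0.0
            for episode in record["episodes"]
        )
        rates_by_split[record["split"]].append(task_rate)
    return {split: mean(rates) for split, rates in rates_by_split.items()}


# Toy data with made-up task names, used only to exercise the functions.
example_results = [
    {"task": "task_a", "split": "short",
     "episodes": [[False, True], [False, False]]},
    {"task": "task_b", "split": "short",
     "episodes": [[True], [True, False]]},
    {"task": "task_c", "split": "long",
     "episodes": [[False, False, False]]},
]

summary = summarize_by_split(example_results)
print(json.dumps(summary, indent=2))
```

Keeping the split label explicit in every record mirrors the protocol's advice to report short, medium and long horizons separately, so a strong score on reactive short tasks cannot mask failures on long ones.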
