New AI Memory Gives Robots 15-Min Context
The Physical Intelligence team has unveiled MEM, a multi-scale memory system for robots that gives foundation models a 15-minute context window. The system allows robots to remember and reason across complex, long-horizon tasks, narrowing a key gap between machine and human learning in physical environments.
MEM tackles a core bottleneck in robotics: the "goldfish effect," where robots, particularly Vision-Language-Action (VLA) models, operate with only a few seconds of history. This limitation has historically confined even advanced systems to short, simple tasks, as they lack the context to handle multi-stage operations like cleaning a kitchen or preparing a recipe.
The system's innovation lies in its dual-scale memory architecture, which mimics human-like memory by separating short-term visual data from long-term semantic understanding. For immediate actions requiring fine-grained spatial awareness, like adjusting a grip, MEM uses an efficient video encoder to process recent visual frames. This avoids the high computational cost of feeding minutes of video into the model's context window.
For long-horizon context, MEM summarizes events into a natural language "narrative." Instead of storing every visual frame of a refrigerator door opening, it creates a text-based note like "I opened the fridge door." This chain-of-thought process allows the robot to track its progress over tasks lasting up to 15 minutes, a significant leap for VLA models.
Developed by a team from Physical Intelligence, Stanford, UC Berkeley, and MIT, MEM is integrated into the π0.6 VLA, which is built upon a pre-trained Gemma 3-4B model. This foundation model was pre-trained on a diverse mix of robot demonstrations, vision-language tasks, and internet video data, providing a rich base for the memory system.
This architecture directly addresses "causal confusion," a common failure mode where a robot erroneously repeats past actions simply because they are in its recent history. By distinguishing between immediate visual cues and a longer-term task summary, the system can adapt its strategy based on recent failures. This resulted in a 62% success rate increase in opening refrigerators with unknown hinge directions during evaluations.
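To make the dual-scale idea concrete, here is a minimal Python sketch of a two-tier memory: a small rolling buffer of recent frames alongside a growing list of text notes. It is an illustration under stated assumptions, not the team's implementation. The class `DualScaleMemory`, its methods, and the window size are hypothetical; plain strings stand in for camera frames; and the notes are written by hand here, whereas MEM is described as generating its narrative itself. The second note is an invented example.

```python
from collections import deque


class DualScaleMemory:
    """Toy two-tier memory (hypothetical, for illustration only)."""

    def __init__(self, max_frames=4):
        # Short-term tier: keeps only the most recent frames.
        # Older frames are dropped automatically once the buffer is full.
        self.recent_frames = deque(maxlen=max_frames)
        # Long-term tier: compact text notes about what has happened so far.
        self.narrative = []

    def observe(self, frame):
        """Store a new frame in the short-term buffer."""
        self.recent_frames.append(frame)

    def summarize(self, event):
        """Append a text note to the long-term narrative."""
        self.narrative.append(event)

    def context(self):
        """Return what a policy would see: recent frames plus the story so far."""
        return list(self.recent_frames), " ".join(self.narrative)


memory = DualScaleMemory(max_frames=3)
for step in range(10):
    memory.observe(f"frame_{step}")  # strings stand in for camera images

memory.summarize("I opened the fridge door.")
memory.summarize("I took out the milk.")  # invented example note

frames, story = memory.context()
print(frames)  # only the last three frames survive
print(story)   # the full history survives as text
```

The point of the split is cost: ten frames were observed, but only three are kept, while the entire task history remains available as a couple of short sentences.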