Model Trained to Drive in SF with One Hour of Footage

Standard Intelligence has demonstrated a model that drives a car in San Francisco after being fine-tuned on only one hour of driving footage. The result was reportedly achieved with an inverse dynamics model that automatically labels large, unlabeled video datasets with the actions that produced them. The technique could accelerate the development of autonomous agents for both physical and computer-based tasks.
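
As a rough illustration of what labeling unlabeled footage with an inverse dynamics model means, here is a minimal toy sketch. It is not Standard Intelligence's implementation: the "frames" are just cursor x-positions, and the names `featurize`, `train_idm`, and `pseudo_label` are hypothetical. The idea shown is only the general one: learn to infer the action between two consecutive frames from a small labeled set, then apply that to footage that has no recorded actions.

```python
from collections import Counter, defaultdict


def featurize(frame_a, frame_b):
    """Toy feature: sign of the change in cursor x-position between two frames."""
    dx = frame_b - frame_a
    return (dx > 0) - (dx < 0)  # -1, 0, or +1


def train_idm(labeled_clips):
    """Fit a toy IDM: for each feature value, keep the most common action."""
    votes = defaultdict(Counter)
    for frames, actions in labeled_clips:
        for (a, b), action in zip(zip(frames, frames[1:]), actions):
            votes[featurize(a, b)][action] += 1
    return {feat: counts.most_common(1)[0][0] for feat, counts in votes.items()}


def pseudo_label(idm, frames, default="none"):
    """Predict one action for each consecutive frame pair of an unlabeled clip."""
    return [idm.get(featurize(a, b), default) for a, b in zip(frames, frames[1:])]


# Small labeled set (assumed data): cursor positions plus the key pressed
# between each pair of frames.
labeled = [
    ([0, 1, 2, 2, 1], ["right", "right", "none", "left"]),
    ([5, 4, 4, 5], ["left", "none", "right"]),
]
idm = train_idm(labeled)

# Unlabeled "video": frames only, no recorded actions.
unlabeled_frames = [10, 11, 11, 10, 9, 10]
labels = pseudo_label(idm, unlabeled_frames)
print(labels)  # ['right', 'none', 'left', 'left', 'right']
```

In a real system the frames would be images and the IDM a learned network, but the data flow is the same: the predicted actions turn raw video into (observation, action) pairs that a policy can be trained on.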

- The model, named FDM-1 (Foundation Driving Model 1), was pretrained on a massive 11-million-hour dataset of screen recordings before being fine-tuned on the one hour of driving footage.
- [Standard Intelligence's](https://vertexaisearch.cloud.google.com/grounding-api-redirect/AUZIYQEjrj2WA-0L8CaMTuVbNL_R5OnYrpTNaO4fnLxPhK1BoNXDwTyykLhOe0D95CtI8t5IvrhS8jjgeFNUj_RVUtY-3nzVj0BxTghdKUzPG725fX9Tz4fj) inverse dynamics model (IDM) was crucial for this process, as it automatically labeled the vast, unlabeled internet video dataset with predicted computer actions like mouse movements and key presses.
- For the driving demonstration, the model controls the vehicle by outputting key presses to a web interface that is a modified version of Openpilot's "joystick mode".
- This approach of pre-training on general computer usage data before fine-tuning on a specific task like driving is what allowed the model to achieve 50% accuracy on key press prediction from the start, significantly outperforming a model trained from scratch.
- An innovative video encoder was developed that can compress nearly two hours of 30 FPS video into just 1 million tokens, a roughly 50x improvement in efficiency that enables training on such long-context video data.
- Standard Intelligence, the company behind this breakthrough, was founded in 2017 and previously focused on AI-powered autonomous checkout for retail stores, having raised $236 million in funding.
- The company reached a $1 billion valuation back in February 2021 after a $150 million Series C funding round led by SoftBank Vision Fund 2.
- Inverse dynamics, the core technique used, is a method common in robotics that calculates the forces or torques required to produce a desired motion, and is now being applied to learn from observation in machine learning.
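
The video-encoder figures above imply a rough per-frame token budget, which the following back-of-the-envelope calculation works out. It takes "nearly two hours" as exactly two hours (an assumption that slightly understates the per-frame budget) and reads the "roughly 50x" claim as a ratio of tokens per frame against an unspecified baseline, which is an interpretation rather than something the source states.

```python
FPS = 30                  # frame rate quoted in the text
HOURS = 2                 # assumption: "nearly two hours" rounded up to 2
TOKEN_BUDGET = 1_000_000  # context size quoted in the text
CLAIMED_GAIN = 50         # "roughly 50x" efficiency improvement

# Total frames in the clip: hours -> seconds -> frames.
frames = HOURS * 3600 * FPS

# Average number of tokens available to represent each frame.
tokens_per_frame = TOKEN_BUDGET / frames

# Under the assumed reading of the 50x claim, a baseline encoder would
# spend about 50 times as many tokens on each frame.
baseline_tokens_per_frame = tokens_per_frame * CLAIMED_GAIN

# Video length the same token budget would cover at the baseline rate.
baseline_minutes = HOURS * 60 / CLAIMED_GAIN

print(f"frames: {frames}")                                            # frames: 216000
print(f"tokens per frame: {tokens_per_frame:.2f}")                    # tokens per frame: 4.63
print(f"baseline tokens per frame: {baseline_tokens_per_frame:.0f}")  # baseline tokens per frame: 231
print(f"baseline coverage: {baseline_minutes:.1f} min")               # baseline coverage: 2.4 min
```

In other words, the quoted numbers work out to fewer than five tokens per frame on average, versus a couple of hundred per frame for the implied baseline, which is why hours-long context becomes feasible.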
