Stability AI Releases Open-Source Video Generator
Stability AI has launched Stable Video Diffusion, an open-source model for video generation. In its initial release, the tool turns a single still image into a short video clip and allows for customizable frame rates from 3 to 30 fps. The model is positioned as a flexible, rapid-prototyping alternative to proprietary tools like OpenAI's Sora and ByteDance's Seedance.
- The initial release consists of two image-to-video models: SVD, which generates 14 frames, and SVD-XT, which is fine-tuned to generate 25 frames. The code was made available on GitHub, with weights accessible on Hugging Face for research-only purposes.
- To train the model, Stability AI developed a "Large Video Dataset" (LVD) containing 580 million video clips, which amounts to a total runtime of 212 years. A systematic curation process involving synthetic captioning and aesthetic scoring was used to prepare the data.
- The model extends the Stable Diffusion 2.1 architecture with temporal convolution and attention layers that model the time dimension in videos. This allows it to generate video from a single conditioning image.
- While OpenAI's Sora can generate videos up to 60 seconds long at 1080p resolution, Stable Video Diffusion's output is shorter, around 4 seconds, at a resolution of 576x1024 pixels. SVD's key differentiator is its open-source availability, which allows for local execution and fine-tuning.
- User controls offer a degree of creative input beyond the input image. Key parameters include `motion_bucket_id`, which adjusts the intensity of motion, and `noise_aug_strength`, which influences how closely the video output adheres to the initial conditioning image.
- In early user preference studies, Stable Video Diffusion was preferred over closed models like Runway Gen-2 and Pika in terms of overall video quality.
- The model serves as a base for more advanced applications; Stability AI has since released Stable Video 4D, which builds on SVD to generate 3D video with multiple camera angles from a single video input.
- Beyond video generation, the model is designed to support multi-view synthesis from a single image, a feature that can be enhanced by fine-tuning on specific multi-view datasets.
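
For readers who want to try the controls described above, here is a minimal sketch of how they might be set in practice. It assumes the Hugging Face `diffusers` library and its `StableVideoDiffusionPipeline`, the `stabilityai/stable-video-diffusion-img2vid-xt` checkpoint, and a CUDA GPU, none of which are named in the announcement itself. The helper names `clip_seconds` and `generate_clip` and the default values shown are illustrative choices, not Stability AI's reference settings.

```python
# Frame counts quoted in the article for the two released checkpoints.
FRAMES = {"SVD": 14, "SVD-XT": 25}


def clip_seconds(num_frames: int, fps: float) -> float:
    """Playback length in seconds of num_frames frames shown at fps."""
    # The article quotes a customizable range of 3 to 30 fps.
    if not 3 <= fps <= 30:
        raise ValueError("fps is expected to be between 3 and 30")
    return num_frames / fps


def generate_clip(image_path: str, out_path: str = "clip.mp4",
                  motion_bucket_id: int = 127,
                  noise_aug_strength: float = 0.02,
                  fps: int = 6) -> str:
    """Animate one still image with SVD-XT (hypothetical helper)."""
    # Heavy imports stay local so clip_seconds() works without a GPU stack.
    # Assumption: torch and diffusers are installed and a CUDA device exists.
    import torch
    from diffusers import StableVideoDiffusionPipeline
    from diffusers.utils import export_to_video, load_image

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        torch_dtype=torch.float16,
        variant="fp16",
    )
    pipe.to("cuda")

    # Resize the conditioning image to the 576x1024 output resolution
    # (PIL takes width first, then height).
    image = load_image(image_path).resize((1024, 576))

    frames = pipe(
        image,
        num_frames=FRAMES["SVD-XT"],
        fps=fps,
        motion_bucket_id=motion_bucket_id,      # higher values, more motion
        noise_aug_strength=noise_aug_strength,  # higher values, looser match to the image
        decode_chunk_size=8,                    # decode in chunks to limit memory use
    ).frames[0]

    export_to_video(frames, out_path, fps=fps)
    return out_path
```

With these numbers, `clip_seconds(FRAMES["SVD-XT"], 6)` works out to a little over four seconds, in line with the clip length quoted above, while the 14-frame SVD model at the same frame rate yields a bit more than two seconds. Raising `motion_bucket_id` or `noise_aug_strength` in `generate_clip` is the quickest way to explore how the two controls change the result.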