OpenAI and Meta Push Text-to-Video Generation Forward

The field of text-to-video synthesis is advancing rapidly, with OpenAI's Sora 2 and Meta's generative video research at the forefront. Industry analysis notes that Sora 2 demonstrates significant improvements in video length and scene consistency. Meanwhile, Meta's more open research approach is driving wider innovation across the academic and developer communities.

- OpenAI's Sora 2, which began its rollout in the U.S. and Canada on September 30, 2025, can generate videos with synchronized dialogue, sound effects, and background audio. The model is designed to simulate physical interactions like gravity and collisions for more realistic motion.
- Sora 2 is accessible via iOS and Android apps and is capable of producing videos up to two minutes long with resolutions as high as 4K. Its API also supports 720p and 1080p resolutions, and it can generate video from both text and still image inputs.
- Meta's "Movie Gen," announced in October 2024, consists of a family of models, including a 30-billion-parameter transformer for video and a 13-billion-parameter model for audio. This system was trained on a dataset of over 100 million video-text pairs and 1 billion image-text pairs.
- The Movie Gen model can generate 1080p videos up to 16 seconds long at 16 frames per second and also performs instruction-based video editing, such as changing backgrounds or modifying styles.
- Unlike some closed models, Meta has a history of open-sourcing its AI research, such as the models in its "Make-A-Scene" and "Make-A-Video" projects, which helps accelerate development across the academic and open-source communities.
- The text-to-video space also includes significant models from other major tech companies, such as Google's Veo and a growing number of open-source alternatives. Researchers are focused on overcoming key challenges like maintaining temporal consistency across longer videos and managing the high computational cost of generation.
- A core technology used in these advanced models is the transformer architecture, which allows the system to understand long-range dependencies in data, crucial for creating coherent video sequences.
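
To give a rough sense of the computational cost mentioned above, the following sketch works out how much raw output the quoted Movie Gen figures imply (16 seconds at 16 frames per second, 1080p). It is a back-of-the-envelope illustration, not a description of how any of these models work internally: the helper name `raw_clip_stats`, the 1920x1080 frame size for "1080p," and the three color channels are assumptions made here. Models of this kind typically operate on a compressed representation rather than raw pixels, so the numbers only indicate the scale of the final decoded clip.

```python
def raw_clip_stats(seconds, fps, width, height, channels=3):
    """Return frame and value counts for an uncompressed video clip.

    Hypothetical helper for illustration; assumes every frame is stored
    as width x height pixels with `channels` color values per pixel.
    """
    frames = seconds * fps
    pixels_per_frame = width * height
    total_values = frames * pixels_per_frame * channels
    return {
        "frames": frames,
        "pixels_per_frame": pixels_per_frame,
        "total_values": total_values,
    }


# Figures quoted for Movie Gen: 16 s at 16 fps, 1080p (assumed 1920x1080).
stats = raw_clip_stats(seconds=16, fps=16, width=1920, height=1080)
print(stats["frames"])            # 256
print(stats["pixels_per_frame"])  # 2073600
print(stats["total_values"])      # 1592524800
```

Even a 16-second clip therefore corresponds to 256 frames and roughly 1.6 billion color values, all of which must stay consistent with one another over time. That is one way to see why temporal consistency and compute cost become harder as clips get longer.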

