AI video tools getting more controllable
Adobe Research presented Vidmento, a tool that fills gaps in footage with generated clips matching style and narrative while keeping creators in control. (x.com) Broader trends toward consistent styles and transcript-driven edits—seen in Runway Gen‑4, Descript and Google Veo 3—show vendors prioritising controllability and predictable outputs. (x.com)
A new crop of artificial intelligence video tools is moving from “make me a clip” to “help me finish a scene I actually need.” (research.adobe.com)
Adobe Research’s Vidmento, published March 31 and accepted to the 2026 Conference on Human Factors in Computing Systems in Barcelona, is built to fill missing shots inside an existing video story instead of starting from a blank prompt. The system analyzes footage a creator already has, identifies narrative gaps, and suggests generated clips that match the surrounding material’s style and story. (research.adobe.com)
The tool uses two linked workspaces: a visual canvas for arranging scenes and a script editor for shaping voiceover and story structure. In a paper posted on arXiv, the authors said they interviewed eight creators before building the system and then studied it with 12 creators. (arxiv.org)
This is a shift in what “control” means in generative video. The problem is less making any moving image than making a specific missing shot that fits between two real shots without breaking continuity. (arxiv.org)
Runway has been pushing the same idea from the model side. Its Gen-4 release says users can keep characters, locations and objects consistent across scenes, use reference images, and regenerate the same subject from different camera positions without extra training. (runwayml.com)
Google has been pushing it from the editing side. In October 2025, Google said Veo 3.1 added “more narrative control,” while Flow added tools including “Frames to Video” for bridging a start image and an end image, “Extend” for continuing a shot, and “Insert” for adding new elements into a scene. (blog.google)
Google also added image-to-video support for Veo 3 and Veo 3 Fast in July 2025, saying creators could start from one image and guide motion, narrative and audio while maintaining consistency from that first frame. The company priced Veo 3 Fast at $0.40 per second with audio and Veo 3 at $0.75 per second with audio in the Gemini Application Programming Interface preview. (developers.googleblog.com)
Descript has made a similar bet on predictability for talking-head and social video. Its editor turns audio or video into a transcript, lets users cut the video by editing text, and offers tools to auto-add B-roll, layouts, captions and regenerated audio-video fixes from typed edits. (descript.com)
Adobe’s version is narrower than a general-purpose video generator and closer to a co-pilot for editors who already have footage. The Vidmento paper describes this as “generative expansion”: using artificial intelligence to add the shot that was never filmed, while keeping the creator in charge of where it goes and what story it serves. (research.adobe.com; arxiv.org)
That is where the market is heading in 2026: not away from generation, but toward tools that can hold a look, follow a script and behave more like an editor’s assistant than a slot machine. (research.adobe.com; runwayml.com; blog.google; descript.com)