Roni Rahman’s multimodal demo
Roni Rahman demoed a reference-driven generation system that accepts up to 12 multimodal inputs (images, video, audio, and text) to produce more controllable creative outputs. The approach points toward richer, reference-anchored generation that is useful for product photography, video variants, and matched-tone assets (X/Twitter).
Most image generators still work like giving directions to a stranger over the phone: you type a prompt, and every new render forgets half of the last one. Roni Rahman’s demo showed a different setup, where the model can take as many as 12 references across image, video, audio, and text before it generates anything. (x.com)
A reference is an example the model can draw on without reproducing it exactly. One image can lock a shoe shape, one video can lock camera motion, one audio clip can lock pacing, and one text prompt can say what to change. (x.com)
That solves a specific problem in generative media: consistency. Runway says its Gen-4 system keeps characters and locations stable from a single reference image, and Google says Veo 3.1 can use up to three reference images to preserve a person, character, or product across a video. (runwayml.com) (ai.google.dev) Rahman’s demo pushes that idea further by stacking many references instead of one or three. If a model can read 12 inputs at once, it can treat creative direction less like one sentence and more like a mood board, shot list, and soundtrack delivered together. (x.com)
The practical use case is product photography, where brands need the same bottle, bag, or sneaker shown in dozens of settings without changing its shape or color. Companies already sell AI product-shot tools for apparel, cosmetics, bags, and eyewear because doing that work in real studios is slow and repetitive. (kive.ai)
The next use case is video variants. Google’s reference-image workflow is built around preserving the same subject while changing the scene, and Runway markets the same idea for characters, objects, and environments across many shots. (docs.cloud.google.com) (runwayml.com)
Audio is the piece that makes Rahman’s demo stand out from most public tools. Image and video references are common now, but adding an audio reference means the system can anchor rhythm, mood, or voice-like timing at the same time it anchors visual identity. (x.com) (technologyreview.com)
This is what people mean by multimodal generation. Instead of forcing everything through text, the model reads each medium in its native form, the way a human creative team might use a storyboard, a sample track, a product packshot, and a written brief in the same meeting. (cloud.google.com) (technologyreview.com)
The hard part is not making one pretty frame. The hard part is keeping the same brand tone, object details, and motion style across 20 assets, and reference-driven systems are emerging because prompt-only systems drift too much from one output to the next. (runwayml.com) (docs.cloud.google.com)
If this approach holds up outside demos, creative work starts to look less like “generate from scratch” and more like “generate from a kit.” The model becomes a production tool that remixes approved ingredients into new ads, clips, and catalog images without losing the original look. (x.com) (kive.ai)
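To make the "kit" idea concrete, here is a minimal sketch of how a bundle of stacked references might be represented in code. It is not Rahman's implementation or any vendor's API. The names `Reference`, `ReferenceKit`, `anchors`, and the file names are hypothetical, and the only detail taken from the demo is the cap of 12 references across image, video, audio, and text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List

# Cap taken from the demo description; everything else here is illustrative.
MAX_REFERENCES = 12


class Modality(Enum):
    IMAGE = "image"
    VIDEO = "video"
    AUDIO = "audio"
    TEXT = "text"


@dataclass(frozen=True)
class Reference:
    """One input the generator should draw on (hypothetical structure)."""
    modality: Modality
    source: str   # file path, or the text itself for TEXT references
    anchors: str  # what this reference is meant to lock, e.g. "shoe shape"


@dataclass
class ReferenceKit:
    """A bundle of references handed to the model together."""
    references: List[Reference] = field(default_factory=list)

    def add(self, ref: Reference) -> None:
        # Refuse to grow past the cap instead of silently dropping inputs.
        if len(self.references) >= MAX_REFERENCES:
            raise ValueError(f"a kit holds at most {MAX_REFERENCES} references")
        self.references.append(ref)

    def by_modality(self) -> Dict[Modality, List[Reference]]:
        # Group references so each medium can be handled in its own form.
        grouped: Dict[Modality, List[Reference]] = {m: [] for m in Modality}
        for ref in self.references:
            grouped[ref.modality].append(ref)
        return grouped


def build_example_kit() -> ReferenceKit:
    """Mirror the article's example: one reference per modality."""
    kit = ReferenceKit()
    kit.add(Reference(Modality.IMAGE, "sneaker_packshot.png", "shoe shape"))
    kit.add(Reference(Modality.VIDEO, "orbit_shot.mp4", "camera motion"))
    kit.add(Reference(Modality.AUDIO, "sample_track.wav", "pacing"))
    kit.add(Reference(Modality.TEXT,
                      "Place the sneaker on a wet city street at night.",
                      "what to change"))
    return kit
```

Each reference carries a note about what it is meant to lock, so a brand team could reuse the same approved kit across many generations and change only the text reference between runs.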