










Google CEO Sundar Pichai speaks during the 2026 Google I/O technology developer conference in Mountain View, California, on May 19, 2026. (Photo by Karl Mondon / AFP via Getty Images)
Google used its latest I/O event this week to introduce Gemini Omni Flash, a new AI model that can take text, photos, video, and audio as inputs, then produce short video clips with audio. It is launching through the Gemini app, Google Flow, and YouTube Shorts, with current clips up to 10 seconds and longer formats planned. Google’s latest video announcements show that the industry is moving beyond yet another text-to-video demo: AI is working its way deeper into the process of video creation.
Early AI video tools worked like most other prompt-to-output generators: type a prompt, get a clip, and if you don’t like it, try again. Gemini Omni Flash moves closer to a video assistant. You can give it existing media, ask it to change that media, and use conversation to guide the result. Google says the Omni family is designed around creating “anything from any input,” with video as the first major format. Reports from Google I/O 2026 say Gemini Omni Flash is launching through the Gemini app, Google Flow, and YouTube Shorts. The Verge also reported that current clips are up to ten seconds, with longer formats planned.
Google is adding to its already somewhat overwhelming line of video-oriented AI models. The company already has Veo, its dedicated AI video model. Veo 3.1 is built for high-fidelity video generation, with native audio, stronger prompt following, cinematic controls, and output options that include 720p, 1080p, and 4K through the Gemini API. Veo 3.1 Lite is the lower-cost version for scaled developer and enterprise use. Flow is Google’s AI filmmaking workspace, Google Vids is the Workspace tool for business videos, and Gemini offers a consumer entry point for casual video creation. Vertex AI and the Gemini API give developers programmatic access to Veo.
Gemini Omni Flash is different in scope. It is the broader multimodal model for creating and editing video through conversation, and it can use text, photos, videos, and audio as starting material. Because it is part of Gemini, it is less a standalone video engine and more a tool for multimodal creation. Omni Flash also benefits from broader Gemini training and world knowledge, which could make it better at interpreting context than a video model that only reacts to prompts.
Together, these options give Google a wide video AI stack for consumers, creators, businesses and developers, though the number of overlapping names and entry points could make the product story harder for users to follow. The combination is powerful, however. A basic AI video tool might turn one prompt into a clip. A more useful system could use all those inputs, create several short videos, revise them through chat, and format them for Shorts.
Given the potential for malicious and harmful use of generated video, Google is also putting safety markers around the output. Google’s AI-generated video content will carry SynthID watermarks and content verification tools.
Creators, marketers, agencies, studios, and software platforms now know that AI can produce a good-quality video clip. The harder question is whether AI can take a product image, a brand guide, a voice memo, three customer reviews, a half-finished storyboard, and yesterday’s top-performing ad, and turn all of that into usable video assets that can be revised, tested, localized, approved and shipped.
Google says Gemini Omni Flash can create video from text, images, audio, and video, with the larger Gemini Omni family built around the idea of creating “anything from anything.” Google also says the model brings Gemini’s reasoning together with media generation and editing.
While Veo remains Google’s dedicated video model, with Veo 3.1 focused on video quality, native audio, realism, and creative control, Gemini Omni Flash points toward broader use. It can use different media inputs, generate video with audio, and support conversational editing. This makes it less of an output-oriented tool and more of a visual editor with memory. The shift is similar to how agentic coding tools like Claude Code and OpenAI’s Codex have moved away from one-time outputs toward managing the whole process.
Using Gemini Omni, a marketer could ask for three YouTube Shorts based on a product photo and a customer quote. A founder could feed in a rough iPhone clip and ask for a cleaner version that keeps the same energy. A retailer could request twenty variants of the same seasonal promotion, each tuned for a different buyer segment. The machine goes beyond generating pixels and helps manage the production process complexity that sits between the idea and the publish button.
Other vendors are building the workspace around AI video, increasing competition in the market and the potential for confusion among customers. Higgsfield offers an AI video generator and studio where users can access several major models in one place, including Kling 3.0, Veo 3.1, Sora 2, Seedance 2.0, Wan 2.7 and others, then compare outputs, control camera moves, manage motion, and shape style without leaving the platform.
Magnific, formerly Freepik, is taking a related route from the creative asset side. The renamed platform now combines AI image and video generation, 4K video with audio, upscaling, enhancement tools, collaboration, 3D and virtual scene tools, an AI assistant, training, and a library of more than 250 million creative assets. That makes Magnific less like a pure video model company and more like a full creative production suite. Its advantage comes from starting with a huge base of stock imagery, design assets, and creative users, then layering AI generation and editing on top.
Runway, Luma, and similar tools are also focusing on process and workflow, offering model choice, repeatable styles, character consistency, camera control, brand assets, collaboration, templates, approvals and output quality. Chinese models from ByteDance and Kuaishou add more pressure, with Seedance and Kling pushing features such as multimodal inputs, multi-shot generation, native audio, lip sync, and faster short-form video creation.
The broader market is splitting into two camps. Vendors such as Higgsfield, Magnific, and Runway are building workspaces around the models, while Google and OpenAI have focused on frontier models and direct product surfaces. OpenAI pushed Sora 2 as a flagship video and audio generation model with synchronized dialogue and sound effects, although OpenAI’s own page now says the Sora product is no longer available, and its developer documentation says the Sora 2 video models and Videos API are deprecated and will shut down on September 24, 2026.
As AI moves further upstream into the production and creative process, voices from the creative economy are growing more concerned. Many are wondering if AI video will replace filmmakers and production crews.
AI video has now reached the feature-film proof-of-concept stage. Hell Grind, a 95-minute AI-generated sci-fi action film from Higgsfield AI, screened around Cannes in May 2026. The Wall Street Journal reported that the film was made in roughly two weeks for about $500,000, with $400,000 spent on AI compute. The production still required a 15-person team and heavy human direction, with the first 25 minutes alone reportedly taking more than 16,000 initial generations, later cut down to 253 final shots. While Hell Grind does not prove that a studio can type “make an AI movie” and receive a finished feature, it does illustrate that AI video can now support longer-form production when people supply detailed prompts, creative judgment, editing discipline, and enough computing power.
The cause for concern is real because much of the work in video production is expensive and repetitive. It often lives across multiple applications, conversation threads, editing timelines, asset folders and approval queues. As studios and production houses adopt more AI, they will no doubt be drawn toward production-focused tools.
Human creativity of course still matters. Creative style, judgment, timing, narrative instinct, risk sense, legal caution and audience knowledge become the scarce goods. While a machine can generate ten options, someone still has to know which one feels cheap, which one feels uncanny, which one violates the brief, and which one might actually sell.
On the risk side, the more precise these tools become, the easier it gets to create convincing synthetic people, fake endorsements, unauthorized likenesses and brand-unsafe media. Hard questions remain. What data trained the model? What likeness rights are protected? Can outputs be traced? Can a company prove what was generated, who approved it, and what assets went into the final cut? Can the workflow stop a rogue campaign before it reaches customers?
So while pretty clips drew everyone’s attention to AI-generated video, the real future lies in controlled production.