惯性聚合 高效追踪和阅读你感兴趣的博客、新闻、科技资讯
阅读原文 在惯性聚合中打开

推荐订阅源

Google DeepMind News
Google DeepMind News
F
Fortinet All Blogs
阮一峰的网络日志
阮一峰的网络日志
Apple Machine Learning Research
Apple Machine Learning Research
爱范儿
爱范儿
WordPress大学
WordPress大学
让小产品的独立变现更简单 - ezindie.com
让小产品的独立变现更简单 - ezindie.com
J
Java Code Geeks
罗磊的独立博客
S
SegmentFault 最新的问题
V
V2EX
V
Visual Studio Blog
钛媒体:引领未来商业与生活新知
钛媒体:引领未来商业与生活新知
美团技术团队
博客园 - 三生石上(FineUI控件)
Stack Overflow Blog
Stack Overflow Blog
Y
Y Combinator Blog
MyScale Blog
MyScale Blog
D
Docker
Google DeepMind News
Google DeepMind News
Blog — PlanetScale
Blog — PlanetScale
M
Microsoft Research Blog - Microsoft Research
Martin Fowler
Martin Fowler
S
Secure Thoughts
B
Blog
cs.CL updates on arXiv.org
cs.CL updates on arXiv.org
www.infosecurity-magazine.com
www.infosecurity-magazine.com
Recent Announcements
Recent Announcements
MongoDB | Blog
MongoDB | Blog
C
Cisco Blogs
C
CERT Recently Published Vulnerability Notes
T
True Tiger Recordings
GbyAI
GbyAI
P
Proofpoint News Feed
P
Privacy International News Feed
Jina AI
Jina AI
The Cloudflare Blog
I
Intezer
AWS News Blog
AWS News Blog
Hacker News - Newest:
Hacker News - Newest: "LLM"
S
Security Archives - TechRepublic
NISL@THU
NISL@THU
The Register - Security
The Register - Security
Recent Commits to openclaw:main
Recent Commits to openclaw:main
P
Palo Alto Networks Blog
S
Schneier on Security
L
LINUX DO - 热门话题
C
CXSECURITY Database RSS Feed - CXSecurity.com
Security Latest
Security Latest
C
Cybersecurity and Infrastructure Security Agency CISA

DEV Community

When the Cleanup Code Becomes the Project Rockpack 8.0 - A React Scaffolder Built for the Age of AI-Assisted Development Mismanaging the Treasure Hunt Engine in Hytale Servers Will Get You Killed Stop Calling It an AI Assistant. It’s Already Managing Your Company Why Hardcoded Automations Fail AI Agents Why I built a post-quantum signing API (and why JWT is on borrowed time) Weekend Thought: Frontend Build Tools Suffer From Work Amnesia A 10-Line Playwright Trick That Saved Me Hours on Every Sephora Run AI Is Changing Engineering Culture More Than We Realize Everyone Was Focused on Gemini, But Infinite Scaler Was the Real Twister "Gemma 4 Analyzed My Bank Statements – Apparently I 'Have a Problem' with Coffee and Late-Night Apps" #css #webdev #beginners #codenewbie The Hidden Layer Every AI Developer Must Learn AlphaEvolve: Google DeepMind's Gemini-Powered Evolutionary Coding Agent RDS Reserved Instance Pricing: Every Engine, Every Rule, Real Dollar Savings How To Build An AI-Powered MVP Without Burning Your Startup Budget In 2026 Reading a Psychrometric Chart Without Getting Lost LMR-BENCH: Can LLM Agents Reproduce NLP Research Code? (EMNLP 2025) How to turn text into colors (without AI) Building Real-Time Apps in Node.js with Rivalis: WebSockets, Rooms, Actors, and a Binary Wire This Week In React #282 : Security, Fate, TanStack, Redux, Jotai | Hermes-node, Expo, Rozenite, Harness | TC39, Bun, pnpm, npm, Yarn, Node AI Copilot vs AI Agent Architecture - What's Actually Different (And Why It Matters) Smart Contract Security: NEAR's Futures Surge and AI Token Risks Database Maintenance: Tracing Production Incidents to Their Root Cause Stop juggling AI SDKs in PHP — meet Prisma Google Quietly Changed What “Apps” Mean at I/O 2026 The Infrastructure Team Is the Real Single Point of Failure Building SQLite from Scratch: 740 Lines of C++23 to Understand Every Byte of a .db File The 4 Levels of Hermes Agent Scaling Framework: From One Hermes Agent to a Fully Automated Team Your AI Has a Memory. It Just Doesn’t Know What to Remember. Claprec: Engineering Tradeoffs - Limited time vs. Perfection (6/6) Building a Daily Google News API Monitor in Python Building RookDuel Avikal: From Chess Steganography to Post-Quantum Archival Security Google I/O e IA: o que realmente muda na vida do dev? Color Contrast Failures: The Number One Accessibility Issue and How to Fix It # I Watched 15 Hours of Hermes Agent Videos So You Don't Have To Cómo solucionar el bucle infinito en useEffect con objetos y arrays en React The First Agent-Centric Cloud Security Platform — And Why We Didn't Build It That Way On Purpose Most Treasure Hunts Engines on Hytale Servers Are Built to Fail - Lessons from a Burned Database GhostScan v3.0 — From Closed-Source EXE to Open-Source Pentest Framework De hojas de cálculo a IA: construyendo una plataforma SRM moderna When is AI fine in education? Python Tools for Managing API Rate Limits in Data Pipelines How to Implement Exponential Backoff for Rate-Limited APIs in Python "My Web Chat Wasn't a Real Channel. That Broke My Agent Pipeline" next-advanced-sitemap v1.0.7 — safer URL ingestion & automatic trimming for Next.js sitemap generation I keep seeing people build an AI lead processing agent when they really need a 6-step rules engine AI Powered Student Learning Assistant Using Gemma 4 How I Built a Drop-In Proxy to Slash My OpenAI Bills by 20%+ Automatically Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction Five Years Later, I Finally Have 96GB VRAM — What It Actually Unlocks for Agent Loops Running LTX-2.3 Alongside TTS on a Single 96GB GPU with a Cold-Start Architecture Cutting LTX-2 22B Peak VRAM by 40% with fp8_cast — and Why optimum-quanto Was a Trap HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent HiDream-O1-Image 3–8x Faster: Benchmarking Steps, CFG, and Resolution AWS Savings Plan Buying Strategy: How to Layer, Size, and Time Commitments application.properties I built a macro tracker powered by AI + attitude Solace: A Global Mental Health First Responder Built with Gemma 4 Why Blocking Prompt Injection Is Wrong — and What to Do Instead The AI code tools Dutch developers actually use in 2026 (field notes) Automatic Error Recovery in AI Agent Networks You Are Not Choosing Building a Cinematic Adaptive Learning Intelligence with Gemma 4, Gemini, and OpenAI(Powered by Gemma 4) CLAUDE.md for Angular: 13 Rules That Make AI Write Idiomatic, Production-Ready Components I tested 7 vector databases for my RAG stack in 2026, here's the one nobody is talking about (yet) Claude agreed with a false fact I gave it. Confidently. That broke my workflow Google's "Budget" Model Just Beat Its Own Flagship. Here's What That Actually Means for Developers. How I built a monitoring SaaS for Joomla, WordPress & PrestaShop agencies Shifting from Passive Dashboards to Automated Remediation: A Guide to Next-Generation FinOps and CloudZero Alternatives Automating CSV WooCommerce Imports Without Plugins Why Wobbly Plugs and Overheating Outlets Are More Dangerous Than You Think (UL 498 Explained) Building an AI Model Evaluation Pipeline on AWS for Audio Content Generation Your Side Project Is Not a Business Neurodiversity and the two layers of cognition GitHub Internal Repositories Breached: Source Code and Internal Data Allegedly Exfiltrated in 2026 Supply Chain Attack Stop drowning in files: auto-organize your Google Drive with n8n (free workflow JSON) Secure Firmware Updates with a Secure Element: Building Trust Into the Bootloader I Thought Domain-Driven Design Was a Waste of Time. I Was Wrong. AI Content Is Getting Tagged Like Livestock — And That's Actually Good ESP32 Into a Speech-to-Text Device Why Simple Audio Transcription Fails in Healthcare: The Need for Clinical Reasoning Engines The 114KB Span Attribute That Hid Our LCP Data How to Scale AI Development Beyond Prototype Speed Agent Execution Environments: Cloud Sandbox vs Local GUI vs Hybrid AI code review checklist that actually catches problems What’s the best tech stack for AI app development? Arc 1 Recap: Keypairs, Wallets, and Solana Fundamentals How Wearables Are Changing Human Decision-Making (Without Us Realizing It) The Perils of Premature Optimisation in Distributed Treasure Hunts Why Engineers Wear Hoodies While Social Media Sells Perfection Stop Treating setTimeout(fn, 0) Like Magic Save any webhook data to a database automatically with n8n — free workflow JSON Translating an entire multilingual site shouldn't mean re-prompting an LLM for every file I built a Vite plugin that uses AI to author Playwright tests, then gets out of the way Project: Restaurant Delivery CRUD Three weeks after I said CLAUDE.md writes itself, it added 4 more rules without me Why On-Device AI Is Quietly Winning Over Cloud Inference — Three Reasons You Didn't See Coming Trois semaines après avoir dit que mon CLAUDE.md s'écrivait tout seul, il a ajouté 4 règles sans moi
Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline
shinji shimi · 2026-05-22 · via DEV Community

shinji shimizu

TL;DR

Gemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. End-to-end: a 40-second portrait video (512×768) in 25–30 minutes. One local GPU (96 GB Blackwell), zero API cost.

Finished video (already published):

@youtube

Who This Is For

Individual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on the design of chaining multiple models into one operational pipeline.

What I Built

I automated a dark-comedy format — a short-video style I called consent_dilemma — from a one-line idea all the way to a finished 40-second video.

Finished structure:

  • Hook (0–5s): Extreme close-up of a beautiful woman + narrator "The fate of the man who answered 'You're a guy, aren't you'——" + large title overlay
  • Main section (5–37s): Movie theater date → "Can I kiss you?" → "No… stop it…" → dejection → "Why aren't you more assertive? You're a guy, aren't you?" → realization → kiss
  • Punchline (37–40s): Courtroom — "The defendant is sentenced to 3 years for non-consensual intercourse" + gavel "Knock!" + tears in a jail cell

Before / after:

Traditional approach This pipeline
Idea → published video 2–3 days (manual editing) 25–30 minutes (fully automated)
API cost Hundreds of yen per video (DALL-E + video gen) ¥0 (electricity only)
Subtitles Write SRT by hand Auto-split on punctuation and burned in
Hook Shot separately Integrated into the pipeline

Architecture

[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)
[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²
          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)
[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)
          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in

Enter fullscreen mode Exit fullscreen mode

Implementation lives under llm_server/storyboard/ (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).

The 10-Beat consent_dilemma Format

Fixed as a system prompt via CONSENT_DILEMMA_SYSTEM in prompts.py:

# type speaker renderer content
1 provocation b LTX-2 A2V Suggestive invitation
2 ask a LTX-2 A2V Earnest consent check
3 refusal b LTX-2 A2V Soft refusal (ambiguous form like "No… stop it…")
4 dejection a (silent) LTX-2 I2V Dejection
5 gaslight b LTX-2 A2V Contradictory leading statement
6 pause a (silent) LTX-2 I2V Brief realization
7 kiss a (silent) LTX-2 I2V The moment of the kiss
8 verdict judge LTX-2 A2V Fast-paced court verdict
9 gavel_se judge LTX-2 I2V (keep_audio) Gavel + AI-generated "Knock!" sound
10 jail a (silent) LTX-2 I2V Tears in a jail cell

Three key structural choices:

  1. Don't make the refusal a flat "No": Stretch it into something like "No… stop it…" with trailing inflection, conveying the "performative No that doesn't mean No" nuance. This is what makes the gaslight's contradiction land later.
  2. Don't jump straight from gaslight to kiss: Insert a "pause" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve.
  3. Two-stage punchline — verdict then jail: The verdict alone feels abrupt. Showing him crying in a cell makes "he actually got convicted" click.

Hook Design (The TikTok 3-Second Problem)

On portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended before the 10 main beats:

"hook": {
  "title_overlay": "No Means Yes?",
  "narrator_line": "The fate of the man who answered 'You're a guy, aren't you'——",
  "image_prompt": "ultra close-up of beautiful Japanese woman, half-lidded eyes, ...",
  "duration_sec": 3.5
}

Enter fullscreen mode Exit fullscreen mode

Two implementation pitfalls:

Pitfall 1: narrator TTS duration exceeds duration_sec, cutting the audio. The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with ffprobe → pass max(plan_duration, narrator + 0.6) as the I2V duration.

narrator_dur = _ffprobe_duration(narrator_wav)
duration = max(float(hook.get("duration_sec", 0.0)), narrator_dur + 0.6)
ltx_i2v_clip(portrait, i2v_prompt, duration, silent_video, keep_audio=False)

Enter fullscreen mode Exit fullscreen mode

Pitfall 2: drawtext y position. y=h*0.30 (one-third down the screen) overlapped the face. Changed to y=20 (absolute 20 px) to pin the title to the very top.

Subtitle Burn-In (Silent Viewing Support)

Burned-in subtitles for users watching without sound on the train, and for cross-platform reliability.

style = (
    "FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&H00FFFFFF,"
    "OutlineColour=&H00000000,Outline=2,Shadow=0,BorderStyle=1,"
    "Alignment=2,MarginV=60,Bold=1"
)
# ffmpeg -i raw.mp4 -vf "subtitles=subs.srt:force_style='..."

Enter fullscreen mode Exit fullscreen mode

Alignment=2 = bottom center. MarginV=60 gives breathing room from the bottom edge.

Long-line splitting: A line of 30+ characters within one beat covers the face. _split_subtitle splits on 。.!? → greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:

Input:

言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ?

Output (one 8.9s beat split into 2 timed chunks):

Time Subtitle
15.16–19.63s 言葉で確認するのなんてロマンチックじゃないよね。
19.63–24.10s ねえ、もっと積極的になってよ。男の子でしょ?

Using LTX-2 I2V as a Sound Effect Generator (gavel_se)

LTX-2 distilled embeds AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4. Unless you explicitly drop it with ffmpeg -map 0:v:0 -map 1:a:0, whatever the prompt describes comes with sound.

I repurposed this as an SFX generator:

def render_se_tail_beat(sb_dir, beat, prior_clip, work_dir):
    # 1. Extract the last frame of the previous beat
    extract_last_frame(prior_clip, last_frame_png)
    # 2. Feed that image into I2V, request SFX via prompt
    prompt = build_gavel_se_prompt(beat)
    return ltx_i2v_clip(last_frame_png, prompt, duration, clip_path, keep_audio=True)

Enter fullscreen mode Exit fullscreen mode

Added a keep_audio=True flag to ltx_i2v_clip so the audio isn't dropped during ffmpeg re-encoding.

Prompt for gavel_se:

"Single decisive arm motion of the judge bringing the gavel down sharply "
"onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. "
"Brief, contained, no other motion in the frame."

Enter fullscreen mode Exit fullscreen mode

Last frame of the judge + gavel prompt → "Knock!" sound. If that misses, the design falls back to something like the Ace Attorney SFX.

Pitfall Log

Five major pitfalls hit during development:

1. Codex CLI hangs with vLLM 0.20.2

Sending a system prompt + idea via codex exec -p gemma4 hung at 0% CPU for 20+ minutes during the /v1/responses handshake. Piping subprocess output through tail -200 was also suppressing early stderr.

Fix: Dropped Codex entirely, hit /v1/chat/completions directly with urllib.request. Used response_format={"type":"json_object"} to force JSON. plan.json generated in 25 seconds.

2. HiDream won't remove the cinema screen

Even with "The movie screen is BEHIND the camera and NOT VISIBLE in frame" in the setting prompt, the screen persisted in the background through 2048/50 steps.

Fix: Generate scene_base via T2I → feed that same image into I2I edit with a prompt to "replace screen with dark wall, keep character positions identical" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.

3. HiDream turns lips-on-lips into a cheek kiss

With standard prompting, HiDream tends to interpret kiss as a cheek kiss. You need directives at the level of "CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss". Added a dedicated early-return block in _beat_edit_prompt for the kiss beat.

4. CAST / CROP_BOX / SPEAKER_A2V_PROMPT are hardcoded for two characters

Three dictionaries — CAST, CROP_BOX, SPEAKER_A2V_PROMPT — only know a (Kenta) and b (Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via KeyError). Also added branching in render_speech_beat_ltx_a2v so beats with setting_override crop from the beat's own image rather than scene_base.

5. Gemma 4 multimodal judge has too many false positives

storyboard/judge.py sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch obvious failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but hammers FAIL on subtle cases like "subtle shy expression."

In practice: accept and proceed after 3 consecutive FAILs with max-retries 2. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.

VRAM Layout

Breakdown on a 96 GB Blackwell Max-Q:

Process idle (GiB) peak (GiB)
Gemma 4 31B (NVFP4) 38 38
HiDream-O1-Image 16 33
TTS server 3 3
Ditto 3 3
LTX-2 A2V (cold-start fp8-cast) 1 24
LTX-2 T2V/I2V (cold-start) 1 8

All at peak simultaneously = 109 GiB → OOM. Operational flow:

  1. Stage A: Gemma 31B + HiDream idle → peak ~62 GiB
  2. Stage B with judge: Gemma 31B + HiDream peak → ~73 GiB
  3. Before final render: pkill -f "vllm.*gemma" kills Gemma → 38 GiB freed
  4. Stage B final render (2048/50): HiDream peak ~33 GiB
  5. Before Stage C: lsof -ti tcp:8895 | xargs kill kills HiDream → 16 GiB freed
  6. Stage C: LTX-2 + TTS + Ditto → peak ~32 GiB

Explicit kills at stage transitions, and everything fits on one card.

Iteration Loop (Cache Strategy)

Partial regeneration — not "rebuild everything" — is what keeps iteration fast:

# Regen a single beat image (HiDream only)
python -m storyboard.visual --plan ... --out ... --only-beat 7 --steps 50 --resolution 2048

# Partial video regen (TTS + LTX-2)
python -m storyboard.video --dir ... --regen-beats 5,6,7 --skip-review

# Adjust only subtitle or Hook title position
rm _video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt
python -m storyboard.video --dir ... --regen-beats none --skip-review   # ~30 seconds

Enter fullscreen mode Exit fullscreen mode

Cache hierarchy:

  • HiDream beat images (beat_NN_<type>.png) — regenerate individually with --only-beat in ~80 seconds
  • A2V / I2V clips (clip_NN_*.mp4) — invalidated when beat type / speaker / line changes
  • Finished Hook clip (clip_00_hook.mp4) — delete just this when adjusting title position (the heavy LTX-2 I2V hook_silent.mp4 is reused)
  • Subtitle SRT — regenerated every time (~10 seconds)

Title position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.

How This Fits Into Kotonia

Videos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).

Technically, it's an extension of the /studio/ stack (HiDream image generation) into the video direction. The plan is to eventually expose this as /video-studio/ — a one-click Web UI over the same pipeline. Right now it's CLI only.

Related Articles / Want to Try It?