HiDream Raw Output Failed, Tried Dev-2604, VRAM Math Killed It, Won with a Prompt Enhancer Instead
shinji shimi · 2026-05-23 · via DEV Community

TL;DR

  • HiDream-O1-Image 8B Full raw outputs collapse on plain Japanese prompts — both instruction-following and aesthetics fail at once
  • Tried to swap to Dev-2604 (preference-tuned, 3.5× faster). It's better aesthetically but the gap is small in our use case, and worse — the 96GB GPU can't host both models alongside the rest of the stack
  • Pivoted away from model swap entirely. Stuck with Full + a Gemini Flash Lite prompt enhancer that bolts aesthetic polish on top
  • Along the way, found four non-obvious HiDream pitfalls (brand names get rendered as literal text, "cute" triggers childlike body bias, "Wong Kar-wai" hallucinates Korean captions, "idol-class" auto-generates caption text) — all baked into the enhancer's system prompt
  • Same plain Japanese prompt now produces a usable photoreal or anime variant from a single click. No model swap, no extra VRAM, no extra latency.

Act 1: "Raw output is busted"

Kotonia Studio runs HiDream-O1-Image 8B Full on a local GPU (RTX PRO 6000 Blackwell Max-Q, 96GB) and offers free T2I. Normally outputs are clean. But one day, a plain Japanese prompt — "a cute woman in a cheongsam, holding a fan, smiling" — returned this:

[Image: raw-kimono-failure]

What went wrong:

  • Asked for a cheongsam, got a kimono. Chinese attire drifted to Japanese.
  • Face isn't pretty. We wanted idol-class beauty.
  • Composition is generic full-body in a Kyoto-style garden. We wanted a closer crop showing the fan texture.

HiDream-O1 is a top-tier OpenWeight model: careful English prompts produce magazine-grade 2048×2048 outputs. So this isn't "the model is bad." It's a gap between what users type and what an OpenWeight model expects. Frontier models (Gemini Imagen / DALL-E / Midjourney) absorb natural-language nuance internally. OpenWeight models receive the prompt exactly as typed, with nothing in front to rewrite it.

Either give up on the raw-output UX, or do something about it.

Act 2: Maybe Dev-2604 will save us?

Then I noticed HiDream-O1-Image-Dev-2604, a new variant released in May 2026. It debuted at #8 on the Artificial Analysis T2I Arena and runs 3.5× faster than Full, at 28 steps with no CFG.

Arena ranks models on human aesthetic preference. So Dev should be preference-tuned for "what looks good."

Hypothesis:

  • Dev returns magazine-grade output even on vague Japanese prompts
  • 3.5× speed improvement makes /studio snappier
  • Best case: deprecate Full, run Dev only

Phase 1 bench: 5 generic cinematic prompts (Tokyo izakaya, Bangkok night market, anime character, text-in-image, portrait), Full vs Dev-2604:

mode       | Full (s) | Dev-2604 (s) | speedup
T2I (avg)  | 33.1     | 9.5          | 3.5×
Edit (avg) | 79.0     | 22.2         | 3.6×
IP         | 84.3     | 23.8         | 3.5×

On generic prompts, Dev is faster and impressionistically nicer. "OK, Dev is the answer" — that's where I almost stopped at the end of Phase 1.
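As a sanity check on the table, the speedup column is just Full latency divided by Dev-2604 latency. A minimal Rust sketch that recomputes it from the reported averages; `PHASE1`, `speedup`, and `round1` are illustrative names, not part of the actual bench harness:

```rust
/// Speedup of Dev-2604 over Full: Full latency divided by Dev latency.
fn speedup(full_secs: f64, dev_secs: f64) -> f64 {
    full_secs / dev_secs
}

/// Rounds to one decimal place, the precision used in the table.
fn round1(x: f64) -> f64 {
    (x * 10.0).round() / 10.0
}

/// (mode, Full seconds, Dev-2604 seconds) as reported in the Phase 1 table.
const PHASE1: [(&str, f64, f64); 3] = [
    ("T2I (avg)", 33.1, 9.5),
    ("Edit (avg)", 79.0, 22.2),
    ("IP", 84.3, 23.8),
];

/// Recomputes the speedup column, one entry per mode.
fn phase1_speedups() -> Vec<(&'static str, f64)> {
    PHASE1
        .iter()
        .map(|&(mode, full, dev)| (mode, round1(speedup(full, dev))))
        .collect()
}
```

The recomputed ratios agree with the table: roughly 3.5× across all three modes, with Edit a hair higher.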

Act 3: But on the use case, the gap is thin — and Edit performance drops hard

I almost locked in a wrong conclusion. Kotonia's actual strategy is "comedy-style short videos with idol-class beauty hooks." The fact that Dev wins on generic cinematic doesn't mean it wins on character-driven comedy with expression specificity.

Built 8 new prompts inspired by Grok-generated reference images (cinematic editorial Asian beauty / anime qipao / cinematic hanfu / cosplay maid / etc), in vertical 1440×2560 (9:16) framing, and re-benched.

Some of the Grok reference images (the level of polish we wanted to match):

[Images: Editorial portrait (grok-ref-editorial) | Cinematic hanfu (grok-ref-hanfu)]

The bench result: Full wins on instruction-following. By category:

  • editorial portrait: tied; Dev maybe a touch nicer aesthetically
  • anime qipao: Full's cel-shading wins decisively. Dev drifts to semi-realistic and ignores the "anime" instruction
  • hanfu brocade: Dev hallucinated the literal word "SAVE" onto the parasol (text artifact)
  • comedy surprised face: Full produces a more cartoonish exaggerated expression + readable caption text
  • comedy deadpan: Full nails the "really?" deadpan expression with crisp eyeliner

Dev-2604 traded instruction-following for aesthetic polish. It appears to have been preference-tuned toward magazine-style fashion photos, so on non-magazine use cases it pulls outputs back toward "magazine-looking" against the prompt's intent.

"Both fine, marginal gap" example: editorial portrait

The category I marked "tied" — same portrait prompt, Full vs Dev outputs side by side:

[Images: Full, tight crop, dramatic (portrait-full) | Dev-2604, wider, magazine-polished (portrait-dev)]

Full leans high-contrast and moody (window-side Rembrandt light, dark library background). Dev leans soft and editorial (seated half-body, natural light, smoother skin retouch). Both are usable; Dev is slightly gentler. That's it.

Not enough of a gap to justify the cost of model swapping (VRAM, load time, architectural complexity). That's the conclusion Phase 2 drove me to.

The decisive blow: Edit and IP performance crater

Generic T2I alone might have left Dev viable. But the gap on Edit and IP (character consistency) was stark, and that's what finally killed the model-swap idea.

We took a T2I output with three people in a dark alley with lanterns, and ran the Edit instruction "Same scene, same characters, same composition. Change the weather to a heavy rainy evening; the characters now wearing translucent rain ponchos."

[Images: Full, scene preserved, weather changed (edit-full-weather) | Dev-2604, abandoned the source scene entirely (edit-dev-weather)]

Full followed the instruction: three people, rain ponchos, rainy alley. Dev replaced the reference entirely with a single woman in a kimono at a snowy temple gate — neither following the text instruction nor preserving any structural detail from the reference. This is past "weak edit fidelity"; it's "not functioning as an edit."

IP (character consistency) showed the same pattern. We handed the model two face photos and asked for "the same two people standing together on an autumn path in Kyoto."

[Images: Full, identities mostly preserved (ip-full-cast) | Dev-2604, different people generated (ip-dev-cast)]

Full keeps the two faces recognizable. Dev generated two different people. The preference-tuning likely prioritizes "produce pretty faces" over "preserve the reference's identity."

The official README spells this out: "For editing tasks we recommend using the full model." Phase 1 Edit timing was Full 79s / Dev 22s. Dev is fast, but its outputs are unusable for Edit/IP.

So Dev isn't a clear win. But it's not a clean loss either — it's faster (3.5×), and on cinematic atmosphere shots it does look better. Maybe I need to use both, switched per use case?

Act 4: VRAM math kills "use both"

"Just keep both models resident on GPU" sounds clean. Then I actually pulled up the GPU memory budget for the single 96GB GPU we run everything on:

Co-resident process              | resident VRAM | peak VRAM
E4B (reviewer LLM)               | 19.6 GB       | 19.6 GB
31B Gemma 4 NVFP4 (orchestrator) | 38.0 GB       | 38.0 GB
TTS server (Irodori + Whisper)   | 9.6 GB        | 9.6 GB
Ditto-TalkingHead                | 3.0 GB        | 3.0 GB
LTX-2 A2V (cold-start, fp8-cast) | 0.9 GB        | 24.0 GB (during inference)
HiDream Full (resident)          | 16.4 GB       | 17.3 GB
Total                            | 87.5 GB       | 111.5 GB ← when LTX-2 fires

The moment LTX-2 video generation fires, we're already past the 96GB line on paper (111.5 GB if every peak lands at once). Adding Dev-2604 as a second resident model means +16.4 GB → total ~128 GB → impossible.
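The budget check is simple enough to script. A minimal Rust sketch using the figures from the table, stored in tenths of a GB so the sums stay exact; `Proc`, `GPU_BUDGET`, and the helper names are illustrative, and treating Dev-2604 as a flat 16.4 GB (same as Full's resident size) is the assumption the text makes, not a measured number:

```rust
/// One GPU process, with VRAM in tenths of a GB (19.6 GB -> 196).
struct Proc {
    name: &'static str,
    resident: u32,
    peak: u32,
}

/// The 96 GB card, in tenths of a GB.
const GPU_BUDGET: u32 = 960;

/// Figures copied from the table above.
fn current_stack() -> Vec<Proc> {
    vec![
        Proc { name: "E4B", resident: 196, peak: 196 },
        Proc { name: "31B Gemma 4 NVFP4", resident: 380, peak: 380 },
        Proc { name: "TTS server", resident: 96, peak: 96 },
        Proc { name: "Ditto-TalkingHead", resident: 30, peak: 30 },
        Proc { name: "LTX-2 A2V", resident: 9, peak: 240 },
        Proc { name: "HiDream Full", resident: 164, peak: 173 },
    ]
}

fn total_resident(stack: &[Proc]) -> u32 {
    stack.iter().map(|p| p.resident).sum()
}

fn total_peak(stack: &[Proc]) -> u32 {
    stack.iter().map(|p| p.peak).sum()
}

/// True only if the worst case (every process at peak at once) fits.
fn fits_at_peak(stack: &[Proc]) -> bool {
    total_peak(stack) <= GPU_BUDGET
}

/// Names of processes whose peak exceeds their resident footprint.
fn spikes_at_peak(stack: &[Proc]) -> Vec<&'static str> {
    stack
        .iter()
        .filter(|p| p.peak > p.resident)
        .map(|p| p.name)
        .collect()
}
```

Pushing a hypothetical `Proc { name: "Dev-2604", resident: 164, peak: 164 }` onto the stack takes the summed peak from 111.5 GB to 127.9 GB, which is where the "impossible" verdict comes from.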

Options enumerated:

  1. Both resident: impossible (OOM, see above)
  2. Both cold-start: +22s load per request. Against 33s of inference that's a big hit; idle VRAM drops to 0GB, but first-touch UX collapses
  3. Dev resident + Full cold-start: Dev as primary + Full for edit/IP. But Phase 2 invalidated that premise
  4. Full resident + Dev cold-start: Occasionally switch to Dev, eat 22s load each time
  5. Drop Dev, keep Full only: status quo, no speedup gained

From a service-viability standpoint, options 1-4 each force one of two sacrifices: make free users wait an extra 22s, or shrink VRAM headroom until LTX-2 / 31B can't run. Running a single GPU as a solo operator means the budget is tight: Dev's marginal aesthetic gain doesn't justify breaking the rest of the stack.
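To put rough numbers on the cold-start options, here is a small Rust sketch. It assumes the single 22s load figure applies to whichever model is being cold-started (only one load time is quoted) and uses the Phase 1 T2I averages for inference; `Placement`, `LOAD_SECS`, and `first_touch_secs` are illustrative names:

```rust
/// Whether a model's weights are already on the GPU when a request arrives.
#[derive(Clone, Copy)]
enum Placement {
    Resident,
    ColdStart,
}

/// Load time quoted in the options list; assumed identical for both models.
const LOAD_SECS: f64 = 22.0;

/// Latency a user sees on a request: inference alone if the model is
/// resident, load plus inference if it has to be cold-started.
fn first_touch_secs(placement: Placement, inference_secs: f64) -> f64 {
    match placement {
        Placement::Resident => inference_secs,
        Placement::ColdStart => LOAD_SECS + inference_secs,
    }
}
```

Under these assumptions a cold-started Dev request lands at about 31.5s (22 + 9.5), roughly the same as a resident Full request at 33.1s, so the 3.5× speedup is mostly spent on loading. A cold-started Full request would be about 55.1s.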

I decided to abandon the model-swap path entirely.

Act 5: Can we just beat this with prompts?

Step back. What was Dev actually winning on?

Just aesthetic polish. Instruction-following is better on Full.

So if I can keep Full's instruction-following while bolting aesthetic polish onto the output, model swap isn't needed.

Concrete approach: append an aesthetic anchor (a "magic suffix") to the prompt to steer Full's output toward magazine-quality.

Trade-offs:

  • ✅ Zero VRAM cost (Full only)
  • ✅ Inference time unchanged (33s/image)
  • ✅ Edit/IP/skeleton/layout still work on Full (avoiding the Dev performance cliff from Act 3)
  • ✅ No 22s Dev cold-start penalty
  • ⚠️ Risk: do anchors actually work?

Phase 3 — tried 4 anchor variants on Full:

  • v1 Lindbergh: "Vogue cover composition, Peter Lindbergh editorial photography..."
  • v2 cinematic: "Roger Deakins anamorphic, blockbuster color grade..."
  • v3 K-beauty: "Vogue Korea / ELLE Korea aesthetic, glass-skin glow..."
  • v4 combined: kitchen-sink

3 base prompts × (baseline + 4 anchors) = 15 generations. And three deeply non-obvious HiDream behaviors surfaced.
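The generation grid itself is a plain cross product. A minimal Rust sketch, assuming anchors are appended after a comma (the exact separator used in the bench isn't stated) and with `bench_matrix` as an illustrative name:

```rust
/// Builds the prompt list for an anchor bench: for each base prompt,
/// the unmodified baseline followed by one variant per anchor.
fn bench_matrix(bases: &[&str], anchors: &[&str]) -> Vec<String> {
    let mut prompts = Vec::with_capacity(bases.len() * (anchors.len() + 1));
    for base in bases {
        // Baseline: the base prompt with no anchor at all.
        prompts.push(base.to_string());
        for anchor in anchors {
            // Variant: base prompt with the anchor appended as a suffix.
            prompts.push(format!("{}, {}", base, anchor));
        }
    }
    prompts
}
```

With 3 base prompts and 4 anchors this yields 3 × (1 + 4) = 15 prompts, matching the count above. Keeping the baseline in the same grid makes it easy to tell an anchor's effect apart from ordinary run-to-run variation.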

Pitfall 1: Brand names get rendered as literal text on the image

Any anchor containing "Vogue" or "ELLE" produced outputs with "VOGUE" appearing in printed magazine-cover text on the image itself — top-right corner, in front of the subject. Worse on anime: the cel-shaded character had a magazine layout overlaid on top.

HiDream-O1 is SOTA on CVTG-2K (complex visual text generation). The strong text-rendering training means any brand name in the prompt gets a near-guaranteed shot at being literally generated as text on the canvas.

Strip brand names from anchors completely. The photographer, director, and studio names that passed A/B (Lindbergh, Deakins, Mihoyo) are safe; trademarks are landmines.
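A cheap guard is to lint every anchor against a denylist before it reaches the model. A minimal Rust sketch; `banned_terms_in` is an illustrative name and the denylist passed in is up to you. It matches whole words only, so "ELLE" does not fire on "excellent", and as a consequence it does not handle multi-word names:

```rust
/// Returns the banned terms (lowercased) that appear as whole words in
/// `anchor`. Matching is case-insensitive and single-word only.
fn banned_terms_in(anchor: &str, banned: &[&str]) -> Vec<String> {
    let lower = anchor.to_lowercase();
    // Split on anything that is not a letter or digit, dropping empties.
    let words: Vec<&str> = lower
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .collect();
    banned
        .iter()
        .map(|b| b.to_lowercase())
        .filter(|b| words.iter().any(|w| *w == b.as_str()))
        .collect()
}
```

Run against the v1 anchor fragment "Vogue cover composition, Peter Lindbergh editorial photography" with a denylist of "Vogue" and "ELLE", it reports "vogue" and lets "Lindbergh" through. A lint like this only catches names you already know are bad; new names still need an A/B run.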

Pitfall 2: Photoreal anchors contaminate anime outputs with magazine paper

When anime base prompts were paired with photoreal anchors (v1-v4), the output looked like a cel-shaded anime character with a literal VOGUE magazine cover layout overlaid on top.

When style hints conflicted in these runs, the model overlaid both elements on the canvas rather than blending them.

Anime needs its own anchor family (Mihoyo / Kyoto Animation / theatrical anime style) — never reuse photoreal anchors.

Pitfall 3: "Wong Kar-wai" → Korean text hallucination on photoreal scenes

The v5 grok-direction anchor included "Wong Kar-wai-style color grade", and the output rendered Korean text ("신부의 아안" and similar) on the photoreal scene.

Wong Kar-wai is a Hong Kong director with no Korean connection. But the model's internal "Asian arthouse cinema" association routed toward Korean and surfaced as printed text. Director names carry similar risk to brand names — A/B before adopting.

Act 6: Defuse the "cute → child" bias, ship it

Phase 4 rewrote the anchor library:

  • All brand names stripped
  • Only A/B-verified safe names retained (Lindbergh, Deakins, Mihoyo)
  • Separate anime anchor family added (Mihoyo / Kyoto Animation)
  • Anime anchors include "mature young-adult character proportions" to defuse the "cute" → childlike-body bias (a behavior the user had spotted before I even ran the bench)

Re-benched result:

  • photoreal portrait: v3 K-beauty clean — no VOGUE leakage, glass-skin + cinematic light
  • anime: v7 Mihoyo anchor — no magazine contamination, adult proportions preserved
  • ⚠️ comedy caption text handled separately (embrace auto-caption when wanted, post-overlay otherwise)

"Full + cleaned anchors" locked in. Time to wire it into the product.

Implementation: /api/studio/enhance (Gemini Flash Lite)

Added an enhance endpoint in backend/src/handlers/studio.rs. Backed by gemini-3.1-flash-lite (cheap API), not the local 31B Gemma. Why:

  • The 31B local model is 38GB resident — the VRAM budget above already ruled out adding more local LLM weight
  • Flash Lite is $0.075/M input + $0.30/M output. One enhance is roughly 800 in + 400 out tokens = ~$0.0002/call. Effectively free
  • Zero VRAM impact: adding this feature doesn't compete with the rest of the GPU stack
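The per-call cost estimate works out as follows. A minimal Rust sketch using the prices and token counts quoted above; the constant and function names are illustrative:

```rust
/// Flash Lite pricing as quoted above, in USD per million tokens.
const INPUT_USD_PER_MTOK: f64 = 0.075;
const OUTPUT_USD_PER_MTOK: f64 = 0.30;

/// Cost in USD of one enhance call with the given token counts.
fn enhance_cost_usd(input_tokens: u32, output_tokens: u32) -> f64 {
    let input = input_tokens as f64 * INPUT_USD_PER_MTOK;
    let output = output_tokens as f64 * OUTPUT_USD_PER_MTOK;
    (input + output) / 1_000_000.0
}
```

For 800 input and 400 output tokens this gives $0.00006 + $0.00012 = $0.00018, which rounds to the ~$0.0002 per call quoted above, or a bit over 5,500 calls per dollar.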

System prompt encodes everything from Phase 1-4:

const ENHANCE_SYSTEM_PROMPT: &str = r#"You are a prompt enhancer for HiDream-O1-Image.

Rules (learned from A/B benchmarking):

1. NEVER include brand names ("Vogue", "ELLE", "Nike") — HiDream renders them
   as literal text overlays.
2. NEVER use "Wong Kar-wai" — triggers Korean text hallucination.

3. For photoreal portraits, append:
   " High-end Korean fashion magazine photoshoot aesthetic, professional
     beauty retouch, glass-skin glow, ..."

4. For anime / cell-shaded / illustration, append:
   " In the visual style of Mihoyo / HoYoverse key art, semi-painterly cel
     shading, ..., mature young-adult character proportions ..."
   ALSO: if the prompt has "cute girl" / "kawaii girl" without age qualifier,
   normalize to "young woman in her early twenties with adult proportions".

5. For cinematic scenes, append cinematic CG realism anchor (no Wong Kar-wai).
6. For text-design prompts, append no suffix.

Output JSON: { "detected_style": "...", "anchor_applied": "...",
              "enhanced_prompt": "..." }
"#;

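In production these rules are carried out by the LLM, which also detects the style. For unit tests or as an offline fallback, rules 3-6 can be mirrored deterministically. The Rust sketch below is that kind of stand-in, not the shipped handler: `Style`, `anchor_for`, `normalize_subject`, and `enhance` are hypothetical names, the anchor strings are abbreviated to the fragments visible in the prompt (the cinematic one is a placeholder), the caller supplies the style, and the "without age qualifier" check is skipped for brevity.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Style {
    Photoreal,
    Anime,
    Cinematic,
    TextDesign,
}

/// Replacement subject from rule 4.
const ADULT_SUBJECT: &str = "young woman in her early twenties with adult proportions";

/// Abbreviated anchor per style; `None` means "append no suffix" (rule 6).
fn anchor_for(style: Style) -> Option<&'static str> {
    match style {
        Style::Photoreal => Some(
            "High-end Korean fashion magazine photoshoot aesthetic, \
             professional beauty retouch, glass-skin glow",
        ),
        Style::Anime => Some(
            "In the visual style of Mihoyo / HoYoverse key art, semi-painterly \
             cel shading, mature young-adult character proportions",
        ),
        // Placeholder: the prompt names this anchor but does not spell it out.
        Style::Cinematic => Some("cinematic CG realism"),
        Style::TextDesign => None,
    }
}

/// Rule 4 normalization. Case-sensitive, and unlike the real rule it does
/// not check whether an age qualifier is already present.
fn normalize_subject(prompt: &str) -> String {
    prompt
        .replace("kawaii girl", ADULT_SUBJECT)
        .replace("cute girl", ADULT_SUBJECT)
}

/// Applies normalization (anime only) and appends the style's anchor.
fn enhance(prompt: &str, style: Style) -> String {
    let base = if style == Style::Anime {
        normalize_subject(prompt)
    } else {
        prompt.to_string()
    };
    match anchor_for(style) {
        Some(anchor) => format!("{} {}", base, anchor),
        None => base,
    }
}
```

Calling `enhance("a cute girl holding a fan", Style::Anime)` rewrites the subject to the adult phrasing and appends the anime anchor, while a `Style::TextDesign` prompt passes through unchanged.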

UI side: a small "✨ Enhance" button above the prompt textarea on /studio. Click → POST /api/studio/enhance → swap textarea contents for enhanced_prompt + green banner showing detected style + undo link.

Act 7: Won

Same plain Japanese prompt that produced the kimono failure earlier, now run via the Enhance button:

Photoreal anchor applied

[Image: enhanced-photoreal]

Cheongsam intact, close-up framing, idol-class face, glass-skin retouch, magazine lighting.

Anime anchor applied

[Image: enhanced-anime]

Cel-shaded anime style, Chinese architectural courtyard background, adult proportions preserved, fan texture kept.

Same plain Japanese prompt → photoreal and anime variants, one click each. Single model, zero extra VRAM, identical inference time.

Takeaways

Engineering judgment lessons from this exercise:

  • "Model swap" and "prompt engineering" should be compared on the same budget. Without a frontier model, VRAM and service viability constraints dominate model selection. In this case, preserving Full's resident slot was a higher-priority constraint than Dev's aesthetic edge.
  • A/B bench in two stages. Generic prompts → tentative conclusion → use-case prompts → reversal. That's exactly what Acts 2-3 of this story were. Stopping at one stage means you ship the wrong conclusion.
  • Proper nouns are landmines. Models with strong text-rendering training will literally bake trademarks and director names into the canvas. A/B every name before adopting.
  • Cheap LLM prompt enhancers are the strongest move under VRAM pressure. $0.0002/call for a noticeable UX bump. Adding more local LLM weight starves the rest of the stack.
  • Anime and photoreal need separate anchor families. Style hints that conflict get physically overlaid, not blended.

What's next

  • LoRA training: prompt engineering hits a ceiling on anime. Train a custom anime LoRA on HiDream-O1 and let users swap LoRAs per use case ("comedy character expressions," "vertical 9:16 idol portrait," etc).
  • Composition diversity: current anchors over-bias toward "indoor magazine shoot." Need explicit outdoor / urban / cinematic-location variants.
  • A/B testing in prod: instrument /admin/analytics/ to measure enhance-on vs enhance-off retry rate and conversion.

Even for OpenWeight diffusion models, one layer of prompt engineering above the model is enough to lift "raw output failure" into "production quality." If you're putting HiDream-O1-Image into production, dodge these four pitfalls and you're 80% of the way there.


The implementation runs live at kotonia.ai/studio — the "✨ Enhance" button sits above the prompt textarea. Free to try.