HiDream Skeleton Mode: Prompt Beats OpenPose Ref — 8 Patterns Benchmarked
shinji shimi · 2026-05-22 · via DEV Community

TL;DR

After benchmarking HiDream-O1-Image (released 2026-05, OpenWeight 8B, ranked #8 on Artificial Analysis Text-to-Image Arena) across 8 skeleton (try-on) mode patterns plus 3 layout patterns, three counterintuitive findings emerged.

  1. Passing an openpose ref actually locks the pose to the ref's composition. When you want dynamic poses, dropping the openpose ref and specifying the pose via prompt is more effective.
  2. Using 6 refs (face + bg + pose + parts, the full set) compresses each ref down to 768px, degrading fine details. Keeping it to 3–4 refs maintains 1024px and produces better quality.
  3. The README-recommended shift=1.0 is strictly for try-on use. For pose/outfit swaps use shift=2.0-2.5; for complete scene replacement use shift=3.0.

Reading pipeline.py reveals that there is no dedicated code path for skeleton mode. Both /generate/skeleton and /generate/ip go through exactly the same multi-ref pipeline internally, and whether a ref is a face, background, openpose, or clothing is communicated only through the prompt. That's the root cause of everything.


Motivation

After running HiDream-O1-Image on a local GPU (RTX PRO 6000 Blackwell, 96 GB) and integrating it into our own platform, we hit a problem: skeleton (try-on) mode wasn't following prompt instructions. Writing "jump with both hands raised" only produced stiff, upright try-on photos.

Suspecting guardrails (NSFW filters, safety policies, etc.), I grepped for safety|nsfw|guard|filter|moderate|censor. HiDream's codebase has none of that (the only hit was CSS backdrop-filter: blur). As expected from an MIT-licensed OpenWeight model, no censorship.
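
The same search is easy to repeat from Python. This is a minimal sketch, assuming the repository is checked out locally; the function name `scan_for_guardrails`, the file-extension filter, and the case-insensitive matching are illustrative choices, not something taken from HiDream.

```python
import re
from pathlib import Path

# Same alternation as the grep above; case-insensitive matching is an assumption.
GUARDRAIL_PATTERN = re.compile(r"safety|nsfw|guard|filter|moderate|censor", re.IGNORECASE)


def scan_for_guardrails(root, suffixes=(".py", ".js", ".css", ".html")):
    """Return (path, line_number, line) for every line matching the pattern."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for line_number, line in enumerate(text.splitlines(), start=1):
            if GUARDRAIL_PATTERN.search(line):
                hits.append((str(path), line_number, line.strip()))
    return hits
```

A hit such as `backdrop-filter: blur(...)` in a stylesheet is a false positive of the kind described above, so the matched lines still need a quick read.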

So what's actually wrong? Here's what I found after reading pipeline.py and running 8 + 3 patterns on real hardware.


Environment

  • GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (96 GB VRAM)
  • PyTorch: 2.12.0 + CUDA 13.0
  • flash-attn: 2.8.3 (sm_120-only build)
  • Model: HiDream-O1-Image Full (8B, bf16, ~16.4 GiB resident)
  • Inference server: custom Python BaseHTTPRequestHandler (port 8895)
  • Resolution: pipeline internal bucket forces snap to 2048×2048

Measured wall time per 50-step generation:

| Mode | Time | Iteration speed |
|---|---|---|
| t2i (no ref) | ~33s | 1.52 it/s |
| edit (1 ref) | ~76s | 1.01 it/s |
| skeleton (multi ref) | ~84s | 1.34 it/s |
| ip (multi ref) | ~76s | 1.81 it/s |
| layout (multi ref + bbox) | ~83s | 1.21 it/s |

Test Assets

The HiDream repo's assets/IP_skeleton/ includes a full skeleton set. These are used as-is for all tests.

| Ref | Content | Intended role |
|---|---|---|
| face | Person's face photo | Identity reference |
| openpose | Stick figure in OpenPose format | Pose specification |
| bg | Background photo (interior) | Scene reference |
| sweater, boots | Clothing parts (sweater, boots) | Outfit reference |

8-Pattern Skeleton Benchmark

Each pattern calls /api/studio/skeleton (i.e., generate_image() with skeleton-mode-equivalent arguments). Steps and seed are fixed (50 steps, seed=42) unless noted; the patterns vary the refs, the prompt, shift, and guidance_scale.

A — Baseline (README defaults, all 6 refs)

curl -X POST http://localhost:8895/generate/skeleton \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
    "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
    "shift": 1.0, "seed": 42
  }'

A_baseline

Result: The bg ref's walls and shelves are reproduced exactly. Pose also matches the openpose ref's upright stance. Faithful as a try-on, but zero freedom of movement.

B — Higher shift (same 6 refs, shift=2.5)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Create a realistic try-on image of the person wearing the provided clothing.",
  "ref_image_paths": ["face","bg","openpose","part_1","part_2","part_3"],
  "shift": 2.5, "seed": 42
}'

B_shift25

Result: Shelves fade slightly, character design shifts a bit. Background still sticks to the bg ref. Raising shift alone can't fully break the bg ref's pull.

C — Raise guidance too (shift=2.5, guidance=7.0)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "...",
  "ref_image_paths": [...6 refs...],
  "shift": 2.5, "guidance_scale": 7.0, "seed": 42
}'

C_shift25_g70

Result: The necklace deforms strangely. Raising guidance_scale to 7.0 starts producing artifacts; the Full model's sweet spot is 5.0.

D — Trim to 3 refs (face + openpose + sweater) + specific prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "A young Asian woman wearing a gray oversized sweater dress, standing in a relaxed pose, full body shot, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","openpose","part_1"],
  "shift": 2.0, "seed": 42
}'

D_3refs_specific

Result: Major improvement. Background becomes a clean white studio, outfit is preserved, pose looks natural. Removing the bg ref made the biggest difference. This is what a correct try-on output should look like.

E — 4 refs + numbered-ref prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body try-on photograph. Subject: the woman from image 1. Pose: identical to the skeleton in image 2. Wearing: the gray oversized knit sweater dress shown in image 3, brown leather ankle boots shown in image 4. Studio lighting, plain background.",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 42
}'

E_numbered_refs

Result: Quality on par with D; boots reflected (somewhat subtly). Numbering refs in the prompt does help, but the effect isn't dramatic.

F — Drop openpose, specify pose via prompt

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body photograph of the woman wearing the gray sweater dress and brown ankle boots, dynamic dancing pose with both arms raised above her head, joyful expression, photo studio with white seamless background, professional lighting.",
  "ref_image_paths": ["face","part_1","part_2"],
  "shift": 2.5, "seed": 42
}'

F_pose_via_prompt

Result: 🏆 Both-arms-raised jump, complete success. Dynamic motion only appeared when the openpose ref was removed and the pose was specified purely via prompt. This suggests that the openpose ref suppresses prompt-driven pose.

G — Face only + freeform prompt (full outfit swap)

/generate/skeleton validates that at least 2 refs are supplied, so this pattern uses /generate/ip instead:

curl -X POST http://localhost:8895/generate/ip -d '{
  "prompt": "Elegant full-body portrait of the woman wearing a vibrant red sequined evening gown with a thigh-high slit, standing confidently with one hand on her hip, soft cinematic lighting, dark blurred background.",
  "ref_image_paths": ["face"],
  "shift": 3.0, "seed": 42
}'

G_outfit_freeform

Result: 🏆 Red evening gown generated perfectly. Facial identity preserved; everything else is free. Face-only + shift=3.0 is the maximum-freedom pattern.

H — Same config as E, seed=999 (variance check)

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body try-on photograph. ...",
  "ref_image_paths": ["face","openpose","part_1","part_2"],
  "shift": 2.0, "seed": 999
}'

H_seed999

Result: Marginal difference from E; boots come out more clearly brown. Varying the seed mainly shifts fine details, so in production it is worth running 3–5 seeds and picking the best of N.
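
The seed sweep can be scripted rather than typed as repeated curl calls. The following is a minimal standard-library sketch, assuming the same server on localhost:8895 and the JSON fields shown in the curl examples; the helper names (`build_payload`, `seed_sweep`, `post_json`), the third seed value, and the handling of the response as raw bytes are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8895"  # server address used in the curl examples


def build_payload(prompt, refs, shift, seed, guidance_scale=None):
    """Assemble the JSON body used by the curl examples."""
    payload = {"prompt": prompt, "ref_image_paths": list(refs), "shift": shift, "seed": seed}
    if guidance_scale is not None:
        payload["guidance_scale"] = guidance_scale
    return payload


def seed_sweep(prompt, refs, shift, seeds):
    """One payload per seed; every other field stays fixed."""
    return [build_payload(prompt, refs, shift, seed) for seed in seeds]


def post_json(endpoint, payload, timeout=300):
    """POST a payload and return the raw response body (format not assumed)."""
    request = urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    payloads = seed_sweep(
        "Full body try-on photograph. ...",  # elided exactly as in pattern H
        ["face", "openpose", "part_1", "part_2"],
        shift=2.0,
        seeds=[42, 999, 1234],  # 1234 is an arbitrary extra seed
    )
    for payload in payloads:
        print(json.dumps(payload))
        # post_json("/generate/skeleton", payload)  # enable with the server running
```

Picking the best of N remains a manual step here; the sketch only guarantees that the runs differ in nothing but the seed.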


Layout Mode Quick Look (3 Bonus Patterns)

layout_bboxes lets you specify where multiple subjects appear in the image using relative coordinates [x1, x2, y1, y2]. Here's the actual behavior.

Input refs are face photos of two people (female, male):

ref female ref male

L1 — Side by side (female left, male right)

"layout_bboxes": "[[0.0,0.5,0.1,0.95],[0.5,1.0,0.1,0.95]]"

L1

Result: Left and right were swapped (male left, female right). Correspondence between ref order and bbox order is not guaranteed.

L2 — Top/bottom split (female top, male bottom)

"layout_bboxes": "[[0.2,0.8,0.0,0.5],[0.2,0.8,0.5,1.0]]"

L2

Result: Female appears in the background, male in the foreground — a depth-layered composition rather than a literal top/bottom split.

L3 — Size difference (female large, male small)

"layout_bboxes": "[[0.1,0.65,0.1,0.95],[0.7,0.97,0.05,0.45]]"

L3

Result: Both subjects rendered at nearly the same size, side by side. Bbox size does not control relative scale.

→ Think of layout mode as a loose composition hint for group shots, not precise Photoshop-style placement. It gives a rough suggestion for fitting multiple subjects into a single image; don't expect coordinate accuracy.
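
When checking how far an output drifts from the requested regions, it helps to convert the relative boxes to pixels. This is a small sketch assuming the [x1, x2, y1, y2] ordering and the string-encoded `layout_bboxes` field shown in the examples above; the helper names are made up for illustration.

```python
import json


def bbox_to_pixels(bbox, width, height):
    """Convert a relative [x1, x2, y1, y2] box to pixel (left, top, right, bottom)."""
    x1, x2, y1, y2 = bbox
    if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
        raise ValueError(f"bbox out of range or inverted: {bbox}")
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


def encode_layout_bboxes(bboxes):
    """Serialize boxes as the compact JSON string carried by layout_bboxes."""
    return json.dumps(bboxes, separators=(",", ":"))


side_by_side = [[0.0, 0.5, 0.1, 0.95], [0.5, 1.0, 0.1, 0.95]]
print(encode_layout_bboxes(side_by_side))
for box in side_by_side:
    print(bbox_to_pixels(box, 2048, 2048))
```

On the 2048×2048 canvas, the L1 boxes come out as (0, 205, 1024, 1946) and (1024, 205, 2048, 1946), which gives a concrete target to compare the generated subjects against.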


Why This Happens — Reading pipeline.py

HiDream's behavior is governed by the generate_image() function in models/pipeline.py. Three structural facts explain everything.

1. More refs = lower per-ref resolution

pipeline.py:198-202:

if K == 1: max_size = max(height, width)         # 2048
elif K == 2: max_size = max(height, width) * 48 // 64   # 1536
elif K <= 4: max_size = max(height, width) // 2  # 1024
elif K <= 8: max_size = max(height, width) * 24 // 64   # 768
else: max_size = max(height, width) // 4         # 512

Feeding 6 refs compresses each to 768px. Thin openpose lines, fine clothing patterns, and facial detail all get crushed. Keeping it to 3–4 refs preserves 1024px and retains that detail.
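
To see the effect of ref count at a glance, the branch quoted from pipeline.py can be wrapped in a function. This is a restatement of the quoted lines for illustration, not the pipeline's own API; the function name and default arguments are assumptions.

```python
def max_ref_size(num_refs, height=2048, width=2048):
    """Longest-side cap applied to each reference image, given K references."""
    longest = max(height, width)
    if num_refs == 1:
        return longest
    if num_refs == 2:
        return longest * 48 // 64
    if num_refs <= 4:
        return longest // 2
    if num_refs <= 8:
        return longest * 24 // 64
    return longest // 4


for k in (1, 2, 3, 4, 6, 9):
    print(k, max_ref_size(k))
```

With the 2048×2048 bucket, 3 or 4 refs give 1024 and 6 refs give 768, so dropping from 6 refs to 4 buys back a third more pixels per side for each remaining ref.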

2. Skeleton mode has no dedicated code path

Looking at pipeline.py:178-275, there is no skeleton-specific branch. Both /generate/skeleton and /generate/ip run through exactly the same multi-ref path:

content = [{"type": "image"} for _ in range(K)]
content.append({"type": "text", "text": caption})
messages = [{"role": "user", "content": content}]

The model receives no role hints indicating which ref is a face, which is an openpose skeleton, and which is clothing. All refs are treated as "K reference images in parallel." If you want roles to matter, you have to say so explicitly in the prompt text.

This is why "prompt beats openpose ref." The openpose ref is processed as "some line-art image among the references," with no explicit signal that it's a pose specification. Meanwhile, a phrase like dynamic dancing pose with both arms raised reaches the model as explicit text describing the pose.
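
Since roles only exist in the prompt text, one way to keep them in sync with the ref order is to generate the numbered clauses from the same list that produces `ref_image_paths`. This is a hedged sketch in the style of pattern E; `build_numbered_prompt` and `build_messages` are hypothetical helpers, and the second one merely mirrors the message layout quoted above.

```python
def build_numbered_prompt(subject, roles, suffix=""):
    """Build a prompt that names each ref by its 1-based position.

    `roles` is an ordered list of (ref_path, clause_template) pairs; each
    template receives the image number through `{n}`.
    """
    refs = [ref for ref, _ in roles]
    clauses = [template.format(n=i) for i, (_, template) in enumerate(roles, start=1)]
    prompt = " ".join([subject] + clauses + ([suffix] if suffix else []))
    return prompt, refs


def build_messages(num_refs, caption):
    """Mirror of the quoted layout: K anonymous image slots, then the text."""
    content = [{"type": "image"} for _ in range(num_refs)]
    content.append({"type": "text", "text": caption})
    return [{"role": "user", "content": content}]


prompt, refs = build_numbered_prompt(
    "Full body try-on photograph.",
    [
        ("face", "Subject: the woman from image {n}."),
        ("openpose", "Pose: identical to the skeleton in image {n}."),
        ("part_1", "Wearing: the gray oversized knit sweater dress shown in image {n}."),
    ],
    suffix="Studio lighting, plain background.",
)
messages = build_messages(len(refs), prompt)
print(prompt)
print(refs)
```

Reordering or removing an entry in `roles` renumbers the clauses automatically, which avoids a prompt that says "image 3" after the third ref has been dropped.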

3. How the shift parameter behaves

shift controls the noise schedule strength of the scheduler. In practice:

  • 1.0 = maximum fidelity to ref composition, zero freedom → try-on only
  • 2.0-2.5 = practical range, allows deviation from refs
  • 3.0+ = near-freeform generation, refs serve only as identity anchors

The README recommends 1.0 for IP/Skeleton/Layout because it assumes the typical try-on / character-consistency use case. If you want to change the pose, swap outfits, or build a new scene that differs from the refs, 2.0+ is required.
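
For intuition about why a larger shift loosens the refs, here is the static timestep shift used by common flow-matching schedulers (for example the one in diffusers' FlowMatchEulerDiscreteScheduler). Whether HiDream-O1-Image applies exactly this formula is an assumption; the scheduler code was not part of what is quoted here, so treat the sketch as an illustration of the general mechanism only.

```python
def shift_sigma(sigma, shift):
    """Static timestep shift: sigma' = shift * sigma / (1 + (shift - 1) * sigma).

    ASSUMPTION: HiDream's `shift` follows this common flow-matching convention.
    """
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(num_steps, shift):
    """Linearly spaced sigmas from 1.0 down to 1/num_steps, then shifted."""
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift_sigma(s, shift) for s in sigmas]


for shift in (1.0, 2.0, 3.0):
    schedule = shifted_schedule(50, shift)
    midpoint = schedule[25]  # the unshifted sigma is 0.5 at this step
    print(f"shift={shift}: sigma at mid-run = {midpoint:.3f}")
```

Under this assumption, shift=1.0 leaves the schedule untouched, while shift=3.0 still has sigma at 0.75 halfway through the run. More steps are spent at high noise, where global composition is decided, which would be consistent with the weaker ref pull observed at higher shift values.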


Best Practices by Use Case (Battle-Tested)

| Goal | Endpoint | Refs | Shift | Notes |
|---|---|---|---|---|
| Faithful try-on matching original scene | /skeleton | 6 (face+bg+pose+3parts) | 1.0 | README default. Strongly faithful to all refs |
| Preserve outfit + natural standing pose | /skeleton | 3-4 (face + clothing, no bg/pose) | 2.0 | Dropping bg ref gives white studio; fewer refs keep each at 768→1024px |
| Dramatic pose change | /skeleton | 3 (no openpose) | 2.5 | Prompt controls motion better than openpose ref |
| Complete outfit swap | /ip | 1 (face only) | 3.0 | Maximum freedom; only face is preserved. Skeleton mode rejects < 2 refs |
| Group shot | /layout | Multiple face refs + rough bboxes | 1.0 | Bboxes are loose composition hints; size hierarchy doesn't work; ref↔bbox order not guaranteed |
| Fine detail optimization | Same config | Same | Same | Run 3–5 seeds and pick best-of-N |
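
The single-subject rows of the table can be kept as presets in code so that endpoint, ref set, and shift always change together. This is an illustrative sketch: the preset keys and the `Preset` class are invented here, the ref names follow the curl examples, and the only rule enforced is the reported minimum of 2 refs for skeleton mode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    endpoint: str
    refs: tuple
    shift: float


# Values transcribed from the table above; the keys are illustrative labels.
PRESETS = {
    "faithful_try_on": Preset(
        "/generate/skeleton", ("face", "bg", "openpose", "part_1", "part_2", "part_3"), 1.0
    ),
    "natural_standing": Preset("/generate/skeleton", ("face", "part_1", "part_2"), 2.0),
    "dramatic_pose": Preset("/generate/skeleton", ("face", "part_1", "part_2"), 2.5),
    "outfit_swap": Preset("/generate/ip", ("face",), 3.0),
}


def validate(preset):
    """Enforce the one reported constraint: skeleton mode needs at least 2 refs."""
    if preset.endpoint.endswith("/skeleton") and len(preset.refs) < 2:
        raise ValueError("skeleton mode rejects fewer than 2 refs; use /generate/ip")
    return preset


for name, preset in PRESETS.items():
    validate(preset)
    print(name, preset.endpoint, len(preset.refs), preset.shift)
```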

Summary

Treating HiDream-O1-Image's skeleton mode as a "try-on simulator" leads to the frustrating feeling that "it won't listen" — with no guardrails to blame. The real cause is pipeline structure: refs lose resolution as count increases, there's no skeleton-specific processing, and shift controls how hard the refs pull.

Practical takeaways:

  • Try-on: 6 refs full + shift 1.0 (README default)
  • Changing the pose: drop openpose ref + verb-describe the pose in prompt + shift 2.5
  • Completely free scene creation: face only + shift 3.0 + /ip endpoint

Layout mode also makes sense once you understand it as "group photo hint" rather than "precise bbox placement."

All assets and commands used in this benchmark come from the HiDream-O1-Image repository's assets/IP_skeleton/ and assets/IP_layout/ directories, so results are fully reproducible. Varying shift and ref count alone produces dramatically different behavior — it's a good sandbox for developing intuition quickly.


Addendum: What Happens When You Change the OpenPose Ref — "Prompt Always Wins" Has Conditions

After publishing, I ran additional tests on what happens with a different-shaped openpose ref, and the original conclusion needed revision.

Modified OpenPose Refs (4 Patterns)

I took the original openpose image (0.openpose.jpg, standing pose) and made two "unnatural" variants, one flipped vertically and one rotated 90 degrees, then paired them with ordinary standing and jumping prompts.

| Modification | Image |
|---|---|
| Vertically flipped (upside-down) | flipped |
| 90° rotated (lying sideways) | rot90 |

| Test | OpenPose Ref | Prompt | Result |
|---|---|---|---|
| O1 baseline | Original (standing) | Standing pose | Standing pose as expected |
| O2 | 🙃 Vertically flipped | Standing pose | Standing pose (openpose ignored, prompt wins) |
| O3 | 🙃 Vertically flipped | Jumping | Both-arms-raised jump (openpose ignored, prompt wins) |
| O4 | ↻ 90° rotated | Standing pose | Standing pose but canvas itself rotated 90°! |

Up to this point the findings were: "The model rejects unnatural refs and falls back to the prompt" and "overall compositional orientation (portrait vs. landscape) can still be influenced by the ref."

But a Dramatic Ref + Pose-Silent Prompt Led to Complete Ref Victory

I generated a "colorful anatomical skeleton with arms spread in a T-shape and one leg raised high in a tree yoga pose" via HiDream's T2I and fed it as a ref:

warrior skeleton ref

Prompt mentions no pose at all — only subject and clothing:

curl -X POST http://localhost:8895/generate/skeleton -d '{
  "prompt": "Full body photograph of a young Asian woman wearing a gray sweater dress, soft natural lighting, white studio background.",
  "ref_image_paths": ["face","SYNTHETIC_WARRIOR_SKELETON","sweater"],
  "shift": 1.0, "seed": 42
}'

Result:

X1 warrior yoga result

The tree yoga pose reproduced perfectly — T-shaped arms and single-leg stance, matching the skeleton ref exactly.

Revised Conclusions (3 Rules)

Synthesizing all 12 patterns, HiDream actually behaves like this:

  1. If the prompt mentions a pose, that takes first priority — prompt wins even when it contradicts the ref.
  2. If the prompt says nothing about the pose, the ref's pose is adopted — the more dramatic the ref, the clearer the transfer.
  3. If the ref appears "unnatural" (upside-down skeleton, etc.), the model defaults to a natural stance — though overall compositional orientation can still bleed through.

So "the openpose ref is basically useless" was an overstatement. More precisely: "when the prompt describes a pose, the ref gets overridden." In the 8-pattern benchmark, the one dynamic result (pattern F) came from a prompt that specified motion, so it looked like the openpose ref was powerless.

Practical Impact

  • To fully control pose via ref: don't mention pose in the prompt + use a dramatic openpose/skeleton ref → ref pose transfers
  • To control pose via prompt: removing the openpose ref is fine (even if you leave it in, the prompt overrides it)
  • When ref and prompt conflict: prompt wins (including the ref doesn't help)

You can effectively choose whether pose comes from the ref or the prompt by whether or not you mention the pose in the prompt. If you want the openpose ref to drive the pose, keep pose description out of the prompt.
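
The revised rules are simple enough to write down as a lookup, which can serve as a checklist before building a request. This only encodes the observations reported above; the function and argument names are invented, and "looks natural" stays a human judgment that the code takes as input.

```python
def expected_pose_source(prompt_mentions_pose, has_pose_ref, ref_looks_natural=True):
    """Return 'prompt', 'ref', or 'default' according to the three revised rules."""
    if prompt_mentions_pose:
        return "prompt"   # rule 1: the prompt wins, even against the ref
    if has_pose_ref and ref_looks_natural:
        return "ref"      # rule 2: a pose-silent prompt adopts the ref's pose
    return "default"      # rule 3, or no pose ref: the model picks a natural stance


print(expected_pose_source(True, True))          # O3-style: prompt describes a jump
print(expected_pose_source(False, True))         # X1-style: pose-silent prompt
print(expected_pose_source(False, True, False))  # unnatural ref, pose-silent prompt
```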

