DreamZero vs Motus
SB Lee · 2026-05-19 · via DEV Community

I. Autoregressive & Bidirectional Unified Generation

DreamZero and Motus are the two most representative works in the field of World Action Models (WAMs). Both conduct joint video + action generation, adopt flow matching, and use chunk-based generation. However, they differ completely in generation paradigms, training mechanisms, alignment methods, and inference logic.

DreamZero adopts autoregressive causal generation (only looks at the past, not the future).

It generates chunk by chunk in chronological order.
It strictly follows causal attention masking.
At each time step, it can only see historical observations, not the future.
It naturally conforms to robots' physical timing logic.
It is suitable for closed-loop control and real-time inference.
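The causal masking described above can be sketched as a chunk-level (block-causal) attention mask. This is a minimal illustration, not DreamZero's actual implementation: the function name, the chunk sizes, and the choice to let tokens inside one chunk see each other are all assumptions made for the example.

```python
import numpy as np

def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> np.ndarray:
    """Boolean mask: mask[i, j] is True if query token i may attend to key token j."""
    # Assign every token the index of the chunk it belongs to.
    chunk_id = np.repeat(np.arange(num_chunks), tokens_per_chunk)
    # Assumption: a token sees its own chunk and all earlier chunks, never later ones.
    return chunk_id[:, None] >= chunk_id[None, :]

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
print(mask.astype(int))
```

Every row of the printed matrix has zeros to the right of its own chunk, which is the "past only" rule in matrix form.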

Motus adopts a flexible unified framework (capable of looking at the future, non-causal).

Motus is built on a UniDiffuser-style unified multimodal framework with a bidirectional, non-causal underlying architecture, and supports switching among 5 task modes.
These include: VLA (action-only generation), WM (world model, action-conditioned video prediction), IDM (inverse dynamics, inferring actions from video), VGM (video generation model, video-only generation), and Video-Action Joint generation.
In video-action joint generation mode, Motus adopts a bidirectional non-causal architecture to generate the entire future video chunk and action chunk in one go, achieving temporal alignment via video sparsification and action densification.
In VLA mode, Motus outputs only actions; the video side stays in a pure noise state and no video is generated.
In this mode there is no bidirectional global modeling, because the video side contains no content to interact with.
Inference relies on the interaction between the action expert and the understanding expert, rather than on full tri-modal joint attention.

Tips: Architecturally, Motus is bidirectional and non-causal.
Functionally, Motus is a flexible unified framework.
Motus is not non-causal in every case. Its default architecture is bidirectional, inherited from the bidirectional, non-causal UniDiffuser, but causality can be imposed via masking.

II. Chunk-Based Generation Mechanism

Both use chunks, but their logics are entirely different.

DreamZero: Autoregressive Chunk Generation

Each chunk depends on the output of the previous chunk.
Training uses chunk-wise teacher forcing.
Inference uses KV caching, computing only the current chunk.
Each chunk corresponds to a fixed duration of action.
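The chunk-by-chunk loop above can be sketched as follows. It is a toy under stated assumptions: `predict_next` is a hypothetical stand-in for the model, and a plain Python list plays the role of the KV cache (a real cache stores attention keys and values, not raw chunks).

```python
from typing import Callable, List

Chunk = List[float]

def generate_autoregressive(
    predict_next: Callable[[List[Chunk]], Chunk],
    history: List[Chunk],
    num_chunks: int,
) -> List[Chunk]:
    """Generate chunks one at a time, each conditioned only on earlier chunks."""
    cache = list(history)  # stand-in for the KV cache of past chunks
    outputs = []
    for _ in range(num_chunks):
        next_chunk = predict_next(cache)  # only the past is visible here
        outputs.append(next_chunk)
        cache.append(next_chunk)  # the new chunk becomes context for the next step
    return outputs

def toy_predictor(cache: List[Chunk]) -> Chunk:
    # Hypothetical model: shift the most recent chunk by one unit.
    return [x + 1.0 for x in cache[-1]]

chunks = generate_autoregressive(toy_predictor, history=[[0.0, 0.0]], num_chunks=3)
print(chunks)  # [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
```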

Motus: Global Bidirectional Chunk Modeling, Multi-Step Denoising Generation

During training: Temporal relationships across all chunks are modeled at once with Tri-modal Joint Attention. All frames and action tokens are mutually visible (bidirectional and non-causal), conditioned on real historical observations rather than on autoregressively generated results.
During inference: Multi-step iterative denoising is performed based on Rectified Flow (as shown in Algorithm 6, where τ gradually decreases from Tτ to 1). The entire future sequence is generated jointly rather than chunk by chunk, but producing it takes multiple denoising steps, not a single forward pass.
No autoregressive KV caching: Because the architecture is non-causal, there is no need to maintain a historical KV cache as DreamZero does. However, diffusion denoising requires multiple steps, and the cost of each step is dominated by global attention over the full sequence.
Video downsampling alignment is required: Long sequences are generated in one pass, so the video frame rate must match the action frequency (e.g., 1:6 sparsification) to avoid token imbalance.
Suitable for unified modeling, not long-horizon closed-loop tasks.
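The multi-step denoising mentioned above can be illustrated with a plain Euler integrator over a rectified-flow path. This is a sketch, not Motus's sampler: the time convention (t = 1 is noise, t = 0 is data), the step count, and the `oracle_velocity` field (which cheats by knowing the target) are assumptions chosen so the example is self-checking.

```python
import numpy as np

def euler_denoise(velocity, x_noise: np.ndarray, num_steps: int) -> np.ndarray:
    """Integrate dx/dt = velocity(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        # Every step updates the whole sequence at once; nothing is generated chunk by chunk.
        x = x - dt * velocity(x, t)
    return x

target = np.array([0.5, -1.0, 2.0])  # hypothetical "clean" sequence

def oracle_velocity(x: np.ndarray, t: float) -> np.ndarray:
    # On the straight path x_t = (1 - t) * target + t * noise,
    # the velocity is noise - target, which equals (x_t - target) / t.
    return (x - target) / t

rng = np.random.default_rng(0)
sample = euler_denoise(oracle_velocity, rng.standard_normal(3), num_steps=10)
print(np.allclose(sample, target))
```

A learned velocity network would replace `oracle_velocity`; the loop structure, several full-sequence passes per generated sequence, stays the same.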

How Motus Achieves Global Temporal Dependencies

Motus’s global temporal dependencies are not generated step-by-step in chronological order like DreamZero, which only looks at the past. Instead:
The entire temporal sequence of "past, present, future" is fed into the model at once.
It adopts Tri-modal Joint Attention, concatenating the QKV of three experts in the attention layer to achieve cross-modal information fusion.
Each expert retains independent feedforward networks and normalization layers. All frames and action tokens are mutually visible and influential, enabling unified modeling of the entire temporal sequence.
It adopts UniDiffuser-style global bidirectional temporal modeling. During training and inference, it processes:
[Initial frames, all future frames to be generated, all future actions to be generated] into a long token sequence, which is fed into the Transformer.
For example: [Historical Frames], [Future Frame 1], [Future Frame 2], [Future Frame 3], [Action 1], [Action 2], [Action 3], [Action 4]
It is not generated frame-by-frame or chunk-by-chunk; instead, the entire temporal sequence is input and modeled at once.
Motus uses Tri-modal Joint Attention with a simple rule: all temporal positions are mutually visible, with no causal restrictions.

  • Historical Frames → Can see future frames
  • Future Frames → Can see historical frames
  • Actions → Can see all video frames (past + future)
  • Video Frames → Can see all actions

It does not use lower triangular causal masking, which is completely opposite to GPT and DreamZero.
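The joint attention described above (per-expert projections, concatenated QKV, no mask) can be sketched in NumPy. This is a single-head toy under assumptions: the token counts, the width `d`, and the random weights are hypothetical, and the per-expert feedforward and normalization layers are omitted.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(tokens_per_expert, weights):
    """tokens_per_expert: list of (n_i, d) arrays; weights: one (Wq, Wk, Wv) tuple per expert."""
    qs, ks, vs = [], [], []
    for x, (wq, wk, wv) in zip(tokens_per_expert, weights):
        # Each expert keeps its own projection matrices.
        qs.append(x @ wq)
        ks.append(x @ wk)
        vs.append(x @ wv)
    # Concatenate along the sequence axis so every token can attend to every other token.
    q, k, v = (np.concatenate(parts, axis=0) for parts in (qs, ks, vs))
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v  # no causal mask is applied
    sizes = [x.shape[0] for x in tokens_per_expert]
    # Hand each expert back its own slice of the fused output.
    return np.split(out, np.cumsum(sizes)[:-1], axis=0)

rng = np.random.default_rng(0)
d = 8
sizes = {"video": 4, "understanding": 3, "action": 5}
tokens = [rng.standard_normal((n, d)) for n in sizes.values()]
weights = [tuple(rng.standard_normal((d, d)) for _ in range(3)) for _ in sizes]
video_out, understanding_out, action_out = joint_attention(tokens, weights)
print(video_out.shape, understanding_out.shape, action_out.shape)
```

Because nothing is masked, changing any action token changes the video outputs as well, which is the mutual visibility the list above describes.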
Motus adopts a deliberately designed Video-Sparse, Action-Dense strategy, setting the video frame rate to 1/6 of the action frequency. This evens out the token counts of video and actions, preventing the model from overfitting to video generation at the expense of action prediction.
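A quick count shows why the 1:6 ratio matters. The numbers below are hypothetical (48 action steps per chunk, 64 tokens per video frame, one token per action step); only the ratio comes from the text.

```python
def chunk_token_counts(action_steps: int, video_to_action_ratio: int, tokens_per_frame: int):
    """Return (video_tokens, action_tokens) for one chunk."""
    # One video frame is kept for every `video_to_action_ratio` action steps.
    video_frames = action_steps // video_to_action_ratio
    # Assumption: each action step contributes exactly one token.
    return video_frames * tokens_per_frame, action_steps

dense = chunk_token_counts(action_steps=48, video_to_action_ratio=1, tokens_per_frame=64)
sparse = chunk_token_counts(action_steps=48, video_to_action_ratio=6, tokens_per_frame=64)
print(dense, sparse)  # (3072, 48) (512, 48)
```

Under these assumed sizes, video tokens still outnumber action tokens after sparsification, but by a factor of about 11 instead of 64.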

Regarding Downsampling

Whether video downsampling is required is not a manually chosen strategy but a natural consequence of the model’s generation paradigm.
Bidirectional and autoregressive models differ fundamentally in this characteristic, which directly affects the authenticity of video timing and the accuracy of action alignment.
Bidirectional generation models (such as Motus, UniDiffuser, and mainstream video generation architectures) adopt a global synchronous generation mechanism. The model processes the complete video sequence in one pass, with all frames mutually visible and influencing each other.
Past frames can attend to future frames, and future frames can act backward on past frames, forming a globally unified temporal representation.
To achieve full-sequence joint modeling, the model must use fixed-length video chunks during both training and inference, because the global attention mechanism operates on a predefined sequence length and builds an attention matrix of fixed size.
If the sequence length is not fixed, the size of the attention map changes dynamically, making training unstable.
Meanwhile, the bidirectional diffusion denoising process requires synchronous noise addition and denoising for the entire video. Temporal positional encoding, noise scheduling, and sampling steps all depend on a unified sequence length.
Thus, bidirectional models can only generate fixed lengths set during training when starting generation at any time point. When task duration does not match the model’s supported length, video downsampling or stretching is required, which inevitably distorts the native frame rate and temporal structure.
In contrast, autoregressive models follow strict causal generation. The model generates future content chunk-by-chunk based solely on historical observations and continuously records historical information via KV caching.
It does not need to see the complete sequence in advance or construct a global attention map in one pass. Thus, it can keep extending the video chunk by chunk, within the limits of cache capacity, without downsampling or distorting the original frame rate.
This characteristic enables autoregressive architectures to maintain the true temporal relationship of videos, ensuring precise natural alignment between video frames and actions, and providing a more reliable temporal foundation for robot closed-loop control.
Bidirectional models must use fixed lengths and rely on downsampling, with inference extendable via sliding windows. Autoregressive models support longer contexts via KV caching but are also limited by cache capacity.
DreamZero’s advantage lies in preserving the native frame rate (no downsampling alignment), not in supporting arbitrary lengths.
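The frame-rate distortion caused by fitting a clip into a fixed length can be made concrete with a small resampling sketch. The clip length, source frame rate, and model length below are hypothetical values chosen for illustration.

```python
import numpy as np

def resample_indices(num_source_frames: int, fixed_length: int) -> np.ndarray:
    """Pick `fixed_length` frame indices spread evenly over the source clip."""
    return np.round(np.linspace(0, num_source_frames - 1, fixed_length)).astype(int)

source_fps = 30.0
num_source_frames = 90   # a 3-second clip at 30 fps (assumed)
fixed_length = 16        # assumed length the bidirectional model was trained on

idx = resample_indices(num_source_frames, fixed_length)
# The same wall-clock duration is now covered by fewer frame intervals.
effective_fps = source_fps * (fixed_length - 1) / (num_source_frames - 1)
print(idx)
print(round(effective_fps, 2))
```

The effective frame rate drops to roughly a sixth of the original here, so any action sequence paired with these frames has to be realigned to the new timing.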

III. Video + Action Joint Generation

DreamZero: Single Model, Single DiT, Joint Denoising

Video and action share the same DiT backbone.
Same layers, same attention, same temporal sequence.
Video and action share denoising time steps.
Jointly optimized with the same flow matching loss.
Video guides action, action aligns with video, deeply coupled.
Minimal architecture, no expert separation.
The mechanism for video-guided action adopts implicit inverse dynamics.
Its action prediction relies entirely on the shared attention mechanism, learning actions end-to-end from generated videos with higher physical alignment.

Motus: Mixture-of-Transformers (MoT), Tri-Modal Joint Attention

Three independent experts: video expert (Wan2.2), understanding expert (Qwen-VL), action expert (trained separately).
Tri-modal joint attention in every layer.
Separate temporal order and noise scheduling for video and action.
Unified under the UniDiffuser framework.
Supports 5 mode switches, with looser modal coupling.
The mechanism for video-guided action uses explicit inverse dynamics mode.
The model first generates complete future videos, then infers actions accordingly. This step-by-step logic may be more stable and interpretable for certain tasks.

IV. Differences in Flow Matching

DreamZero: Joint Flow Matching + Teacher Forcing

Video and action share time steps.
Inputs real historical frames during training to avoid error accumulation.
Centered on video prediction; actions derived from visual futures.
Chunk-wise teacher forcing (inputs real history chunk-by-chunk to predict the next chunk).

Motus: Rectified Flow + Decoupled Time Steps

Separate time steps for video and action.
Aims for unified modeling.
Supports multiple task mode switches.
Global one-shot teacher forcing (inputs real initial frames + history, predicts the entire future in one pass).
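The difference between shared and decoupled time steps can be sketched at the point where training inputs are noised. This is an illustrative toy, not either paper's training code: the straight-line interpolation, the uniform time-step sampling, and all shapes are assumptions.

```python
import numpy as np

def interpolate(data: np.ndarray, noise: np.ndarray, t: float) -> np.ndarray:
    # Straight-line path between data (t = 0) and noise (t = 1).
    return (1.0 - t) * data + t * noise

def make_training_inputs(video, action, rng, shared_timestep: bool):
    """Noise video and action with one shared time step or two independent ones."""
    t_video = rng.uniform()
    # Shared: action reuses the video time step. Decoupled: action draws its own.
    t_action = t_video if shared_timestep else rng.uniform()
    noisy_video = interpolate(video, rng.standard_normal(video.shape), t_video)
    noisy_action = interpolate(action, rng.standard_normal(action.shape), t_action)
    return noisy_video, noisy_action, t_video, t_action

rng = np.random.default_rng(0)
video = np.zeros((4, 3))    # hypothetical (frames, features)
action = np.zeros((8, 2))   # hypothetical (steps, action dims)
_, _, tv_shared, ta_shared = make_training_inputs(video, action, rng, shared_timestep=True)
_, _, tv_split, ta_split = make_training_inputs(video, action, rng, shared_timestep=False)
print(tv_shared == ta_shared, tv_split == ta_split)
```

With decoupled steps, one modality can be held at pure noise (t = 1) while the other is denoised, which is what makes mode switching such as action-only generation possible.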

V. How Video Guides Action Generation

DreamZero: Vision-First → Implicit Inverse Dynamics

First predicts future video (visual planning).
Infers actions from video via shared attention, end-to-end learning.
Actions strictly follow video planning; video quality directly determines action quality.

Motus: Dual-Path Collaboration → Explicit Inverse Dynamics + Joint Generation

Path 1 (IDM Mode): First generates complete future video, then explicitly infers actions.
Path 2 (Video-Action Joint Mode): Video and action generated simultaneously, mutually constrained.
Video and action are collaborative, with no mandatory video priority → more flexible, but weaker physical alignment than DreamZero.

VI. Inference Mechanism

DreamZero: Closed-Loop Real-Time Inference

After each execution step, replace predicted frames in KV cache with real observations.
Completely eliminates error accumulation.
Explicitly optimized for 7 Hz asynchronous closed-loop control, with a detailed mechanism for eliminating accumulated error.
Fast inference, deployable.
Designed for closed-loop control; preventing error accumulation is one of its core design goals, naturally suitable for long-horizon tasks.

Motus: Open-Loop Generation

Generates all video + action in one pass.
No ground truth injection, no closed loop.
Errors accumulate.
Motus’s flexible unified framework may incur additional inference overhead (slower generation), but the paper reports no specific latency figures.
Architecturally, it focuses more on a unified generative framework. Its inference method is closer to one-shot open-loop planning.
While short-term tasks can be handled via short-term prediction with open-loop execution, long-horizon tasks requiring long-term visual feedback are theoretically more prone to error accumulation.
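The contrast between the two inference styles can be shown with a toy one-dimensional system. Everything here is hypothetical: the "model" is the true dynamics plus a constant bias, and closed-loop operation simply overwrites the model's context with the real observation after each step.

```python
def rollout(steps: int, bias: float, closed_loop: bool):
    """Return the absolute prediction error at each step for a toy 1-D system."""
    true_state = 0.0
    context = 0.0  # what the model conditions on
    errors = []
    for _ in range(steps):
        # Imperfect model: the true dynamics add exactly 1.0 per step.
        prediction = context + 1.0 + bias
        true_state += 1.0
        errors.append(abs(prediction - true_state))
        # Closed loop: replace the prediction with the real observation.
        # Open loop: keep conditioning on the model's own output.
        context = true_state if closed_loop else prediction
    return errors

open_errors = rollout(steps=5, bias=0.1, closed_loop=False)
closed_errors = rollout(steps=5, bias=0.1, closed_loop=True)
print([round(e, 2) for e in open_errors])    # [0.1, 0.2, 0.3, 0.4, 0.5]
print([round(e, 2) for e in closed_errors])  # [0.1, 0.1, 0.1, 0.1, 0.1]
```

The per-step model error is identical in both runs; only the open-loop rollout lets it compound.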

VII. Training & Deployment Difficulty

DreamZero

Simple architecture, single DiT, easy to reproduce.
Strict requirements for timing and alignment.
Stable training, runs on real robots.
Requires high compute power, with mature optimizations.
DreamZero has mature system-level optimizations (asynchronous execution, quantization, DiT caching), achieving real-time control with 150 ms latency.

Motus

Complex architecture, three experts, difficult integration.
Multi-stage training pipeline, data pyramid.
Deployment optimization details are sparsely documented in the paper.

VIII. Pre-Training Data Source & Scale

In terms of video backbone pre-training, both models leverage internet-scale video data to acquire basic spatiotemporal priors. DreamZero uses Wan2.1 as the backbone, while Motus initializes based on the newer Wan2.2 5B model.
Despite sharing the same origin, their data strategies during robot fine-tuning follow completely different technical paths.
DreamZero adopts a highly focused real robot data strategy, using approximately 500 hours of real-world teleoperation data, all from the AgiBot G1 dual-arm robot, collected across 22 different environments with a strict "task diversity over repetition" principle.
Its data strategy explicitly caps the number of demonstrations per task, retiring a task once 50 demonstrations have been collected, to force a broader task distribution and enhance zero-shot generalization.
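A per-task cap of this kind is simple to enforce at collection time. The sketch below assumes the cap means that further demonstrations of a full task are no longer accepted; the class name and task names are hypothetical.

```python
from collections import Counter

class DemoCollector:
    """Accept demonstrations until a task reaches its cap, then reject that task."""

    def __init__(self, max_per_task: int = 50):
        self.max_per_task = max_per_task
        self.counts = Counter()
        self.demos = []

    def add(self, task: str, demo) -> bool:
        if self.counts[task] >= self.max_per_task:
            return False  # task is full; collection effort should move elsewhere
        self.counts[task] += 1
        self.demos.append((task, demo))
        return True

collector = DemoCollector(max_per_task=50)
accepted = sum(collector.add("fold towel", i) for i in range(60))
collector.add("open drawer", 0)
print(accepted, dict(collector.counts))  # 50 {'fold towel': 50, 'open drawer': 1}
```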
For cross-embodiment transfer, DreamZero uses very little video-only data, including 20 minutes of unlabeled video from the YAM robot and 12 minutes of human-view video, which significantly improves performance on unseen tasks.
In contrast, Motus adopts a larger six-layer data pyramid structure, forming a full-level, large-scale heterogeneous training system spanning web videos, human videos, simulation data, task-agnostic data, multi-robot mixed data, and target robot data.
Its cross-embodiment and generalization capabilities rely on large-scale heterogeneous data from diverse sources such as Egodex, AgiBot, RDT, RoboMind, and RoboTwin.
Motus specifically sets Level 4: Task-Agnostic Data in its six-layer data pyramid.
This data is generated by random sampling in the target robot’s action space, independent of specific tasks, to align latent action space with real robot control signals.
This design is critical for the stability of latent action learning, while DreamZero’s data strategy does not emphasize such alignment data.
Overall, DreamZero follows a small, refined, real-data-first, diversity-focused data path, better aligned with the low-data-cost requirements of real robot deployment.
Motus follows a large-scale, multi-source, pyramid-style data path, relying on massive heterogeneous data for capability coverage. The two have fundamentally different data philosophies and deployment approaches.

IX. Action Representation

DreamZero uses the robot’s native joint space as action representation, directly outputting relative joint positions corresponding one-to-one with the robot body, typically around 14 dimensions.
This representation requires no additional spatial mapping, with the action space fully aligned with the real robot control space.
The model learns via inverse dynamics, directly mapping video predictions to executable joint control signals. Action decoding is simple and direct, highly aligned with real robot physical execution logic.
A minor detail: DreamZero’s joint space refers to relative joint positions, not raw joint angles.
This suggests that, even while targeting physical execution alignment, it normalizes actions to some degree so they can adapt to different robot bodies.
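One plausible reading of "relative joint positions" is that each target in an action chunk is expressed as an offset from the joint state at the start of the chunk. The sketch below follows that reading as an assumption; the 14-dimensional state and the 4-step horizon are illustrative, and DreamZero's actual encoding may differ.

```python
import numpy as np

def to_relative(current: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Express absolute joint targets as offsets from the current joint state."""
    return targets - current

def to_absolute(current: np.ndarray, relative: np.ndarray) -> np.ndarray:
    """Recover absolute joint targets that a robot controller can execute."""
    return current + relative

rng = np.random.default_rng(0)
current = rng.uniform(-1.0, 1.0, size=14)      # hypothetical 14-DoF joint state
targets = current + rng.uniform(-0.05, 0.05, size=(4, 14))  # 4-step action chunk

relative = to_relative(current, targets)
recovered = to_absolute(current, relative)
print(relative.shape, np.allclose(recovered, targets))
```

Offsets of this kind stay in a small range regardless of where the arm currently is, which is one reason relative encodings are easier to normalize than raw angles.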
In contrast, Motus uses latent action as a unified representation.
It first extracts pixel-level motion information via optical flow, then compresses it into a fixed-dimensional latent action representation via DC-AE.
This optical-flow-based action space is independent of specific robot morphology, naturally enabling cross-embodiment unified expression.
During execution, Motus decodes latent actions into corresponding robot control signals.
In short, DreamZero learns inverse dynamics for specific robots, focusing on precise execution.
Motus learns general motion priors, focusing on cross-domain generalization.
The two represent two distinct technical paths: physical execution alignment versus general motion abstraction.