Meet Qwen-RobotSuite: Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation
https://www.facebook.com/MarkTechPost/ · 2026-06-17 · via MarkTechPost

The Qwen team has released three embodied AI models, grouped as Qwen-Robot-Suite. The three are Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav. Each is built on a Qwen vision-language backbone and targets a different robotics problem.

Qwen-RobotManip is a Vision-Language-Action model for manipulation, built on Qwen3.5-4B. Qwen-RobotWorld is a language-conditioned video world model with a 60-layer MMDiT and a frozen Qwen2.5-VL encoder. Qwen-RobotNav is a navigation model built on Qwen3-VL, available at 2B, 4B, and 8B sizes.

Qwen-Robot-Suite

Qwen-Robot-Suite is not a single model. It is a suite of three independent foundation models. Two of them, RobotManip and RobotNav, ship with public GitHub repositories.

Robotics data is fragmented across hardware and tasks. Different robots use incompatible observation and action formats. A policy trained on one arm rarely transfers to another.

The three research reports address this fragmentation in different ways. RobotManip aligns action representations so manipulation data scales. RobotWorld uses language as a unified action interface for video prediction. RobotNav exposes a controllable observation interface for navigation tasks.

Here is the core split between the three releases:

| Model | Problem | Backbone | Output |
|---|---|---|---|
| Qwen-RobotManip | Robotic manipulation | Qwen3.5-4B (Qwen-VL) | Continuous robot actions |
| Qwen-RobotWorld | Embodied world modeling | Frozen Qwen2.5-VL | Predicted future video |
| Qwen-RobotNav | Mobile navigation | Qwen3-VL (2B/4B/8B) | Waypoint trajectories |

Qwen-RobotManip: Alignment Unlocks Scale for Manipulation

Qwen-RobotManip is a Vision-Language-Action (VLA) foundation model. It is built on Qwen-VL and predicts continuous robot actions.

A VLA model takes camera views and a language instruction. It then outputs low-level robot actions. The challenge is that manipulation data is heterogeneous by nature.

Different robots record states and actions in incompatible formats. When demonstrations arrive with mismatched representations, scaling data produces interference. RobotManip solves this with a unified alignment framework.

The Unified Alignment Framework

The framework has three complementary mechanisms. First is a canonical state-action representation. It is an 80-dimensional vector with per-dimension binary masking.

This vector holds two 29-dimensional per-arm blocks plus 22 reserved dimensions. Each block stores joint positions, end-effector pose, gripper state, and dexterous hand joints. Robots populate only the dimensions they have.

Second is a camera-frame delta pose parameterization. End-effector actions are expressed as deltas in the camera frame. This makes visually similar motions numerically proximate across embodiments.

Third is an in-context policy adaptation mechanism. It reads recent execution history as an implicit embodiment identifier. The policy adjusts behavior at deployment time without parameter updates.
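The second mechanism, the camera-frame delta pose, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the report's implementation: it handles only the translation part of the delta, and the helper names (`rot_z`, `camera_frame_delta`) and the two robot setups are hypothetical. It shows why the same visual motion, logged differently in two robot base frames, becomes numerically identical once expressed in the camera frame.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def camera_frame_delta(R_base_cam, delta_base):
    """Re-express a base-frame translation delta in the camera frame.

    R_base_cam maps camera-frame vectors into the robot base frame, so its
    transpose (the inverse of a rotation) maps base-frame vectors back.
    """
    return matvec(transpose(R_base_cam), delta_base)

# Hypothetical setup: one physical motion (5 cm along the camera's x-axis)
# seen on two robots whose base frames are rotated differently.
motion_in_camera = [0.05, 0.0, 0.0]
R_a = rot_z(math.radians(30))
R_b = rot_z(math.radians(-120))
delta_a = matvec(R_a, motion_in_camera)  # what robot A logs in its base frame
delta_b = matvec(R_b, motion_in_camera)  # what robot B logs in its base frame

cam_a = camera_frame_delta(R_a, delta_a)
cam_b = camera_frame_delta(R_b, delta_b)
print(delta_a, delta_b)  # differ across embodiments
print(cam_a, cam_b)      # agree up to floating-point error
```

A full pose delta would also carry a rotation component, which this sketch leaves out.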

A dual-stream co-training strategy runs alongside this. It jointly optimizes manipulation data and a vision-language stream. This prevents the backbone’s perception and reasoning from eroding.

The Data Engine

RobotManip assembles roughly 38,100 hours of manipulation data. It uses only open-source datasets and human videos. No proprietary data collection was used.

A human-to-robot synthesis pipeline produces most of this scale. It converts egocentric hand demonstrations into robot trajectories. The pipeline renders across 15 robot platforms.

This synthesis alone yields about 24,808 hours of demonstrations. The egocentric source data is about 1,933 hours. Open-source robot datasets contribute over 11,000 hours.
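A quick arithmetic check shows how these figures fit together. The sketch assumes the three sources are disjoint and sum to the stated total, which the report summary does not confirm; the derived numbers are therefore estimates, not reported values.

```python
TOTAL_HOURS = 38_100        # "roughly 38,100 hours" (reported, approximate)
SYNTHESIZED_HOURS = 24_808  # human-to-robot synthesis (reported)
EGOCENTRIC_HOURS = 1_933    # egocentric source video (reported)

# Derived under the disjoint-sources assumption, not a reported figure.
implied_robot_hours = TOTAL_HOURS - SYNTHESIZED_HOURS - EGOCENTRIC_HOURS
synth_share = SYNTHESIZED_HOURS / TOTAL_HOURS

print(implied_robot_hours)   # 11359, consistent with "over 11,000 hours"
print(f"{synth_share:.1%}")  # 65.1%
```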

The pipeline separates action alignment from visual alignment. Action alignment retargets hand keypoints to gripper poses. Visual alignment uses SAM3 masking, ProPainter inpainting, and MuJoCo inverse kinematics.
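To make the action-alignment step concrete, here is one common way to retarget a hand to a parallel gripper. It is an assumption for illustration only: the report says hand keypoints are retargeted to gripper poses but this summary does not give the rule, and `retarget_pinch_to_gripper` and the 8 cm stroke are hypothetical. Orientation is omitted.

```python
import math

def retarget_pinch_to_gripper(thumb_tip, index_tip, max_width=0.08):
    """Map two fingertip keypoints (metres) to a parallel-gripper target.

    Returns (center_xyz, width): the midpoint between the fingertips and
    their distance, clipped to the gripper's maximum opening.
    """
    center = tuple((t + i) / 2.0 for t, i in zip(thumb_tip, index_tip))
    opening = math.dist(thumb_tip, index_tip)
    return center, min(opening, max_width)

# Thumb and index fingertips 6 cm apart along y.
center, width = retarget_pinch_to_gripper((0.10, 0.00, 0.30), (0.10, 0.06, 0.30))
print(center, width)
```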

A five-stage curation pipeline then filters the combined corpus. It catches sudden changes, temporal misalignment, and extreme values. One check found 81% of episodes in a subset failed state-action alignment.
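The three failure types named above can be expressed as simple per-episode checks. This is a hedged sketch, not the report's five-stage pipeline: the function names, the thresholds, and the assumption that actions are absolute position targets (so `actions[t]` should approximate `states[t + 1]`) are all illustrative.

```python
def sudden_change(values, max_step):
    """True if any consecutive pair of samples jumps by more than max_step."""
    return any(abs(b - a) > max_step for a, b in zip(values, values[1:]))

def extreme_value(values, lo, hi):
    """True if any sample falls outside the plausible range [lo, hi]."""
    return any(v < lo or v > hi for v in values)

def misaligned(states, actions, tol):
    """True if a commanded action does not match the next recorded state."""
    return any(abs(a - s_next) > tol for a, s_next in zip(actions, states[1:]))

def keep_episode(states, actions, max_step=0.5, lo=-3.14, hi=3.14, tol=0.1):
    """Keep an episode only if it passes all three checks."""
    return not (
        sudden_change(states, max_step)
        or extreme_value(states, lo, hi)
        or misaligned(states, actions, tol)
    )

# One joint's trajectory; actions in the second episode run ahead of the states.
clean = keep_episode([0.0, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
shifted = keep_episode([0.0, 0.1, 0.2, 0.3], [0.3, 0.4, 0.5])
print(clean, shifted)
```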

Benchmark Results

The research report argues standard benchmarks fail to measure generalization. Models without robot pretraining match pretrained ones on in-distribution tests. RobotManip therefore focuses on out-of-distribution (OOD) settings.

| Benchmark (OOD) | Prev. SOTA (π0.5) | Qwen-RobotManip |
|---|---|---|
| LIBERO-Plus | 84.4 | 91.4 |
| RoboTwin-C2R Hard | 47.9 | 69.4 |
| EBench | 27.1 | 45.6 |
| RoboCasa365 | 16.9 | 35.9 |
| RoboTwin-IF | 49.6 | 72.2 |

The largest reported gap is on cross-embodiment transfer. RobotManip reaches 23.9% using camera-frame EEF actions. That is 3.2× the 7.5% achieved by π0.5.
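The gaps can be recomputed directly from the reported scores. The snippet below only restates numbers from this article; the variable names are illustrative.

```python
# benchmark: (π0.5, Qwen-RobotManip), as reported for the OOD settings
ood_scores = {
    "LIBERO-Plus": (84.4, 91.4),
    "RoboTwin-C2R Hard": (47.9, 69.4),
    "EBench": (27.1, 45.6),
    "RoboCasa365": (16.9, 35.9),
    "RoboTwin-IF": (49.6, 72.2),
}

gains = {name: ours - base for name, (base, ours) in ood_scores.items()}
for name, (base, ours) in ood_scores.items():
    print(f"{name:18s} +{gains[name]:.1f} points ({ours / base:.2f}x)")

# Cross-embodiment transfer: 23.9% vs 7.5%
cross_embodiment_ratio = 23.9 / 7.5
print(f"{cross_embodiment_ratio:.1f}x")  # 3.2x
```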

The model also ranks 1st on the RoboChallenge Table30-v1 generalist track, with a 20% relative improvement over the prior best. Real-robot validation covers AgileX ALOHA, Franka, UR, and ARX platforms.

Qwen-RobotWorld: Language as a Universal Action Interface

Qwen-RobotWorld is a language-conditioned video world model. It predicts future visual trajectories from a current observation. Natural language serves as the unified action interface.

A world model learns environment dynamics. Given a current state and an action, it predicts the next state. RobotWorld represents states as video frames and actions as text.

This is important because language is embodiment-agnostic. One instruction encodes the action sequence, goal, and constraints. It works across a Franka gripper, an Aloha dual-arm system, or a humanoid.

The Double-Stream MMDiT Architecture

The model uses a 60-layer double-stream Multimodal Diffusion Transformer. An understanding stream processes a frozen Qwen2.5-VL encoder’s features. A generation stream processes video-VAE latents.

The two streams interact via joint attention at every layer. Using an MLLM as the action encoder gives two advantages. It parses compositional instructions and constrains physically plausible transitions.
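Joint attention between two streams can be illustrated with a toy example. This sketch follows the general double-stream pattern used in MMDiT-style models (separate projections per stream, attention over the concatenated sequence, then a split back) and is not the Qwen-RobotWorld code. The scalar "projections" and tiny feature vectors are stand-ins chosen for readability.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scale(tokens, w):
    """Stand-in for a learned linear projection: multiply every feature by w."""
    return [[w * x for x in tok] for tok in tokens]

def joint_attention(und_tokens, gen_tokens, und_w=1.0, gen_w=0.5):
    """One joint-attention step over two token streams.

    Each stream keeps its own projection (a single scalar here), but every
    query attends over the concatenation of both streams.
    """
    q = scale(und_tokens, und_w) + scale(gen_tokens, gen_w)
    k, v = q, q  # toy choice: reuse one projection for Q, K, and V
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v)) for c in range(d)])
    n_und = len(und_tokens)
    return out[:n_und], out[n_und:]  # split back into the two streams

text_feats = [[1.0, 0.0], [0.0, 1.0]]                  # understanding stream
video_latents = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]   # generation stream
und_out, gen_out = joint_attention(text_feats, video_latents)
print(len(und_out), len(gen_out))  # 2 3
```

Because attention runs over the joint sequence, each video latent's update depends on the text features and vice versa, while the two streams keep separate weights.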

The MMDiT has 20B parameters. The VAE adopts the Wan-VAE architecture. The context length supports up to 48,360 video tokens.

A Scene2Robot mechanism reuses this backbone for cross-embodiment synthesis. It processes scene, robot reference, and generation segments together. This enables human-to-robot video transfer without robot-specific prompting.

The Embodied World Knowledge Dataset

Training uses the Embodied World Knowledge (EWK) dataset. It contains roughly 8.6M video-text pairs. That spans over 200M observation frames.

The corpus covers four embodied domains plus general video. Manipulation provides about 5.9M samples across 20+ morphologies. Driving, navigation, and human-to-robot transfer fill out the rest.

An action-language mapping framework standardizes everything. It converts 20+ embodiment types and 500+ action categories into language. A hierarchical five-layer annotation pipeline produces the captions.

Benchmark Results

RobotWorld was evaluated on four established benchmarks. It ranks 1st overall on two of them:

| Benchmark | Result | Ranking |
|---|---|---|
| EWMBench | 4.60 | 1st overall |
| DreamGen Bench | 4.952 | 1st overall |
| WorldModelBench | 8.99 | 1st open-source (3rd overall) |
| PBench | 0.804 | 1st open-source |

On EWMBench it leads motion fidelity with an HSD of 0.566. That is a 33% gain over the runner-up. Scene consistency reaches 0.914.

On WorldModelBench it scores 1.00 on four physics-adherence categories. These are Newton’s laws, mass conservation, fluid dynamics, and gravity. Penetration scores 0.94, and instruction following scores 2.33 out of 3.0.

Qwen-RobotNav: A Controllable Interface for Navigation

Qwen-RobotNav is a scalable navigation model built on Qwen3-VL. It reframes multi-task navigation as observation context modeling. The model exposes a parameterized interface for external control.

Navigation spans many task families. Instruction following, point-goal navigation, object search, target tracking, and driving all differ. Each demands a different strategy for consuming the visual stream.

Instruction following needs long memory to re-reference landmarks. Target tracking needs only the most recent frames. No fixed context strategy serves all tasks well.

The Parameterized Interface

RobotNav formulates all tasks as waypoint trajectory prediction. It predicts 8 waypoints, each with a 2D position and heading. A lightweight 4-layer MLP head produces these from the backbone.
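The output format can be pictured as a small decoding step. The sketch assumes the head emits a flat vector of 24 values laid out as interleaved (x, y, θ) triples; that layout and the names `Waypoint` and `decode_waypoints` are assumptions, since the report summary states only the waypoint count and contents.

```python
from typing import List, NamedTuple

class Waypoint(NamedTuple):
    x: float      # 2D position
    y: float
    theta: float  # heading

NUM_WAYPOINTS = 8

def decode_waypoints(head_output: List[float]) -> List[Waypoint]:
    """Group a flat head output into 8 (x, y, theta) waypoints."""
    expected = NUM_WAYPOINTS * 3
    if len(head_output) != expected:
        raise ValueError(f"expected {expected} values, got {len(head_output)}")
    return [Waypoint(*head_output[i:i + 3]) for i in range(0, expected, 3)]

raw = [0.1 * i for i in range(24)]  # placeholder values, not model output
path = decode_waypoints(raw)
print(len(path), path[0])
```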

The interface has two configuration dimensions. Task modes select navigation behavior across VLN, PointNav, ObjNav, and Tracking. Observation parameters govern how visual history is encoded.

These observation controls include a visual token budget and temporal decay. They also include per-camera importance weights. Training-time randomization over all parameters ensures robustness.
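One way to picture how these three controls could interact is a proportional split of the token budget. The formula below (share proportional to camera weight times `decay ** age`) is purely an illustrative assumption; the report summary names the parameters but not how they are combined.

```python
def allocate_tokens(frames, budget, decay, camera_weights):
    """Split a visual token budget across frames.

    frames: list of (camera_name, age), where age 0 is the most recent frame.
    Each frame's share is proportional to camera_weight * decay ** age.
    Shares are rounded down, so the total never exceeds the budget.
    """
    scores = [camera_weights[cam] * decay ** age for cam, age in frames]
    total = sum(scores)
    return [int(budget * s / total) for s in scores]

frames = [("front", 0), ("front", 1), ("left", 0), ("left", 1)]
alloc = allocate_tokens(
    frames, budget=1000, decay=0.5, camera_weights={"front": 2.0, "left": 1.0}
)
print(alloc)  # [444, 222, 222, 111]
```

With a decay near 1.0 the budget spreads evenly over history, which suits instruction following; a small decay concentrates tokens on recent frames, which suits tracking.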

Camera identity and temporal order use natural-language tags. This requires zero architectural modification to Qwen3-VL. Supporting a new platform needs only a new prompt template.
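A prompt template of this kind might look like the following. The tag wording, the `<image>` placeholder, and the function name are hypothetical; the report summary says only that camera identity and temporal order are conveyed with natural-language tags.

```python
def build_observation_prompt(task_mode, instruction, frames):
    """Build a text prompt that tags each image with camera and time.

    frames: list of (camera_name, steps_ago), ordered oldest first.
    """
    lines = [f"Task mode: {task_mode}", f"Instruction: {instruction}"]
    for camera, steps_ago in frames:
        when = "current frame" if steps_ago == 0 else f"{steps_ago} steps ago"
        lines.append(f"[{camera} camera, {when}] <image>")
    return "\n".join(lines)

prompt = build_observation_prompt(
    "VLN",
    "Walk past the sofa and stop at the door.",
    [("front", 2), ("front", 0), ("left", 0)],
)
print(prompt)
```

Under this scheme, supporting a robot with a rear camera would mean adding `("rear", 0)` entries to `frames`, with no change to the model.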

The Agentic System

The interface makes RobotNav a building block for agentic systems. An upper-tier planner decomposes long-horizon goals into sub-goals. Qwen3.6-Plus serves as this planner in the system.

The planner reconfigures RobotNav’s task mode mid-episode. RobotNav serves as the reactive executor. The two tiers communicate exclusively through natural language.

A two-level memory supports long-horizon reasoning. Single-episode memory summarizes each rollout. Cross-episode memory accumulates durable conclusions like searched regions.
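The two-tier loop and its memory can be sketched with stubs. Everything here is hypothetical scaffolding: `plan` stands in for the planner, `execute` stands in for a RobotNav rollout, and the room list is made up. The point is the control flow: sub-goals are passed as text, each rollout is summarized, and searched regions persist across episodes.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    episode: list = field(default_factory=list)      # single-episode summaries
    cross_episode: set = field(default_factory=set)  # durable conclusions

def plan(goal, memory):
    """Planner stub: one ObjNav sub-goal per room not yet searched."""
    rooms = ["kitchen", "bedroom", "office"]
    return [
        ("ObjNav", room, f"search the {room} for the {goal}")
        for room in rooms
        if room not in memory.cross_episode
    ]

def execute(task_mode, instruction):
    """Executor stub standing in for a navigation rollout."""
    return f"[{task_mode}] finished: {instruction}"

def run_episode(goal, memory):
    memory.episode.clear()
    for task_mode, room, instruction in plan(goal, memory):
        memory.episode.append(execute(task_mode, instruction))
        memory.cross_episode.add(room)  # remember this region was searched
    return list(memory.episode)

memory = Memory(cross_episode={"kitchen"})  # kitchen searched in an earlier episode
first = run_episode("mug", memory)
second = run_episode("mug", memory)
print(len(first), len(second))  # 2 0
```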

Benchmark Results

RobotNav was trained on 15.6M samples. Navigation trajectory data forms 85% of this. Vision-language reasoning data fills the remaining 15%.

| Benchmark | Metric | Result |
|---|---|---|
| VLN-CE RxR (Val-Unseen) | Success Rate | 76.5% |
| VLN-CE R2R (Val-Unseen) | Success Rate | 72.1% |
| EVT-Bench | Tracking Rate | 90.0% |
| HM3Dv2 (ObjectNav) | Success Rate | 75.6% |
| NAVSIM | PDMS | 91.4 |

The agentic system sets a new state of the art on Embodied Question Answering. It improves over the best prior method by 10.8% on HM-EQA. It also improves by 15.4% on EXPRESS-Bench while requiring 77% fewer navigation steps.

The report shows performance improving from 2B to 8B parameters. Joint multi-task training develops a shared spatial-planning substrate. The report states this transfers across task families.

Use Cases with Examples

Each model maps to concrete deployment scenarios. The examples below combine report-supported results with illustrative framing.

  • RobotManip for few-shot deployment on new hardware: A team has a Franka arm and a handful of demonstrations. They fine-tune RobotManip on their own workspace. The report shows the pretrained prior helps more on clutter and unseen states than training from scratch.
  • RobotManip for cross-embodiment skill transfer: A policy is jointly fine-tuned on 6K CobotMagic and 130 ARX demonstrations. It is then tested on four novel ARX tasks with zero target-task demonstrations. The report gives 55.0% success, over 4× the best ablated variant.
  • RobotWorld as a synthetic data engine: A VLA policy needs more training data than physical collection allows. The report lists synthetic data generation as one of three application directions. RobotWorld can generate video for new language instructions.
  • RobotWorld as a policy evaluation environment: The report lists policy evaluation as a second application direction. A policy can be run against generated trajectories before it touches real hardware. This is presented as a direction, not a benchmarked result.
  • RobotNav inside an agentic system: An upper-tier planner decomposes a long-horizon goal into sub-goals. It dispatches navigation calls with different task modes and context settings. The research team’s agentic system improves over the best prior EQA method by 10.8% on HM-EQA.
  • RobotNav for autonomous driving: The same model handles point-goal driving as one task mode. It reaches 91.4 PDMS on NAVSIM. The forward camera receives the highest token weight by default.

Comparison Table: The Three Models

The table below consolidates the technical details. It is a reference for picking the right model.

| Attribute | RobotManip | RobotWorld | RobotNav |
|---|---|---|---|
| Task type | Manipulation (VLA) | Video world model | Navigation |
| Backbone | Qwen3.5-4B | Frozen Qwen2.5-VL | Qwen3-VL |
| Action interface | Camera-frame EEF / joint | Natural language | Waypoint trajectories |
| Training data | ~38,100 hours | 8.6M video-text pairs | 15.6M samples |
| Key architecture | DiT flow-matching head | 60-layer double-stream MMDiT | MLP action head |
| Headline result | 1st on RoboChallenge Table30-v1 | 1st on EWMBench, DreamGen | 76.5% SR on VLN-CE RxR |
| Output | Continuous actions | Predicted video | 8 waypoints (x, y, θ) |
| Public repo | Yes (GitHub) | Blog only | Yes (GitHub) |

The three research reports do not present a combined system. Read together, they cover complementary layers. RobotWorld handles simulation and data generation, RobotManip handles manipulation, and RobotNav handles mobility.

Implementation Note: The Canonical Action Vector

The RobotManip action representation is worth understanding in code terms. It is the mechanism that lets different robots share one model. Below is a simplified illustration of the masking idea.

# Conceptual sketch of RobotManip's 80-dim canonical vector.
# Two 29-dim per-arm blocks + 22 reserved dimensions = 80.
# This is illustrative, not the official implementation.

CANONICAL_DIM = 80
# Per-arm semantic groups, per the report:
ARM_GROUPS = {
    "joints": 7,      # joint positions
    "eef_pose": 9,    # 3D position + 6D rotation
    "gripper": 1,     # parallel gripper width
    "hand": 12,       # dexterous hand joints
}
ARM_BLOCK = sum(ARM_GROUPS.values())  # 29

def build_masked_action(populated_groups, arms):
    """Build the action vector and a per-dimension binary mask.

    populated_groups: set of group names this robot uses.
    arms: 1 for single-arm, 2 for dual-arm.
    Only populated dimensions carry supervision; the rest are masked.
    """
    action = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    idx = 0
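    for _ in range(arms):
        for group, size in ARM_GROUPS.items():
            if group in populated_groups:
                # Mark this group's dimensions as supervised.
                mask[idx:idx + size] = [1] * size  # gradients flow only here
            idx += size
    # Arm blocks are contiguous, so idx ends at arms * ARM_BLOCK. For a
    # single-arm robot the second arm block stays masked, as do the
    # 22 reserved dimensions in every case.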
    return action, mask

# A 7-DOF single-arm robot with a parallel gripper fills joints, eef_pose, and gripper in the first arm block.
_, mask = build_masked_action({"joints", "eef_pose", "gripper"}, arms=1)
print(sum(mask))  # -> 17 populated dims; the rest stay zero and masked

The per-dimension binary mask is the key idea. It ensures gradients flow only through semantically populated entries. This prevents spurious supervision on absent degrees of freedom.

The same masking principle appears in the flow-matching loss. Each sample contributes equally regardless of how many dimensions are active. This stops robots with more populated slots from dominating optimization.
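The normalization described above can be sketched as a masked loss. This is one reading of "each sample contributes equally", not the report's exact formula: plain squared error stands in for the flow-matching regression target, and dividing by the number of active dimensions is an assumption. Four-dimensional toy vectors replace the 80-dimensional one for readability.

```python
def masked_sample_loss(pred, target, mask):
    """Mean squared error over the active (mask == 1) dimensions of one sample."""
    active = sum(mask)
    if active == 0:
        return 0.0
    sq_err = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return sq_err / active

def batch_loss(preds, targets, masks):
    """Average the per-sample losses so each sample carries equal weight."""
    losses = [masked_sample_loss(p, t, m) for p, t, m in zip(preds, targets, masks)]
    return sum(losses) / len(losses)

# Sample A uses 2 dimensions, sample B uses all 4; each active dim is off by 1.
# The large errors in A's masked dimensions (9.0) are ignored.
preds = [[1.0, 1.0, 9.0, 9.0], [1.0, 1.0, 1.0, 1.0]]
targets = [[0.0] * 4, [0.0] * 4]
masks = [[1, 1, 0, 0], [1, 1, 1, 1]]

loss_a = masked_sample_loss(preds[0], targets[0], masks[0])
loss_b = masked_sample_loss(preds[1], targets[1], masks[1])
print(loss_a, loss_b, batch_loss(preds, targets, masks))  # 1.0 1.0 1.0
```

Without the division by `active`, sample B would contribute twice the loss of sample A purely because it populates more slots.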

Key Takeaways

  • Qwen released three embodied AI models: RobotManip, RobotWorld, and RobotNav (grouped as Qwen-RobotSuite).
  • RobotManip aligns robot data into one 80-dimensional action vector and ranks 1st on RoboChallenge Table30-v1.
  • RobotWorld uses natural language as the action interface and ranks 1st overall on EWMBench and DreamGen Bench.
  • RobotNav exposes a controllable token-budget interface and hits 76.5% SR on VLN-CE RxR.
  • Two of the three models ship with public GitHub repositories; RobotWorld is presented only as a research paper.

Check out the technical details and papers (Qwen-RobotManip, Qwen-RobotWorld, and Qwen-RobotNav).