Unified 3D Scene Understanding Through Physical World Modeling
Wanhee Lee, · 2026-05-26 · via cs updates on arXiv.org

Abstract: Understanding 3D scenes requires flexible combinations of visual reasoning tasks, including depth estimation, novel view synthesis, and object manipulation, all of which are essential for perception and interaction. Existing approaches have typically addressed these tasks in isolation, preventing them from sharing a common representation or transferring knowledge across tasks. A conceptually simpler but practically non-trivial alternative is to unify these diverse tasks into a single model, reducing different tasks from separate training objectives to merely different prompts and allowing for joint training across all datasets. In this work, we present a physical world model for unified 3D understanding and interaction (3WM), formulated as a probabilistic graphical model in which nodes represent multimodal scene elements such as RGB, optical flow, and camera pose. Diverse tasks emerge from different inference pathways through the graph: novel view synthesis from RGB and dense flow prompts, object manipulation from RGB and sparse flow prompts, and depth estimation from RGB and camera conditioning, all zero-shot without task-specific training. 3WM outperforms specialized baselines without the need for finetuning by offering precise controllability, strong geometric consistency, and robustness in real-world scenarios, achieving state-of-the-art performance on NVS and 3D object manipulation. Beyond predefined tasks, the model supports composable inference pathways, such as moving objects aside while navigating a 3D environment, enabling complex geometric reasoning. This demonstrates that a unified model can serve as a practical alternative to fragmented task-specific systems, taking a step towards a general-purpose visual world model.
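The idea that tasks differ only in which graph nodes are supplied as prompts can be illustrated with a small sketch. This is not the authors' implementation: the `Pathway` class, the node names (`rgb`, `dense_flow`, `sparse_flow`, `camera_pose`), and the `resolve_task` and `compose` helpers are hypothetical, and treating composition as a union of prompt sets is an assumption made only for illustration. The three prompt sets themselves are the ones listed in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """A task expressed as the set of graph nodes supplied as prompts (hypothetical)."""
    name: str
    prompts: frozenset


# Prompt sets as described in the abstract; the identifiers are illustrative only.
PATHWAYS = (
    Pathway("novel_view_synthesis", frozenset({"rgb", "dense_flow"})),
    Pathway("object_manipulation", frozenset({"rgb", "sparse_flow"})),
    Pathway("depth_estimation", frozenset({"rgb", "camera_pose"})),
)


def resolve_task(prompts):
    """Return the name of the pathway whose prompt set matches exactly, else None."""
    key = frozenset(prompts)
    for pathway in PATHWAYS:
        if pathway.prompts == key:
            return pathway.name
    return None


def compose(first, second):
    """Assumed composition rule: condition on the union of both prompt sets."""
    return Pathway(f"{first.name}+{second.name}", first.prompts | second.prompts)


if __name__ == "__main__":
    print(resolve_task(["rgb", "dense_flow"]))
    print(sorted(compose(PATHWAYS[0], PATHWAYS[1]).prompts))
```

The sketch only models task selection by prompt set; it says nothing about how the model performs inference over the graph.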
Comments: Published as a conference paper at ICLR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2605.24321 [cs.CV]
  (or arXiv:2605.24321v1 [cs.CV] for this version)
  https://doi.org/10.48550/arXiv.2605.24321

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Wanhee Lee
[v1] Sat, 23 May 2026 01:01:35 UTC (37,375 KB)