Imitating Task and Motion Planning with Visuomotor Transformers
Murtaza Dalal, Ajay Mandlekar, Caelan Garrett, Ankur Handa, Rusl · 2023-05-26 · via cs.CV updates on arXiv.org

Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly because they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that combining large-scale datasets generated by TAMP supervisors with flexible Transformer models that fit them is a powerful paradigm for robot manipulation. To that end, we present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant Transformer-based policies. In this paper, we present a thorough study of the design decisions required to imitate TAMP and demonstrate that OPTIMUS can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon pick-and-place tasks to shelf and articulated object manipulation, achieving 70 to 80% success rates. Video results and code are available at https://mihdalal.github.io/optimus/
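
The recipe the abstract describes (let an automated supervisor generate demonstrations, then fit a policy to them) can be illustrated with a toy sketch. This is not the OPTIMUS pipeline: `scripted_expert` is a hypothetical stand-in for a TAMP supervisor, a one-parameter linear policy stands in for the visuomotor Transformer, and the observation is a 1-D state offset rather than an image. All function names and parameters here are assumptions made for illustration.

```python
import random


def scripted_expert(position, goal):
    """Hypothetical stand-in for a planner: step halfway toward the goal."""
    return 0.5 * (goal - position)


def collect_demonstrations(num_episodes, horizon, seed=0):
    """Roll out the expert and record (observation, action) pairs."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_episodes):
        position = rng.uniform(-1.0, 1.0)
        goal = rng.uniform(-1.0, 1.0)
        for _ in range(horizon):
            action = scripted_expert(position, goal)
            # Observation is the offset to the goal (an assumption of this toy).
            data.append((goal - position, action))
            position += action
    return data


def fit_linear_policy(data):
    """Behavior cloning by least squares: action ~= gain * observation."""
    numerator = sum(obs * act for obs, act in data)
    denominator = sum(obs * obs for obs, _ in data)
    return numerator / denominator


def rollout(gain, position, goal, horizon):
    """Run the learned policy in closed loop and return the final position."""
    for _ in range(horizon):
        position += gain * (goal - position)
    return position


demos = collect_demonstrations(num_episodes=50, horizon=10)
gain = fit_linear_policy(demos)
final_position = rollout(gain, position=-0.8, goal=0.6, horizon=20)
```

Because the expert here is scripted, more demonstrations cost only compute, which is the scaling argument the abstract makes for planner-generated data over human supervision. The real system would replace the least-squares fit with training a Transformer on image observations.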