GraspSplats: Efficient Manipulation with 3D Feature Splatting
Mazeyu Ji, Ri-Zhao Qiu, Xueyan Zou, Xiaolong Wang · 2024-09-04 · via cs.CV updates on arXiv.org

The ability of robots to perform efficient, zero-shot grasping of object parts is crucial for practical applications and is becoming prevalent with recent advances in Vision-Language Models (VLMs). To bridge the 2D-to-3D gap for representations that support such a capability, existing methods rely on neural fields (NeRFs) via differentiable rendering or on point-based projection methods. However, we demonstrate that NeRFs are ill-suited to scene changes because of their implicit representation, and that point-based methods localize parts inaccurately without rendering-based optimization. To address these issues, we propose GraspSplats. Using depth supervision and a novel reference feature computation method, GraspSplats generates high-quality scene representations in under 60 seconds. We further validate the advantages of Gaussian-based representation by showing that the explicit and optimized geometry in GraspSplats is sufficient to natively support (1) real-time grasp sampling and (2) dynamic and articulated object manipulation with point trackers. With extensive experiments on a Franka robot, we demonstrate that GraspSplats significantly outperforms existing methods under diverse task settings. In particular, GraspSplats outperforms NeRF-based methods such as F3RM and LERF-TOGO, as well as 2D detection methods.
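
To illustrate why an explicit representation makes part queries and scene edits cheap, here is a minimal sketch. It is not the GraspSplats implementation. It assumes, purely for illustration, that each Gaussian is stored as a 3D center plus a feature vector, that a part is selected by cosine similarity to a query embedding, and that a tracked motion is a rotation about the z-axis followed by a translation. The names `select_part` and `move_part`, the threshold, and the toy data are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_part(features, query, threshold=0.8):
    """Indices of primitives whose feature matches the query (hypothetical rule)."""
    return [i for i, f in enumerate(features) if cosine(f, query) >= threshold]

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def move_part(centers, indices, angle, translation):
    """Apply a rigid motion to the selected centers only; others are untouched."""
    moved = list(centers)
    for i in indices:
        rx, ry, rz = rotate_z(centers[i], angle)
        moved[i] = (rx + translation[0], ry + translation[1], rz + translation[2])
    return moved

# Toy scene: three primitives with 2D stand-in features (assumed data).
centers = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
features = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
query = (1.0, 0.0)  # stand-in for a text embedding of the part name

part = select_part(features, query)
moved = move_part(centers, part, math.pi / 2, (0.0, 0.0, 0.5))
print(part, moved)
```

Because the primitives are explicit, the edit touches only the selected centers and needs no re-optimization. An implicit field has no per-primitive handle to move in this way, which is the contrast the abstract draws.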