ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling
Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman · 2024-04-25 · via cs.CV updates on arXiv.org

An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source and receiver location. Traditional methods for constructing acoustic models either require expensive and time-consuming collection of large quantities of acoustic data at densely spaced locations, or rely on privileged knowledge of scene geometry to intelligently select acoustic data sampling locations. We propose active acoustic sampling, a new task for efficiently building an environment acoustic model of an unmapped environment, in which a mobile agent equipped with visual and acoustic sensors jointly constructs the environment acoustic model and the occupancy map on the fly. We introduce ActiveRIR, a reinforcement learning (RL) policy that leverages information from audio-visual sensor streams to guide agent navigation and determine optimal acoustic data sampling positions, yielding a high-quality acoustic model of the environment from a minimal set of acoustic samples. We train our policy with a novel RL reward based on information gain in the environment acoustic model. Evaluated on diverse unseen indoor environments from a state-of-the-art acoustic simulation platform, ActiveRIR outperforms an array of methods, including both traditional navigation agents based on spatial novelty and visual exploration and existing state-of-the-art methods.
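
To make the idea of an information-gain reward concrete, the following is a minimal toy sketch, not the paper's formulation. The abstract only states that the reward is based on information gain in the environment acoustic model; everything below is an illustrative assumption: the environment is a 1D row of positions with one scalar "acoustic value" each, the model is a nearest-neighbor predictor, and the reward is the drop in mean absolute prediction error after a new sample is added. All function names are hypothetical.

```python
from typing import Dict, List


def predict(samples: Dict[int, float], query: int) -> float:
    """Toy acoustic model (assumption): value of the nearest sampled position."""
    if not samples:
        return 0.0
    # Ties in distance are broken toward the lower position index.
    nearest = min(samples, key=lambda pos: (abs(pos - query), pos))
    return samples[nearest]


def model_error(samples: Dict[int, float], ground_truth: List[float]) -> float:
    """Mean absolute error of the toy model over every position."""
    total = sum(abs(predict(samples, q) - v) for q, v in enumerate(ground_truth))
    return total / len(ground_truth)


def information_gain_reward(
    samples: Dict[int, float], new_pos: int, ground_truth: List[float]
) -> float:
    """Hypothetical reward: how much the model error drops after sampling new_pos."""
    before = model_error(samples, ground_truth)
    updated = dict(samples)
    updated[new_pos] = ground_truth[new_pos]
    after = model_error(updated, ground_truth)
    return before - after


# Two acoustically distinct regions; the agent has sampled only position 0.
ground_truth = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
samples = {0: 1.0}

print(information_gain_reward(samples, 4, ground_truth))  # 2.0: new region, large gain
print(information_gain_reward(samples, 1, ground_truth))  # 0.0: redundant sample
```

In this sketch a sample taken in an acoustically new region earns a large reward, while a sample next to an existing one earns none, which is the behavior a sampling policy would be encouraged to learn. Using ground truth to score the model is only plausible during training in a simulator, which is another assumption of the sketch.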