


Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-robot Handovers
Vladimir Iashin, Francesca Palermo, Gökhan Solak, Claudio Coppol · 2020-12-03 · via cs.SD updates on arXiv.org

Human-robot object handover is a key skill for the future of human-robot collaboration. The CORSMAL 2020 Challenge focuses on the perception part of this problem: the robot needs to estimate the filling mass of a container held by a human. Although powerful methods exist for image processing and audio processing individually, solving this problem requires processing data from multiple sensors together. The appearance of the container, the sound of the filling, and the depth data each provide essential information. We propose a multi-modal method to predict three key indicators of the filling mass: filling type, filling level, and container capacity. These indicators are then combined to estimate the filling mass of a container. Our method obtained Top-1 overall performance among all submissions to the CORSMAL 2020 Challenge on both the public and private subsets, while showing no evidence of overfitting. Our source code is publicly available: https://github.com/v-iashin/CORSMAL
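
The abstract does not spell out how the three indicators are combined, so the following is only a minimal sketch of one plausible combination: mass as the filled fraction of the capacity multiplied by a per-type density. The function name `estimate_filling_mass`, the density table, and its values are assumptions made for illustration and are not taken from the authors' code.

```python
# Hypothetical densities in g/mL, chosen for illustration only.
# They are assumptions, not values stated in the abstract.
DENSITY_G_PER_ML = {
    "empty": 0.0,
    "pasta": 0.41,
    "rice": 0.85,
    "water": 1.0,
}


def estimate_filling_mass(filling_type: str,
                          filling_level_pct: float,
                          capacity_ml: float) -> float:
    """Combine the three predicted indicators into a mass estimate in grams.

    Assumed formula: mass = (level / 100) * capacity * density(type).
    """
    if filling_type not in DENSITY_G_PER_ML:
        raise ValueError(f"unknown filling type: {filling_type!r}")
    if not 0.0 <= filling_level_pct <= 100.0:
        raise ValueError("filling level must be a percentage in [0, 100]")
    if capacity_ml < 0.0:
        raise ValueError("container capacity must be non-negative")

    filled_volume_ml = (filling_level_pct / 100.0) * capacity_ml
    return filled_volume_ml * DENSITY_G_PER_ML[filling_type]


if __name__ == "__main__":
    # A 400 mL container predicted to be half full of water.
    print(estimate_filling_mass("water", 50.0, 400.0))
```

Under this assumed formula, an error in any single indicator propagates multiplicatively into the mass estimate, which is one reason to predict each indicator from the modality best suited to it.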