

















Abstract: Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70\%$ over the state of the art while requiring $<50\%$ of the parameters and training $7\times$ faster. By using a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new state of the art in retrieval on MoisesDB, Slakh, and ChocoChorales, and correlates significantly better with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
| Comments: | Accepted at ICML 2026 |
| Subjects: | Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Signal Processing (eess.SP) |
| Cite as: | arXiv:2605.03929 [cs.SD] |
| | (or arXiv:2605.03929v4 [cs.SD] for this version) |
| | https://doi.org/10.48550/arXiv.2605.03929 (arXiv-issued DOI via DataCite) |
From: Davide Marincione
[v1] Tue, 5 May 2026 16:19:58 UTC (3,943 KB)
[v2] Wed, 6 May 2026 09:27:42 UTC (3,940 KB)
[v3] Sat, 9 May 2026 11:18:22 UTC (3,943 KB)
[v4] Tue, 26 May 2026 17:01:23 UTC (4,418 KB)