


























Abstract: We present InstructFX2FX, an interactive demonstration of multi-turn, text-guided audio effect refinement. Existing text-to-preset systems are largely single-shot, mapping one textual descriptor to one preset. Real audio engineering is instead sequential: engineers refine an existing effect chain through successive instructions such as "make it warmer" or "too harsh, soften it." This poses a stateful problem that single-shot systems do not address: given the current effect parameters and a new instruction, update the sound while preserving what earlier instructions already achieved. InstructFX2FX tackles this with a hybrid architecture that divides labor between a language model and CLAP-guided optimization. The LLM acts as a high-level planner that selects effects, orders the signal chain, and proposes initial parameters, motivated by recent evidence that LLM initialization outperforms CLAP optimization from a random start; CLAP-guided optimization then refines those parameters perceptually, giving more controllable updates than re-prompting the LLM at each turn, which tends to perturb settings the user did not ask to change. At the demo, attendees drive a dry recording through successive natural-language instructions and inspect the evolving FX chain, intermediate optimization checkpoints, and session state. On SocialFX-derived descriptor transitions, CLAP-guided refinement lowers target-directed DSP-feature MMD on 9 of 10 directed descriptor pairs relative to an LLM-only reprompting baseline, reducing the average MMD by roughly 24% (0.45 to 0.34).
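The evaluation metric named in the abstract, maximum mean discrepancy (MMD) between sets of DSP features, can be illustrated with a minimal sketch. The abstract does not state which kernel, bandwidth, or estimator the authors use, nor whether the reported values are MMD or squared MMD, so the RBF kernel, the `bandwidth` parameter, the biased estimator, and the function names `rbf_kernel`, `mmd_squared`, and `relative_reduction` below are all illustrative assumptions rather than the paper's method. The last function simply reproduces the arithmetic behind the reported reduction from 0.45 to 0.34.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Gaussian (RBF) kernel between two feature vectors.

    The kernel choice and bandwidth are assumptions; the abstract
    does not specify them.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs: Sequence[Vector], ys: Sequence[Vector],
                bandwidth: float = 1.0) -> float:
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(a: Sequence[Vector], b: Sequence[Vector]) -> float:
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def relative_reduction(baseline: float, refined: float) -> float:
    """Fractional decrease of `refined` relative to `baseline`."""
    return (baseline - refined) / baseline


if __name__ == "__main__":
    # Toy 2-D "DSP feature" vectors, invented purely for illustration.
    produced = [(0.0, 0.0), (0.1, 0.0)]
    target = [(3.0, 3.0), (3.1, 3.0)]
    print(f"MMD^2 (produced vs target): {mmd_squared(produced, target):.3f}")
    # Averages quoted in the abstract: 0.45 (LLM-only) vs 0.34 (CLAP-guided).
    print(f"relative reduction: {relative_reduction(0.45, 0.34):.1%}")
```

Under these assumptions, a lower value means the features of the processed audio sit closer to those of the target descriptor, which is the direction of improvement the abstract reports.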
From: Song-Ze Yu
[v1] Sat, 20 Jun 2026 12:06:13 UTC (500 KB)