Diffusion Engine for Musical Orchestrated Noise
Streaming Diffusion Engine for Real-Time Music Generation
DEMON is a streaming diffusion engine for music, built on ACE-Step v1.5 (2 B turbo and 5 B XL turbo). A ring buffer of in-flight generations, each carrying its own per-slot timestep schedule, is advanced by one batched decoder forward pass per tick; after warmup, at full pipeline depth, every tick produces a finished song latent. Every solver-side parameter (per-frame source preservation, velocity scaling, ODE noise injection, classifier-free guidance, x0-target morphing, channel gain) accepts a scalar or a per-frame curve at the latent's 25 Hz frame resolution, and a shared-mutable-curve lane propagates parameter changes to every in-flight slot on the next tick, independent of pipeline depth. Native TensorRT engines cover 60, 120, and 240 s song lengths; longer songs route through a sliding 60 s window. The decoder runs under TensorRT with a refit-enabled engine that hot-swaps LoRAs without rebuild.
Samples
Four captures of the running engine under different configurations: live timbre and denoise under a blend between two prompts, live timbre/structure/denoise on the 5 B XL model, a live LoRA refit, and an LLM agent driving the controls through the MCP server. Each clip plays the audio the engine generated.
TIMBRE + DENOISE
Timbre and denoise driven live, with the conditioning blended between two text prompts: acoustic deep house and Daft Punk-style four-to-the-floor.
XL TURBO · 5B
The XL-turbo (5 B) checkpoint with timbre, structure, and denoise manipulated live as the song plays.
LIVE LORA REFIT
Alternative-rock and funk LoRAs refit live into the running TensorRT decoder, with no engine rebuild.
MCP CONTROL
An LLM agent driving the engine through DEMON's MCP server: it reads the live control values, then writes denoise, structure, timbre, and prompt-blend updates on a two-bar cadence to evolve the remix in real time.
Experiments
These are the experiments whose results you have to hear. The full evaluation (latency, throughput, cross-GPU benchmarks, quality metrics) is in the paper; a handful of findings, though, only really land as audio. Each isolates one property of the engine: streaming parity, per-frame SDE source preservation on a shared asymptotic curve, per-tick scalar denoise on the same curve, per-slot continuity vs. a global-reset baseline, and a per-frame latent morph driven through the shared mutable state.
Stream vs. Batch parity
Bit-identical 8-step latents decoded two ways. The batch path runs a single full 60 s VAE decode; the stream path replays the same latents tick-by-tick through a 5 s windowed decode, mirroring the live pipeline. Same fixture (low-fi loop), deathstep LoRA active in both.
Batch — 8-step sequential, full 60 s decode
Stream — depth = 8 ring buffer, 5 s windowed decode
SDE source blending
The shared asymptotic curve is driven into the SDE step's per-frame source-preservation parameter at the latent's 25 Hz frame resolution. The model runs free for most of the clip, then lands back on the source-anchored side in the final seconds. One generation per fixture, each with its paired LoRA.
Low-fi · deathstep LoRA
Inside Confusion · acoustic LoRA
Live control surface
Same trajectory, different lane: the streaming pipeline's per-tick scalar denoise input is driven along 1 − (k/N)³ instead of the per-frame SDE parameter, with one fresh 0.3 s playback chunk per tick. The sweep holds the model's free response (denoise ≈ 1.0) for most of the run and collapses back to the source as it ends.
Low-fi · deathstep LoRA
Inside Confusion · acoustic LoRA
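The sweep itself is easy to reproduce. The sketch below is a minimal illustration, assuming k is the tick index and N the total number of ticks in the sweep (the page does not define either symbol); `denoise_sweep` is a hypothetical helper, not part of DEMON's API.

```python
def denoise_sweep(num_ticks: int) -> list[float]:
    """Return the per-tick denoise value 1 - (k/N)^3 for k = 0..N."""
    if num_ticks < 1:
        raise ValueError("num_ticks must be at least 1")
    return [1.0 - (k / num_ticks) ** 3 for k in range(num_ticks + 1)]


sweep = denoise_sweep(200)

# Fraction of the sweep spent above 0.5, i.e. closer to the free response
# than to the source.
share_above_half = sum(1 for v in sweep if v > 0.5) / len(sweep)
```

Because the cubic term stays small until k approaches N, the value sits near 1.0 early on and falls off steeply at the end, which is the shape described above.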
Per-slot continuity
Illustrative pair built around a denoise switch (1.0 → 0.5). Under DEMON's per-slot scheduling the output stays continuous across the drain; a StreamDiffusion-style global reset incurs ~648 ms of dead air while the depth = 8 ring buffer refills. Matches the 60/60 vs. 1/60 completion-rate result in the paper.
Per-slot heterogeneous scheduling
Global reset — ~648 ms dead air
Shared-state x0 target morph
Two cover variants of one source, A (deathcore) and B (ambient), share seed and structure, so their latents stay aligned. The x0_target_strength field lives in the same shared mutable registry as the SDE curve, which every slot consults on each step. Driven as a per-frame swell, it blends each frame's x0 prediction toward B's precomputed latent, gated to the refinement half of the schedule. The song swells from A into B and back in a single generation. Because the blend is convex between two clean latents, it stays inside the manifold: no re-noising, no artifact.
x0_target_strength swell — written once into the shared registry, read by every in-flight slot on every step; convex blend toward target B.
A — deathcore cover (endpoint)
B — ambient cover (x0 target)
A → B → A morph — per-frame x0_target_strength swell
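A convex per-frame blend of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not DEMON's implementation: latents are modelled as plain lists of frames, the blend is taken to be linear interpolation, and "the refinement half of the schedule" is assumed to mean step indices at or past the midpoint. `morph_x0` and its arguments are hypothetical names.

```python
def morph_x0(x0_pred, x0_target, strength, step, num_steps):
    """Blend each frame of x0_pred toward x0_target by a per-frame strength.

    x0_pred, x0_target: lists of frames, each frame a list of channel values.
    strength: one value in [0, 1] per frame (0 keeps A, 1 lands on B).
    """
    if not (len(x0_pred) == len(x0_target) == len(strength)):
        raise ValueError("prediction, target and strength need one entry per frame")
    # Assumed gate: leave the first half of the schedule untouched.
    if step < num_steps // 2:
        return [list(frame) for frame in x0_pred]
    blended = []
    for frame, target, s in zip(x0_pred, x0_target, strength):
        s = min(1.0, max(0.0, s))  # clamp so the blend stays convex
        blended.append([(1.0 - s) * a + s * b for a, b in zip(frame, target)])
    return blended


a = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # toy stand-in for variant A
b = [[0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]   # toy stand-in for variant B
swell = [0.0, 1.0, 0.0]                    # A -> B -> A across three frames

early = morph_x0(a, b, swell, step=0, num_steps=8)  # gate closed, unchanged
late = morph_x0(a, b, swell, step=6, num_steps=8)   # middle frame lands on B
```

Clamping the strength to [0, 1] is what keeps each output frame on the line segment between the two latents.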
Performance
Measured on RTX 5090 (32 GB), ACE-Step v1.5 turbo (2 B), 8 denoising steps, flow shift 3.0, windowed VAE decode at 3 s, 60 s source. Pipeline depth trades end-to-end throughput against control latency.
| Metric | depth = 1 | depth = 4 | depth = 8 |
|---|---|---|---|
| Throughput (gen/s) | 8.9 | 11.3 | 12.3 |
| Per-tick latency | 14.0 ms | 42.8 ms | 81.1 ms |
| Submission-time parameter convergence | 112 ms | 471 ms | 649 ms |
| Shared-curve latency | 1 tick | 1 tick | 1 tick |
| Per-frame control resolution | 25 Hz (40 ms) | 25 Hz (40 ms) | 25 Hz (40 ms) |
| VAE windowed decode (3 s) | 7 ms | 7 ms | 7 ms |
| LoRA refit | 1.2 s, no engine rebuild | 1.2 s, no engine rebuild | 1.2 s, no engine rebuild |
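
Several of these figures can be cross-checked with simple arithmetic. The sketch below assumes an idealized reading of the table: at depth = 1 a single slot needs all 8 ticks to finish, and at depth = 8 one generation finishes on every tick. The depth = 4 column is not modelled, and the helper name is hypothetical.

```python
def throughput_gen_per_s(tick_ms: float, ticks_per_generation: float) -> float:
    """Generations per second if one generation completes every N ticks."""
    return 1000.0 / (tick_ms * ticks_per_generation)


STEPS = 8  # denoising steps, from the measurement setup

# depth = 1: a single slot needs all 8 ticks before it finishes.
depth1 = throughput_gen_per_s(14.0, ticks_per_generation=STEPS)
# depth = 8: at full pipeline depth, every tick finishes one generation.
depth8 = throughput_gen_per_s(81.1, ticks_per_generation=1)
# Refilling a drained depth = 8 ring buffer takes 8 ticks.
refill_ms = 8 * 81.1
# One latent frame at 25 Hz spans 40 ms; a 60 s source has 1500 frames.
frame_ms = 1000.0 / 25
frames_60s = 60 * 25

print(f"{depth1:.1f} gen/s, {depth8:.1f} gen/s, {refill_ms:.1f} ms")
```

Under these assumptions the script prints `8.9 gen/s, 12.3 gen/s, 648.8 ms`, in line with the reported throughput at depth = 1 and depth = 8 and with the roughly 649 ms convergence figure at depth = 8.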
Architecture
DEMON is the runtime, control surface, and acceleration layer; the diffusion model is ACE-Step v1.5, released by the ACE-Step team under MIT. The engine maintains a ring buffer of in-flight generations at staggered denoising stages. Crucially, each in-flight slot carries its own denoise scalar and its own timestep schedule: one batched decoder forward pass per tick advances slots that are simultaneously at different stages of different schedules. Native TensorRT engines cover 60, 120, and 240 s song lengths; longer songs route through a sliding 60 s window that advances at chunk boundaries.
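The per-slot scheduling idea can be illustrated with a toy simulation. Everything here is a sketch under assumptions: `Slot`, `RingBuffer`, and `make_schedule` are hypothetical names, the linear mapping from denoise to timesteps is invented for illustration, and the decoder forward pass is replaced by a comment.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    request_id: int
    schedule: list[float]  # this slot's own timesteps, noisiest first
    step: int = 0


def make_schedule(denoise: float, num_steps: int = 8) -> list[float]:
    """Hypothetical schedule: linear timesteps from `denoise` down toward 0."""
    return [denoise * (1.0 - i / num_steps) for i in range(num_steps)]


class RingBuffer:
    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.slots: list[Slot] = []

    def has_room(self) -> bool:
        return len(self.slots) < self.depth

    def submit(self, request_id: int, denoise: float) -> None:
        if not self.has_room():
            raise RuntimeError("ring buffer is full")
        self.slots.append(Slot(request_id, make_schedule(denoise)))

    def tick(self) -> tuple[list[float], list[int]]:
        """Advance every slot one step; return (batch timesteps, finished ids)."""
        # One timestep per slot: the batch mixes stages of different schedules.
        batch_t = [slot.schedule[slot.step] for slot in self.slots]
        # A real engine would run one batched decoder forward pass here.
        for slot in self.slots:
            slot.step += 1
        finished = [s.request_id for s in self.slots if s.step >= len(s.schedule)]
        self.slots = [s for s in self.slots if s.step < len(s.schedule)]
        return batch_t, finished


ring = RingBuffer(depth=8)
completions = []
for tick in range(20):
    if ring.has_room():
        # Alternate denoise values so neighbouring slots follow different schedules.
        ring.submit(request_id=tick, denoise=1.0 if tick % 2 == 0 else 0.5)
    _, finished = ring.tick()
    completions.append(len(finished))
```

In this toy run the first seven ticks complete nothing while the buffer warms up; from the eighth tick on, each tick finishes exactly one generation even though the slots follow two different schedules.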
Two control lanes coexist. Submission-time parameters (text conditioning, source audio, denoise) enter the pipeline when a new request is submitted and reach the next emptied slot within one tick; they then take effect over that slot's remaining schedule. Step-time parameters (per-frame source preservation, x0-target morph, velocity scaling, ODE noise injection, channel gain, guidance, CFG rescale, APG momentum, DCW scalers) live in a shared mutable registry that every slot reads on every forward pass; writing to that registry takes effect on the next tick for every in-flight slot at once, regardless of pipeline depth. The decoder runs under TensorRT with refit enabled, so LoRA deltas are written into the live engine without a rebuild. The paper covers the SDE derivation behind the per-frame source-preservation curve, the windowed-decode receptive-field analysis, and the TensorRT precision recipe.
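The difference between the two lanes comes down to copy versus shared reference. The sketch below is a minimal illustration with hypothetical names (`SharedRegistry`, the slot dictionaries); only the `x0_target_strength` key is taken from this page. It also shows one way a parameter could accept either a scalar or a per-frame curve.

```python
from typing import Sequence, Union

Param = Union[float, Sequence[float]]  # a scalar or a per-frame curve


class SharedRegistry:
    """Step-time parameters that every in-flight slot reads on every pass."""

    def __init__(self, **params: Param) -> None:
        self._params = dict(params)

    def write(self, name: str, value: Param) -> None:
        self._params[name] = value

    def read(self, name: str, frame: int) -> float:
        value = self._params[name]
        if isinstance(value, (int, float)):
            return float(value)  # scalar: same value for every frame
        return float(value[frame])  # curve: one value per 25 Hz latent frame


registry = SharedRegistry(x0_target_strength=0.0)

# Submission-time parameters are copied into each slot when it is filled;
# the registry is held by reference, so all slots see the same object.
slots = [{"denoise": 1.0, "registry": registry} for _ in range(8)]

# A single write (here a three-frame curve) is visible to all eight slots
# the next time they read, with no per-slot update.
registry.write("x0_target_strength", [0.0, 0.5, 1.0])
seen = [slot["registry"].read("x0_target_strength", frame=1) for slot in slots]
```

Because no slot holds its own copy of a step-time parameter, the cost of propagating a change does not grow with pipeline depth, which is consistent with the one-tick shared-curve latency reported above.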
Acknowledgments
ACE-Step v1.5 — the base diffusion model, VAE, text encoder, and semantic LM. Architecture, training, weights, and turbo distillation are the work of the ACE-Step team, released under MIT.
StreamDiffusion — ring-buffer streaming pattern for image diffusion (Kodaira et al., 2023), adapted here for long music latents.
DCW — Differential Correction in Wavelet domain, a post-step correction for flow-matching samplers (Yu et al., CVPR 2026), ported from ACE-Step 1.5 v0.1.7.
BibTeX
If you use DEMON, please cite both DEMON and the underlying ACE-Step model.
@software{fosdick2026demon,
author = {Fosdick, Ryan},
title = {DEMON: Diffusion Engine for Musical Orchestrated Noise},
year = {2026},
url = {https://github.com/daydreamlive/DEMON}
}
@article{acestep2026,
title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
author = {Gong and others},
journal = {arXiv preprint arXiv:2602.00744},
year = {2026}
}