

















Abstract: Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving an average speedup of up to 1.55x in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.
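To make "N:M sparsification of activations" concrete, the following is a minimal sketch, not RT-Lynx itself. It assumes a magnitude-based rule: in every contiguous group of M input features of an activation vector, keep the N entries with the largest absolute value and zero the rest (2:4 by default). The abstract does not specify RT-Lynx's selection criterion or its error-compensation step, so compensation is omitted here. The names `nm_keep_indices`, `nm_sparsify`, and `sparse_linear` are hypothetical. The linear layer counts multiply-accumulates (MACs) to show why 2:4 sparsity roughly halves FLOPs; real speedups require dedicated sparse kernels, which this pure-Python code does not model.

```python
from typing import List, Tuple


def nm_keep_indices(x: List[float], n: int = 2, m: int = 4) -> List[int]:
    """Hypothetical helper: indices kept by N:M magnitude selection.

    Assumption: within each contiguous group of m entries, keep the n entries
    with the largest absolute value (ties broken by lower index).
    """
    if n > m or len(x) % m != 0:
        raise ValueError("need n <= m and len(x) divisible by m")
    kept: List[int] = []
    for start in range(0, len(x), m):
        # Rank the m positions in this group by descending magnitude.
        ranked = sorted(range(start, start + m), key=lambda i: (-abs(x[i]), i))
        kept.extend(sorted(ranked[:n]))
    return kept


def nm_sparsify(x: List[float], n: int = 2, m: int = 4) -> List[float]:
    """Return a copy of x with all non-kept positions set to zero."""
    keep = set(nm_keep_indices(x, n, m))
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sparse_linear(
    x: List[float], weight: List[List[float]], n: int = 2, m: int = 4
) -> Tuple[List[float], int]:
    """Compute y = sparsify(x) @ weight, with weight of shape len(x) x out_dim.

    Only the kept input positions are visited, so the returned MAC count is
    (n / m) * len(x) * out_dim instead of the dense len(x) * out_dim.
    """
    out_dim = len(weight[0])
    y = [0.0] * out_dim
    macs = 0
    for k in nm_keep_indices(x, n, m):
        for j in range(out_dim):
            y[j] += x[k] * weight[k][j]
            macs += 1
    return y, macs


if __name__ == "__main__":
    x = [0.1, -2.0, 0.5, 3.0, 1.0, 0.0, -0.2, 0.3]
    w = [[1.0, float(k)] for k in range(len(x))]  # 8 x 2 toy weight
    print(nm_sparsify(x))      # [0.0, -2.0, 0.0, 3.0, 1.0, 0.0, 0.0, 0.3]
    y, macs = sparse_linear(x, w)
    print(macs, "of", len(x) * len(w[0]), "dense MACs")  # 8 of 16 dense MACs
```

Because each group of four keeps exactly two positions, the MAC count is exactly half of the dense count regardless of the activation values; this fixed ratio is what lets hardware with 2:4 sparse support skip half the work.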
Comments: 33 pages, 18 figures, Accepted by ICML 2026
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2605.26632 [cs.LG] (or arXiv:2605.26632v1 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2605.26632 (arXiv-issued DOI via DataCite, pending registration)
From: Xing Cong
[v1] Tue, 26 May 2026 07:09:49 UTC (33,941 KB)