




















Abstract: Real-world videos often extend over thousands of frames. Existing generative video super-resolution (VSR) approaches, however, face two persistent challenges when processing long sequences: (1) inefficiency, due to the heavy cost of multi-step denoising over full-length sequences; and (2) poor consistency, because temporal decomposition introduces artifacts and discontinuities. To overcome these limits, we propose InfVSR, which reformulates VSR as an autoregressive-one-step-diffusion paradigm and enables streaming inference with video diffusion priors. First, we adapt the pretrained DiT into a causal structure, maintaining both local and global coherence via a rolling KV-cache and joint visual guidance. Second, we efficiently distill the diffusion process into a single step, using patch-wise pixel supervision and cross-chunk distribution matching. To fill the gap in long-form video evaluation, we build a new benchmark tailored for extended sequences and introduce semantic-level metrics to assess temporal consistency comprehensively. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to a 58x speed-up over existing methods such as MGLD-VSR. Our code and models are available at this https URL.
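To make the streaming idea concrete, the following is a minimal sketch of chunk-wise processing with a rolling, fixed-size cache. It is not the authors' implementation: the names `RollingKVCache`, `stream_chunks`, and `process_video` are hypothetical, the "keys" and "values" are stand-in Python lists rather than attention tensors, and the one-step denoising model is replaced by a placeholder comment. The sketch only illustrates how the context visible to each new chunk stays bounded no matter how long the video is.

```python
from collections import deque


class RollingKVCache:
    """Keeps key/value entries for only the most recent `max_chunks` chunks."""

    def __init__(self, max_chunks):
        # deque(maxlen=...) silently drops the oldest chunk once it is full.
        self._chunks = deque(maxlen=max_chunks)

    def append(self, keys, values):
        self._chunks.append((keys, values))

    def context(self):
        # Concatenate cached entries in temporal order (oldest first).
        keys = [k for ks, _ in self._chunks for k in ks]
        values = [v for _, vs in self._chunks for v in vs]
        return keys, values


def stream_chunks(frames, chunk_size):
    """Yield consecutive, non-overlapping chunks of frames."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]


def process_video(frames, chunk_size=4, max_chunks=2):
    """Return the cached-context length seen by each chunk (illustration only)."""
    cache = RollingKVCache(max_chunks)
    context_sizes = []
    for chunk in stream_chunks(frames, chunk_size):
        keys, _ = cache.context()
        context_sizes.append(len(keys))
        # Placeholder (assumption): a real model would restore `chunk` in one
        # step here, attending causally to the cached keys/values.
        cache.append(list(chunk), list(chunk))
    return context_sizes
```

With 16 frames, a chunk size of 4, and a cache of 2 chunks, the context grows for the first few chunks and then stops growing, which is the property that keeps memory use constant for arbitrarily long inputs.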
| Comments: | Code and model are available at this https URL |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV) |
| Cite as: | arXiv:2510.00948 [cs.CV] (or arXiv:2510.00948v3 [cs.CV] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2510.00948 (arXiv-issued DOI via DataCite) |
From: Kai Liu
[v1] Wed, 1 Oct 2025 14:21:45 UTC (4,440 KB)
[v2] Thu, 21 May 2026 11:36:22 UTC (13,402 KB)
[v3] Fri, 22 May 2026 02:38:04 UTC (13,402 KB)