Abstract: We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first-ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it.
In low-resolution text-to-video training, compared against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Fréchet Video Distance (FVD, lower is better) from 561.04 to 279.01, a roughly 2.0x improvement, and also improves CLIP text-video similarity and aesthetic score.
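The "2.0x" figure can be checked directly from the two FVD values quoted above. The following is a minimal sketch of that arithmetic only; the function names `improvement_factor` and `relative_reduction` are hypothetical helpers introduced here for illustration and are not taken from the paper.

```python
def improvement_factor(baseline: float, new: float) -> float:
    """Ratio baseline / new for a lower-is-better metric such as FVD."""
    if new <= 0:
        raise ValueError("metric value must be positive")
    return baseline / new


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which the metric dropped relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


# FVD values as reported in the abstract (monolithic baseline vs. Paris 2.0).
fvd_monolithic = 561.04
fvd_paris2 = 279.01

factor = improvement_factor(fvd_monolithic, fvd_paris2)
reduction = relative_reduction(fvd_monolithic, fvd_paris2)

print(f"improvement factor: {factor:.2f}x")
print(f"relative reduction: {reduction:.1%}")
```

Under this reading, a "2.0x improvement" means the baseline FVD is about twice the Paris 2.0 FVD, which is equivalent to roughly halving the metric.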
| Comments: | 6 pages, 5 figures |
| Subjects: | Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) |
| ACM classes: | I.2.10; I.2.11 |
| Cite as: | arXiv:2605.26064 [cs.CV] |
| (or arXiv:2605.26064v1 [cs.CV] for this version) | |
| https://doi.org/10.48550/arXiv.2605.26064 arXiv-issued DOI via DataCite (pending registration) |
From: Marcos Villagra
[v1] Mon, 25 May 2026 17:27:22 UTC (2,417 KB)