
























Abstract: Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce kernels with limited fusion. We present \textbf{FlashSinkhorn}, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization used in transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT and improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at this https URL.
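The rewriting described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's Triton implementation; it only checks the algebraic identity on CPU. The assumptions here are ours: the cost is $C_{ij} = \|x_i\|^2 + \|y_j\|^2 - 2\,x_i \cdot y_j$ (no factor of 1/2), the potential update is $f_i = -\varepsilon \, \mathrm{LSE}_j\big[(g_j - C_{ij})/\varepsilon + \log b_j\big]$, and all function names and the tile size are hypothetical. Under these assumptions the update equals $\|x_i\|^2 - \varepsilon \, \mathrm{LSE}_j\big[2\,x_i \cdot y_j/\varepsilon + \beta_j\big]$ with a per-column bias $\beta_j = (g_j - \|y_j\|^2)/\varepsilon + \log b_j$, which can be accumulated tile by tile with a running maximum and running sum, as in online softmax.

```python
import numpy as np


def lse(scores, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = scores.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(scores - m).sum(axis=axis))


def f_update_dense(x, y, g, log_b, eps):
    """Reference update that materializes the full n x m cost matrix."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return -eps * lse((g[None, :] - cost) / eps + log_b[None, :], axis=1)


def f_update_streaming(x, y, g, log_b, eps, tile=4):
    """Same update as a row-wise LSE of biased dot-product scores.

    Columns are visited in tiles; only O(n) running statistics are kept,
    so no n x m matrix is ever stored. `tile` is an illustrative parameter.
    """
    n = x.shape[0]
    bias = (g - (y ** 2).sum(-1)) / eps + log_b  # per-column bias beta_j
    run_max = np.full(n, -np.inf)                # running row-wise maximum
    run_sum = np.zeros(n)                        # running sum of exp(score - run_max)
    for start in range(0, y.shape[0], tile):
        sl = slice(start, start + tile)
        scores = (2.0 / eps) * (x @ y[sl].T) + bias[sl][None, :]
        new_max = np.maximum(run_max, scores.max(axis=1))
        # Rescale the old sum to the new maximum, then add this tile.
        run_sum = run_sum * np.exp(run_max - new_max)
        run_sum = run_sum + np.exp(scores - new_max[:, None]).sum(axis=1)
        run_max = new_max
    return (x ** 2).sum(-1) - eps * (run_max + np.log(run_sum))


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
y = rng.normal(size=(10, 3))
g = rng.normal(size=10)
log_b = np.full(10, -np.log(10.0))  # uniform target weights
eps = 0.5

f_dense = f_update_dense(x, y, g, log_b, eps)
f_stream = f_update_streaming(x, y, g, log_b, eps)
print(np.allclose(f_dense, f_stream))
```

The streaming version touches each tile of `y` once and carries only two length-$n$ vectors between tiles, which is the property that makes a fused, single-pass GPU kernel plausible; the actual kernel design in the paper may differ.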
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA) |
| Cite as: | arXiv:2602.03067 [cs.LG] (or arXiv:2602.03067v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.03067 (arXiv-issued DOI via DataCite) |
From: Felix Xiaofeng Ye
[v1] Tue, 3 Feb 2026 03:52:20 UTC (585 KB)
[v2] Tue, 10 Feb 2026 15:30:25 UTC (585 KB)
[v3] Thu, 21 May 2026 03:12:51 UTC (592 KB)