

















Abstract: Much of the existing theory on first-order non-smooth optimization is built on the restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates provably outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.
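As an illustration only, the abstract's verbal definition of a generalized Lipschitz function could be written as the inequality below. The constants L_0 and L_1, the symbol f^\star for the optimal value, and the use of subgradients are notation assumed here for the sketch; they are not taken from the paper, whose exact formulation may differ.

```latex
% Assumed notation: L_0, L_1 >= 0 are constants, f^\star = \inf_x f(x),
% and \partial f(x) is the subdifferential of the convex function f at x.
\[
  \|g\| \;\le\; L_0 + L_1 \bigl( f(x) - f^\star \bigr)
  \qquad \text{for all } x \text{ and all } g \in \partial f(x).
\]
% Setting L_1 = 0 recovers the classical uniformly bounded gradient assumption.
%
% One-dimensional example: f(x) = \cosh x, with f^\star = 1.
\[
  |f'(x)| = |\sinh x| \;\le\; \cosh x = 1 + \bigl( f(x) - f^\star \bigr),
\]
% so the bound holds with L_0 = L_1 = 1, although |f'(x)| is unbounded.
```

Under this reading, the example shows why the class is broader than the bounded-gradient one: the gradient of cosh grows without limit, yet it never exceeds an affine function of the optimality gap.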
| Subjects: | Optimization and Control (math.OC); Machine Learning (cs.LG) |
| Cite as: | arXiv:2605.15522 [math.OC] |
| | (or arXiv:2605.15522v2 [math.OC] for this version) |
| | https://doi.org/10.48550/arXiv.2605.15522 (arXiv-issued DOI via DataCite) |
From: Dmitry Kovalev
[v1] Fri, 15 May 2026 01:43:22 UTC (36 KB)
[v2] Tue, 26 May 2026 17:45:23 UTC (40 KB)