

















Abstract: Modern machine learning is dominated by complex, overparameterized architectures capable of interpolating the data and achieving zero training loss. For such models, we investigate the convergence properties of two popular modifications of standard SGD: clipped SGD and normalized SGD. We show that, under overparameterization and a mild assumption on the batch size, neither clipped nor normalized SGD suffers from the bias typically introduced by clipping, and both converge at effectively the same rate as their deterministic counterparts. This provides a rigorous theoretical justification for the empirical success of gradient clipping methods. Our analysis employs the $(L_0,L_1)$-smoothness condition, under which we obtain convergence rates that improve upon the best known results in prior work. Furthermore, we extend the analysis to several challenging regimes: heavy-tailed noise, $(H_0,H_1)$-smoothness (which is strictly weaker than the standard assumptions in the optimization literature), and the deterministic regime.
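To make the two update rules named in the abstract concrete, here is a minimal sketch in plain Python. It uses the textbook forms of the updates (clipping rescales the stochastic gradient to norm at most `c`; normalization rescales it to unit norm), which may differ in detail from the variants analyzed in the paper. The toy problem (a random least-squares instance with fewer samples than parameters, so an interpolating solution exists), the helper names, and all values of `lr`, `c`, `n`, `d`, and `batch_size` are illustrative assumptions, not taken from the paper.

```python
import math
import random


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def clipped_step(x, g, lr, c):
    """Clipped SGD (textbook form): x - lr * min(1, c / ||g||) * g."""
    n = norm(g)
    scale = 1.0 if n <= c else c / n
    return [xi - lr * scale * gi for xi, gi in zip(x, g)]


def normalized_step(x, g, lr, eps=1e-12):
    """Normalized SGD (textbook form): x - lr * g / ||g||.

    If the gradient is numerically zero, the iterate is left unchanged.
    """
    n = norm(g)
    if n < eps:
        return list(x)
    return [xi - lr * gi / n for xi, gi in zip(x, g)]


def batch_grad(x, A, b, batch):
    """Gradient of the mini-batch loss (1 / (2|B|)) * sum_{i in B} (a_i . x - b_i)^2."""
    g = [0.0] * len(x)
    for i in batch:
        r = dot(A[i], x) - b[i]
        for j, a in enumerate(A[i]):
            g[j] += r * a / len(batch)
    return g


def full_loss(x, A, b):
    """Full-data loss (1 / (2n)) * sum_i (a_i . x - b_i)^2."""
    return sum((dot(row, x) - bi) ** 2 for row, bi in zip(A, b)) / (2 * len(A))


def run(step_fn, steps=5000, n=5, d=20, batch_size=2, seed=0):
    """Run step_fn on a hypothetical interpolation problem (n < d).

    Returns (initial_loss, final_loss) measured on the full data set.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    x_true = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = [dot(row, x_true) for row in A]  # labels are exactly fit by x_true
    x = [0.0] * d
    initial = full_loss(x, A, b)
    for _ in range(steps):
        batch = rng.sample(range(n), batch_size)
        x = step_fn(x, batch_grad(x, A, b, batch))
    return initial, full_loss(x, A, b)


if __name__ == "__main__":
    print("clipped   :", run(lambda x, g: clipped_step(x, g, lr=0.02, c=1.0)))
    print("normalized:", run(lambda x, g: normalized_step(x, g, lr=0.02)))
```

Because the labels are generated by `x_true`, every mini-batch loss shares a common minimizer, which is the interpolation setting the abstract refers to. Note that with a constant `lr` the normalized step always moves a distance of exactly `lr`, so in this sketch it can only settle into a neighborhood of a solution; a decaying step size would be needed to drive the loss further down.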
| Subjects: | Optimization and Control (math.OC) |
| Cite as: | arXiv:2605.14800 [math.OC] (or arXiv:2605.14800v2 [math.OC] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.14800 (arXiv-issued DOI via DataCite) |
From: Aleksandr Lobanov
[v1] Thu, 14 May 2026 13:09:48 UTC (2,020 KB)
[v2] Tue, 26 May 2026 16:52:56 UTC (2,025 KB)