





















Abstract: We study the implicit bias of momentum-based optimizers on smooth homogeneous models. We show that \textit{momentum steepest descent} algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are \textit{approximate} steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the choice of optimizer. Overall, our results extend earlier lines of work on steepest descent in homogeneous models and momentum-based optimizers in linear models.
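To make the phrase "momentum steepest descent" concrete, the following is a minimal sketch and not the paper's implementation. It assumes an exponential-moving-average momentum buffer, a learning rate decaying as $1/\sqrt{t}$, and a toy linear classifier with exponential loss; the abstract specifies none of these. The names `lmo_direction`, `momentum_steepest_descent`, `grad_exp_loss`, and `DATA` are hypothetical. Only the $\ell_\infty$ (Signum-style) and $\ell_2$ (normalized MomentumGD-style) directions are shown; the spectral-norm case used by Muon is omitted because it requires a matrix orthogonalization step.

```python
import math

# Hypothetical toy data: linearly separable points (x, y) with labels y in {+1, -1}.
DATA = [((2.0, 0.5), 1), ((1.0, 1.5), 1), ((-1.5, -1.0), -1), ((-0.5, -2.0), -1)]


def grad_exp_loss(w):
    """Gradient of sum_i exp(-y_i * <w, x_i>) for a linear (1-homogeneous) model."""
    g = [0.0, 0.0]
    for x, y in DATA:
        margin = y * (w[0] * x[0] + w[1] * x[1])
        coeff = -y * math.exp(-margin)
        g[0] += coeff * x[0]
        g[1] += coeff * x[1]
    return g


def lmo_direction(m, norm):
    """Unit-norm direction d maximizing <m, d> for the chosen norm."""
    if norm == "linf":  # Signum-style: coordinate-wise sign
        return [(v > 0) - (v < 0) for v in m]
    if norm == "l2":  # normalized MomentumGD-style: rescale to unit length
        n = math.sqrt(sum(v * v for v in m))
        return [v / n if n > 0 else 0.0 for v in m]
    raise ValueError(f"unsupported norm: {norm}")


def momentum_steepest_descent(grad_fn, w, norm, steps=200, beta=0.9, lr0=0.1):
    """Steepest descent applied to a momentum buffer instead of the raw gradient."""
    m = [0.0] * len(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        # Assumed momentum form: exponential moving average of gradients.
        m = [beta * mi + (1 - beta) * gi for mi, gi in zip(m, g)]
        d = lmo_direction(m, norm)
        lr = lr0 / math.sqrt(t)  # assumed decaying schedule
        w = [wi - lr * di for wi, di in zip(w, d)]
    return w


if __name__ == "__main__":
    for norm in ("linf", "l2"):
        w = momentum_steepest_descent(grad_exp_loss, [0.0, 0.0], norm)
        print(norm, w)
```

The only difference between the two runs is the map from the momentum buffer to an update direction, which is the sense in which the choice of norm, and hence of optimizer, determines which margin is favored.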
| Comments: | ICML 2026. 8 pages, 1 figure (with appendix: 45 pages, 3 figures) |
| Subjects: | Machine Learning (cs.LG); Machine Learning (stat.ML) |
| Cite as: | arXiv:2602.16340 [cs.LG] (or arXiv:2602.16340v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2602.16340 (arXiv-issued DOI via DataCite) |
From: Eitan Gronich
[v1] Wed, 18 Feb 2026 10:25:07 UTC (217 KB)
[v2] Tue, 3 Mar 2026 09:25:48 UTC (217 KB)
[v3] Sun, 24 May 2026 10:50:02 UTC (368 KB)