





















Abstract: Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these spikes to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain them. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical "decoupling" mechanism in which the adaptive preconditioner $v_t$ fails to track the instantaneous squared gradients $g_t^2$, so that the adaptive mechanism effectively fails. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold $2/\eta$ for sustained periods, manifesting as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradient-directional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.
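The decoupling described in the abstract can be illustrated with a minimal one-dimensional sketch. It is not the authors' analysis: it assumes a scalar curvature `h`, the standard Adam second-moment update $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ without bias correction (a reasonable simplification late in training), and a squared gradient held at a constant value `g_sq`. The function name `steps_until_unstable` and all parameter values are hypothetical, chosen only for illustration. When `g_sq` is far smaller than the stored `v`, the estimator decays roughly like $\beta_2^t$, and the preconditioned curvature $h / (\sqrt{v_t} + \epsilon)$ eventually crosses the $2/\eta$ threshold quoted in the abstract.

```python
import math


def steps_until_unstable(h, eta, v0, g_sq, beta2=0.999, eps=1e-8,
                         max_steps=100_000):
    """Return the first step at which h / (sqrt(v) + eps) exceeds 2 / eta.

    Illustrative assumptions (not taken from the paper): a 1D problem with
    fixed curvature h, a constant squared gradient g_sq, and no bias
    correction. Returns None if the threshold is never crossed.
    """
    threshold = 2.0 / eta
    v = v0
    for t in range(1, max_steps + 1):
        # Standard Adam second-moment update (exponential moving average).
        v = beta2 * v + (1.0 - beta2) * g_sq
        # Scalar stand-in for the largest preconditioned-Hessian eigenvalue.
        if h / (math.sqrt(v) + eps) > threshold:
            return t
    return None


# Decoupled case: gradients are negligible, so v decays on its own.
decoupled = steps_until_unstable(h=1.0, eta=0.01, v0=1e-2, g_sq=0.0)

# Tracking case: g_sq matches v0, so v stays put and nothing crosses 2 / eta.
tracking = steps_until_unstable(h=1.0, eta=0.01, v0=1e-2, g_sq=1e-2)

print(decoupled, tracking)
```

With these illustrative values the crossing takes on the order of $\ln(2.5 \times 10^{-3}) / \ln(0.999)$, roughly six thousand steps, which shows how slowly $v_t$ can drift toward instability when $\beta_2$ is close to 1. The sketch leaves out the first-moment (momentum) term and any feedback from the loss onto the gradients, both of which matter in the full dynamics the paper studies.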
| Comments: | Accepted to ICML 2026 |
| Subjects: | Machine Learning (cs.LG) |
| Cite as: | arXiv:2506.04805 [cs.LG] |
| | (or arXiv:2506.04805v2 [cs.LG] for this version) |
| | https://doi.org/10.48550/arXiv.2506.04805 (arXiv-issued DOI via DataCite) |
From: Zhiwei Bai
[v1] Thu, 5 Jun 2025 09:31:41 UTC (3,041 KB)
[v2] Mon, 25 May 2026 11:37:32 UTC (4,827 KB)