
























Abstract: Training large language models (LLMs) relies on adaptive optimizers such as Adam, which introduce extra operations and require significantly more memory than SGD to maintain first- and second-order moments. While recent works such as GaLore, Fira and APOLLO have proposed state-compressed, memory-efficient variants, a fundamental question remains: what are the minimum modifications to plain SGD needed to match state-of-the-art pretraining performance? We systematically investigate this question using a bottom-up approach and identify two simple yet highly memory- and compute-efficient techniques: (1) column-wise gradient normalization (normalizing the gradient along the output dimension), which boosts SGD performance without momentum; and (2) applying first-order momentum only to the output layer, where gradient variance is highest. Combining these two techniques leads to SCALE (Stochastic Column-normAlized Last-layer momEntum), a simple optimizer for memory-efficient pretraining. Across multiple model sizes (60M-1B), SCALE matches or exceeds the performance of Adam while using only 35-45% of the total memory. It also consistently outperforms memory-efficient optimizers such as GaLore, Fira and APOLLO, making it a strong candidate for large-scale pretraining under memory constraints. For LLaMA 7B, SCALE outperforms the state-of-the-art memory-efficient methods APOLLO and Muon in both perplexity and memory consumption.
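The two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The function names `column_normalize` and `scale_like_step`, the exponential-moving-average form of the momentum, the choice to normalize the momentum buffer (rather than the raw gradient) for the last layer, the `eps` constant, and the list-of-rows matrix layout are all assumptions made for illustration; which matrix axis corresponds to the "output dimension" depends on how the weights are stored.

```python
import math


def column_normalize(grad, eps=1e-8):
    """Divide each column of a 2-D gradient (a list of rows) by its L2 norm.

    Assumption: columns index the output dimension in this layout.
    """
    n_rows, n_cols = len(grad), len(grad[0])
    norms = [
        math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        for j in range(n_cols)
    ]
    return [
        [grad[i][j] / (norms[j] + eps) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def scale_like_step(params, grads, momentum, last_layer, lr=0.01, beta=0.9):
    """One hypothetical update step over a dict of named weight matrices.

    Only `last_layer` keeps a momentum buffer; every other layer is
    stateless, which is where the memory saving would come from.
    """
    for name, grad in grads.items():
        if name == last_layer:
            buf = momentum.get(name)
            if buf is None:
                buf = [[0.0] * len(grad[0]) for _ in grad]
            # Assumed EMA-style first-order momentum, last layer only.
            buf = [
                [beta * m + (1.0 - beta) * g for m, g in zip(m_row, g_row)]
                for m_row, g_row in zip(buf, grad)
            ]
            momentum[name] = buf
            grad = buf
        update = column_normalize(grad)
        params[name] = [
            [w - lr * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(params[name], update)
        ]
    return params, momentum
```

In this sketch the optimizer state is a single buffer the size of the last layer, whereas Adam stores two moment buffers for every parameter; the exact memory figures reported in the abstract come from the paper's own experiments, not from this example.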
| Comments: | Accepted at ICML 2026 |
| Subjects: | Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC) |
| Cite as: | arXiv:2506.16659 [cs.LG] (or arXiv:2506.16659v3 [cs.LG] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2506.16659 (arXiv-issued DOI via DataCite) |
From: Athanasios Glentis
[v1] Fri, 20 Jun 2025 00:10:35 UTC (287 KB)
[v2] Wed, 10 Dec 2025 06:05:11 UTC (348 KB)
[v3] Wed, 20 May 2026 23:37:18 UTC (583 KB)