Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Huang, Tianjin; Hu, Haotian; Zhang, Zhenyu; Jin, Gaojie; Li, Xiang; Shen, Li; Chen, Tianlong; Liu, Lu; Wen, Qingsong; Wang, Zhangyang; Liu, Shiwei


arXiv:2502.17055 (cs)

[Submitted on 24 Feb 2025 (v1), last revised 11 Apr 2025 (this version, v2)]

Abstract: This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical $l_2$-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at this https URL.
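The three components listed in the abstract can be illustrated with a minimal Python sketch. This is one plausible reading of the abstract, not the authors' released implementation: the class name `StableGradientFilter`, the exponential-moving-average form of the "historical" statistics, the clamp-style clipping, and every hyperparameter name and default (`theta`, `gamma1`, `gamma2`, `reset_interval`, `eps`) are assumptions made for illustration. The gradient is treated as a flat list of floats to keep the example dependency-free.

```python
import math


class StableGradientFilter:
    """Illustrative sketch of the three ideas in the abstract (hypothetical, not the authors' code)."""

    def __init__(self, theta=0.999, gamma1=0.7, gamma2=0.9,
                 reset_interval=1000, eps=1e-8):
        # All names and default values below are assumptions for illustration.
        self.theta = theta                  # decay for the running max-magnitude estimate
        self.gamma1 = gamma1                # decay for the running gradient norm
        self.gamma2 = gamma2                # decay for the running squared gradient norm
        self.reset_interval = reset_interval
        self.eps = eps
        self.step_count = 0
        self.max_ema = 0.0
        self.norm_m = 0.0
        self.norm_v = 0.0

    def process(self, grad):
        """Return (filtered gradient, flag telling the caller to reset Adam's moments)."""
        self.step_count += 1
        t = self.step_count

        # (1) Adaptive spike clipping: the threshold follows the history of max |g|.
        g_max = max(abs(g) for g in grad)
        self.max_ema = self.theta * self.max_ema + (1 - self.theta) * g_max
        threshold = self.max_ema / (1 - self.theta ** t)  # bias-corrected average
        grad = [math.copysign(threshold, g) if abs(g) > threshold else g
                for g in grad]

        # (2) Rescale the whole gradient using running statistics of its l2 norm.
        norm = math.sqrt(sum(g * g for g in grad))
        self.norm_m = self.gamma1 * self.norm_m + (1 - self.gamma1) * norm
        self.norm_v = self.gamma2 * self.norm_v + (1 - self.gamma2) * norm * norm
        m_hat = self.norm_m / (1 - self.gamma1 ** t)
        v_hat = self.norm_v / (1 - self.gamma2 ** t)
        target_norm = m_hat / (math.sqrt(v_hat) + self.eps)
        scale = target_norm / (norm + self.eps)
        grad = [g * scale for g in grad]

        # (3) Momentum reset: periodically signal that Adam's moments should be cleared.
        reset_moments = (t % self.reset_interval == 0)
        return grad, reset_moments
```

In a training loop, the filtered gradient would be passed to an ordinary Adam update, and the caller would zero Adam's first and second moment buffers whenever the returned flag is true. Under these assumptions, a single spiked entry is clamped to a threshold that adapts to past gradient magnitudes rather than to a fixed constant.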

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.17055 [cs.LG]
	(or arXiv:2502.17055v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2502.17055

Submission history

From: Tianjin Huang
[v1] Mon, 24 Feb 2025 11:09:15 UTC (681 KB)
[v2] Fri, 11 Apr 2025 19:48:37 UTC (873 KB)
