Learning What's Missing: Attention Dispersion and EMA Stabilization in Length Generalization

Zsámboki, Pál; Levi, Benjamin; Smith, David Ansel Josef; Kagalwala, Mitansh; Kell, Arlington; Liechty, Samuel; Wang, Cong

arXiv:2510.08341 (cs)

[Submitted on 9 Oct 2025]

Abstract: We study length generalization in transformers through the set complement task, where a model must predict a uniform distribution over tokens absent from an input sequence -- an ability central to board-game style reasoning. Our main theoretical result establishes two statements. First, we prove tight bounds on embedding and value dimensions for single-layer attention-only transformers. Second, we show that if such a model achieves balanced logit displacement at lengths 1 and 2, then it must generalize to longer sequences, though with reduced precision. A mechanistic reading of the proof explains this limitation: as more tokens are attended to, softmax compresses logit displacements, eroding separation between valid and invalid outputs. Training dynamics also suggest a second obstacle: when many next tokens are possible, updates become noisy. We hypothesize that dropout can counteract the first effect and Exponential Moving Average (EMA) the second. We validate these hypotheses through random hyperparameter search on the set complement task, which confirms both mechanisms. We then test OthelloGPT, a GPT-1 style model trained on random Othello moves, and find that EMA again improves length generalization in this more complex setting.
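The mechanisms named in the abstract can be illustrated with a small Python sketch. It is not the authors' code. It assumes a toy model in which attention is exactly uniform, so each attended token lowers its own output logit by `displacement / n` for a sequence of length `n`. The function names, the `displacement` parameter, and the EMA `decay` default are all hypothetical choices made for illustration.

```python
import math


def set_complement_target(sequence, vocab_size):
    """Uniform distribution over the tokens that do NOT appear in `sequence`."""
    present = set(sequence)
    n_absent = vocab_size - len(present)
    if n_absent == 0:
        raise ValueError("every token already appears in the sequence")
    return [0.0 if t in present else 1.0 / n_absent for t in range(vocab_size)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(sequence, vocab_size, displacement):
    """Toy assumption: uniform attention averages the per-token contributions,
    so each seen token pushes its own logit down by displacement / n."""
    n = len(sequence)
    logits = [0.0] * vocab_size
    for t in sequence:
        logits[t] -= displacement / n
    return logits


def invalid_mass(sequence, vocab_size, displacement):
    """Probability mass the toy model places on tokens already in the input."""
    probs = softmax(toy_logits(sequence, vocab_size, displacement))
    return sum(probs[t] for t in set(sequence))


def ema_update(ema_params, params, decay=0.99):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    vocab_size, displacement = 8, 6.0
    for n in (1, 2, 4, 6):
        seq = list(range(n))  # n distinct tokens
        print(n, round(invalid_mass(seq, vocab_size, displacement), 4))
```

In this toy setting the mass assigned to invalid (already seen) tokens grows as the sequence gets longer, because the fixed displacement is spread over more attended tokens. That is one simplified way to read the loss of precision the abstract describes. The `ema_update` helper shows only the generic averaging rule; how EMA is applied during training in the paper is not specified here.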

Comments:	10 pages, 5 figures, 2 tables
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2510.08341 [cs.LG]
	(or arXiv:2510.08341v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2510.08341

Submission history

From: Pál Zsámboki
[v1] Thu, 9 Oct 2025 15:26:48 UTC (1,549 KB)
