Anisotropy Is Inherent to Self-Attention in Transformers

Godey, Nathan; de la Clergerie, Éric; Sagot, Benoît

arXiv:2401.12143 (cs)

[Submitted on 22 Jan 2024 (v1), last revised 24 Jan 2024 (this version, v2)]

Abstract: The representation degeneration problem is a phenomenon that is widely observed among self-supervised learning methods based on Transformers. In NLP, it takes the form of anisotropy, a singular property of hidden representations which makes them unexpectedly close to each other in terms of angular distance (cosine-similarity). Some recent works tend to show that anisotropy is a consequence of optimizing the cross-entropy loss on long-tailed distributions of tokens. We show in this paper that anisotropy can also be observed empirically in language models with specific objectives that should not suffer directly from the same consequences. We also show that the anisotropy problem extends to Transformers trained on other modalities. Our observations suggest that anisotropy is actually inherent to Transformers-based models.
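
The abstract characterizes anisotropy as hidden representations being unexpectedly close in cosine similarity. One common way to quantify this is the average cosine similarity over all distinct pairs of vectors. The sketch below illustrates that idea on synthetic data only; it is not necessarily the exact measure used in the paper. The function name `mean_pairwise_cosine`, the toy dimensions, and the offset of 5.0 are all assumptions made for illustration, and every input vector is assumed to be nonzero.

```python
import math
import random


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of nonzero vectors."""
    # Normalize each vector to unit length so a dot product equals the cosine.
    units = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        units.append([x / norm for x in v])

    total, count = 0.0, 0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            total += sum(a * b for a, b in zip(units[i], units[j]))
            count += 1
    return total / count


random.seed(0)
dim, n = 32, 100  # hypothetical toy sizes, not taken from the paper

# Roughly isotropic cloud: independent standard Gaussian coordinates.
isotropic = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

# Anisotropic cloud: the same points plus a shared offset, so every vector
# leans toward one common direction.
shifted = [[x + 5.0 for x in v] for v in isotropic]

iso_score = mean_pairwise_cosine(isotropic)
shifted_score = mean_pairwise_cosine(shifted)
print(f"isotropic: {iso_score:.3f}  shifted: {shifted_score:.3f}")
```

A score near 0 indicates directions spread evenly around the origin, while a score approaching 1 indicates that the vectors occupy a narrow cone. To apply the same measure to a real model, the synthetic lists would be replaced by hidden states extracted from a given layer.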

Comments:	Proceedings of EACL 2024. A previous version of the paper, published as arXiv:2306.07656, was presented at ACL-SRW 2023 (non-archival)
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2401.12143 [cs.CL]
	(or arXiv:2401.12143v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2401.12143

Submission history

From: Nathan Godey
[v1] Mon, 22 Jan 2024 17:26:55 UTC (14,212 KB)
[v2] Wed, 24 Jan 2024 16:07:00 UTC (14,162 KB)
