Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

He, Bobby; Martens, James; Zhang, Guodong; Botev, Aleksandar; Brock, Andrew; Smith, Samuel L; Teh, Yee Whye

Computer Science > Machine Learning

arXiv:2302.10322 (cs)

[Submitted on 20 Feb 2023]

Abstract: Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.

Comments:	ICLR 2023
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Cite as:	arXiv:2302.10322 [cs.LG]
	(or arXiv:2302.10322v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2302.10322

Submission history

From: Bobby He
[v1] Mon, 20 Feb 2023 21:26:25 UTC (3,199 KB)
