Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
Authors:
Amit Ben-Artzy,
Roy Schwartz
Abstract:
In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens, e.g., by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this holds when the manipulation occurs in the top part of the model, i.e., when k is in the final 30-50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden states of certain tokens with hidden states of other tokens from another prompt, e.g., replacing the word "Italy" with "France" in "What is the capital of Italy?". We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering "Rome"). However, if we apply it earlier, the model conforms to the switch ("Paris"). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
Submitted 31 October, 2024; v1 submitted 5 September, 2024;
originally announced September 2024.
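The random-vector manipulation described in the abstract above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the toy attention-only layer, the function names (`randomize_previous_tokens`, `toy_layer`, `forward`), the parameter `k_start`, and the choice to apply the replacement at every layer from `k_start` onward are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def randomize_previous_tokens(hidden, rng):
    """Copy `hidden` (seq_len, d) and replace every row except the last
    with a random Gaussian vector. The last (current) token is untouched."""
    out = hidden.copy()
    out[:-1] = rng.standard_normal(out[:-1].shape)
    return out

def toy_layer(hidden, kv_source, w_q, w_k, w_v):
    """One causal self-attention layer with a residual connection.
    Queries come from `hidden`; keys and values come from `kv_source`."""
    q = hidden @ w_q
    k = kv_source @ w_k
    v = kv_source @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Causal mask: a token may not attend to positions after itself.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return hidden + softmax(scores) @ v

def forward(hidden, weights, k_start=None, seed=0):
    """Run the toy stack. From layer `k_start` onward (assumption of this
    sketch), attention reads randomized states for all previous tokens."""
    rng = np.random.default_rng(seed)
    for layer_idx, (w_q, w_k, w_v) in enumerate(weights):
        kv = hidden
        if k_start is not None and layer_idx >= k_start:
            kv = randomize_previous_tokens(hidden, rng)
        hidden = toy_layer(hidden, kv, w_q, w_k, w_v)
    return hidden
```

Comparing the last row of `forward(x, weights)` with that of `forward(x, weights, k_start=k)` for different values of `k` gives a rough analogue of the experiment; with random toy weights it says nothing about the behaviour of a trained LLM.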
Compositionally Graded SS316 to C300 Maraging Steel using Additive Manufacturing
Authors:
A. Ben-Artzy,
A. Reichardt,
J-P. Borgonia,
R. P. Dillon,
B. McEnerney,
A. A. Shapiro,
P. Hosemann
Abstract:
Joining of dissimilar metals is required for numerous applications in industries such as chemical, energy and automotive. It is challenging due to differences in melting point, density, and thermal expansion of the metals being joined. Common welding techniques involve limiting melting and solidification to a narrow area, leading to high thermal stresses and potentially brittle intermetallic phases. Furthermore, the geometric complexity of these welded joints can be rather limited. Additive Manufacturing (AM) presents new techniques for joining of dissimilar metals. One of the emerging methods is the building of functionally graded parts using Directed Energy Deposition (DED) to spatially vary composition. In this paper, a SS316L and C300 maraging steel couple was joined by DED and heat treated. Thirteen discrete composition layers were selected using metallurgical considerations, in order to ensure a smooth transition in properties and microstructure. The mechanical properties of the as-built joints were found to be similar to those of the SS part, and no intermetallic phases were found at the interface.
Submitted 20 January, 2021;
originally announced January 2021.