Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Mishra, Anurag

arXiv:2505.17073 (cs)

[Submitted on 20 May 2025]

Abstract: Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
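
The entropy comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes that attention entropy means the Shannon entropy of each query position's attention distribution, averaged per head, and that a head counts as "decreased" when its mean entropy is lower after fine-tuning. The abstract does not specify the paper's exact definition or aggregation, so the function names and the toy attention matrices below are hypothetical and for illustration only.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over key positions."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def mean_head_entropy(attn):
    """Average entropy over query positions; attn is a list of rows that each sum to 1."""
    return sum(attention_entropy(row) for row in attn) / len(attn)

def fraction_heads_decreased(base_heads, tuned_heads):
    """Share of heads whose mean entropy is strictly lower in the fine-tuned model."""
    pairs = zip(base_heads, tuned_heads)
    decreased = sum(mean_head_entropy(t) < mean_head_entropy(b) for b, t in pairs)
    return decreased / len(base_heads)

# Toy data (hypothetical): two heads, two query positions, four key positions.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 2   # diffuse attention, high entropy
peaked = [[0.97, 0.01, 0.01, 0.01]] * 2    # focused attention, low entropy
base = [uniform, uniform]                  # stand-in for the pre-trained model
tuned = [peaked, uniform]                  # stand-in for the fine-tuned model

print(fraction_heads_decreased(base, tuned))  # 0.5
```

In this toy case one of the two heads becomes more focused, so the function reports 0.5. Under the same assumptions, the figure reported in the abstract would correspond to a value of 0.62 computed over all heads of the real models.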

Comments:	8 pages (6 content + 2 references/appendix), 6 figures, 2 tables; under review for the ACL 2025 Student Research Workshop
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2505.17073 [cs.CL]
	(or arXiv:2505.17073v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.17073

Submission history

From: Anurag Mishra
[v1] Tue, 20 May 2025 02:15:11 UTC (5,136 KB)
