Context Compression for Auto-regressive Transformers with Sentinel Tokens

Ren, Siyu; Jia, Qi; Zhu, Kenny Q.

arXiv:2310.08152v2 (cs)

[Submitted on 12 Oct 2023 (v1), last revised 15 Oct 2023 (this version, v2)]

Abstract: Because of its quadratic complexity, the attention module gradually comes to account for the bulk of compute in Transformer-based LLMs during generation. Moreover, the large key-value cache that builds up when processing long inputs causes severe problems for memory footprint and inference latency. In this work, we propose a plug-and-play approach that incrementally compresses the intermediate activations of a specified span of tokens into compact ones, thereby reducing both memory and computational cost when processing subsequent context. Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach over sparse attention baselines in terms of fluency, n-gram matching, and semantic similarity. Finally, we comprehensively profile the benefit of context compression for improving system throughput. Code is available at this https URL.
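
To make the memory argument concrete, here is a minimal back-of-the-envelope sketch of how key-value cache size scales with sequence length, and how much is saved when a span of cached tokens is replaced by a few compact slots. It is not the authors' implementation: the function names, the model dimensions, and the number of compact slots are all hypothetical and chosen only for illustration. The only fact used is that a standard multi-head attention cache stores one key tensor and one value tensor per layer for every cached token.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence of seq_len tokens."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value


def compressed_len(seq_len, span_start, span_end, n_compact):
    """Cache length after the span [span_start, span_end) is replaced by n_compact slots."""
    if not 0 <= span_start <= span_end <= seq_len:
        raise ValueError("span must lie inside the sequence")
    span = span_end - span_start
    if n_compact > span:
        raise ValueError("compact slots should not outnumber the tokens they replace")
    return seq_len - span + n_compact


# Hypothetical configuration: 32 layers, 32 heads, head_dim 128, fp16 values.
cfg = dict(n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2)

full = kv_cache_bytes(4096, **cfg)
# Assumption for illustration: a 2048-token span is summarised by 2 compact slots.
short = kv_cache_bytes(compressed_len(4096, 1024, 3072, n_compact=2), **cfg)

print(f"uncompressed cache: {full / 2**20:.1f} MiB")
print(f"compressed cache:   {short / 2**20:.1f} MiB")
```

Under these assumed numbers the cache shrinks to roughly half its size, and every later token attends over about 2050 cached positions instead of 4096, which is where the reduction in both memory and per-step attention compute comes from.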

Comments:	To appear at EMNLP 2023 main conference
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.08152 [cs.CL]
	(or arXiv:2310.08152v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.08152

Submission history

From: Siyu Ren
[v1] Thu, 12 Oct 2023 09:18:19 UTC (271 KB)
[v2] Sun, 15 Oct 2023 09:15:02 UTC (271 KB)
