Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

Yang, Zi; Choudhary, Samridhi; Kunzmann, Siegfried; Zhang, Zheng

doi:10.21437/Interspeech.2023-2045

arXiv:2306.01076 (cs)

[Submitted on 1 Jun 2023 (v1), last revised 8 Jul 2023 (this version, v2)]

Title: Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

Authors: Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang

Abstract: Fine-tuned transformer models have shown superior performance on many natural language tasks. However, their large size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing a compression ratio of up to $63\times$, little accuracy loss, and remarkable inference and training speedup.
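
The two ideas named in the abstract, storing a linear layer as small low-rank tensor cores and rounding those cores to low precision with a scale factor, can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the tensor-train matrix layout of the cores, the 768 = 8 x 8 x 12 factorization, the ranks (1, 16, 16, 1), the 8-bit symmetric quantizer, and the helper names `tt_matrix_to_dense` and `fake_quantize`. In actual quantization-aware training the scale would be a learnable parameter updated by gradient descent; this sketch only shows the forward rounding step.

```python
import numpy as np


def tt_matrix_to_dense(cores):
    """Contract cores of shape (r_prev, m_k, n_k, r_next) into a dense matrix.

    The result has shape (prod(m_k), prod(n_k)). Hypothetical helper, assuming
    a tensor-train matrix layout for the cores.
    """
    result = cores[0]  # shape (1, m_1, n_1, r_1)
    for core in cores[1:]:
        # Contract the shared rank index, then merge row and column modes.
        result = np.einsum("aijr,rklb->aikjlb", result, core)
        a, big_m, m, big_n, n, b = result.shape
        result = result.reshape(a, big_m * m, big_n * n, b)
    return result[0, :, :, 0]


def fake_quantize(x, scale, num_bits=8):
    """Round x to a signed integer grid with step `scale`, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale


rng = np.random.default_rng(0)

# Assumed factorization of a 768 x 768 weight: 768 = 8 * 8 * 12 on both sides.
row_modes, col_modes, ranks = (8, 8, 12), (8, 8, 12), (1, 16, 16, 1)
cores = [
    0.1 * rng.standard_normal((ranks[k], row_modes[k], col_modes[k], ranks[k + 1]))
    for k in range(3)
]

dense = tt_matrix_to_dense(cores)
tt_params = sum(core.size for core in cores)
print("dense shape:", dense.shape)
print("dense parameters:", dense.size, "core parameters:", tt_params)
print("parameter ratio:", dense.size / tt_params)

# Per-core scale chosen so the largest entry maps to the top of the int8 range.
scale = np.abs(cores[0]).max() / 127
quantized_core = fake_quantize(cores[0], scale)
print("max rounding error:", np.abs(quantized_core - cores[0]).max())
```

The parameter ratio printed here reflects only the assumed shapes and ranks for a single layer; it is not the compression ratio reported in the abstract, which also accounts for the reduced precision and the other compressed layers.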

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.01076 [cs.CL]
	(or arXiv:2306.01076v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.01076
Journal reference:	Interspeech 2023
Related DOI:	https://doi.org/10.21437/Interspeech.2023-2045

Submission history

From: Zi Yang
[v1] Thu, 1 Jun 2023 18:32:08 UTC (595 KB)
[v2] Sat, 8 Jul 2023 04:29:09 UTC (183 KB)
