SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech

Chen, Weidong; Xing, Xiaofen; Xu, Xiangmin; Pang, Jianxin; Du, Lan

arXiv:2203.03812 (cs)

[Submitted on 8 Mar 2022 (v1), last revised 10 Mar 2022 (this version, v2)]

Abstract: The Transformer has achieved promising results in cognitive speech signal processing, a field of interest in applications ranging from emotion to neurocognitive disorder analysis. However, most works treat the speech signal as a whole, neglecting the pronunciation structure that is unique to speech and reflects the cognitive process. Meanwhile, the Transformer carries a heavy computational burden due to its full attention operation. In this paper, we propose SpeechFormer, a hierarchical, efficient framework that considers the structural characteristics of speech and can serve as a general-purpose backbone for cognitive speech signal processing. SpeechFormer consists of frame, phoneme, word, and utterance stages in succession, each performing neighboring attention according to the structural pattern of speech with high computational efficiency. SpeechFormer is evaluated on speech emotion recognition (IEMOCAP & MELD) and neurocognitive disorder detection (Pitt & DAIC-WOZ) tasks, and the results show that it outperforms the standard Transformer-based framework while greatly reducing the computational cost. Furthermore, SpeechFormer achieves results comparable to state-of-the-art approaches.
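
The idea of stage-wise neighboring attention can be illustrated with a minimal sketch. This is not the authors' implementation: the window sizes, the merge factors, the average-pooling merge step, and the function names `neighboring_attention`, `merge`, and `hierarchical_encode` are all assumptions made for illustration, and learned projections are omitted. It only shows how restricting each token to a local window and coarsening the sequence between stages reduces the work relative to full attention, whose cost grows quadratically with sequence length.

```python
import math


def neighboring_attention(queries, keys, values, window):
    """Attention in which position i attends only to positions within
    window // 2 steps of i, instead of to the whole sequence."""
    n, d = len(queries), len(queries[0])
    half = window // 2
    outputs = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Scaled dot-product scores against the neighbors only.
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in range(lo, hi)
        ]
        # Numerically stable softmax over the local window.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the neighboring value vectors.
        outputs.append([
            sum(w * values[j][c] for w, j in zip(weights, range(lo, hi)))
            for c in range(len(values[0]))
        ])
    return outputs


def merge(tokens, factor):
    """Average consecutive groups of `factor` tokens (assumed merge rule)."""
    merged = []
    for start in range(0, len(tokens), factor):
        group = tokens[start:start + factor]
        merged.append([sum(col) / len(group) for col in zip(*group)])
    return merged


# Hypothetical (window, merge factor) for the frame, phoneme, and word stages.
STAGES = [(3, 2), (5, 2), (7, 2)]


def hierarchical_encode(frames):
    x = frames
    for window, factor in STAGES:
        x = neighboring_attention(x, x, x, window)
        x = merge(x, factor)
    # Utterance stage: the window spans the whole (now short) sequence.
    return neighboring_attention(x, x, x, window=2 * len(x) + 1)
```

With a fixed window of size w, each stage computes roughly n·w attention scores rather than n², and each merge shortens the sequence handed to the next stage.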

Comments:	5 pages, 4 figures. This paper was submitted to Interspeech 2022
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2203.03812 [cs.SD]
	(or arXiv:2203.03812v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2203.03812

Submission history

From: Weidong Chen [view email]
[v1] Tue, 8 Mar 2022 02:22:28 UTC (221 KB)
[v2] Thu, 10 Mar 2022 01:43:15 UTC (226 KB)
