EmoFormer: A Text-Independent Speech Emotion Recognition using a Hybrid Transformer-CNN model

Hasan, Rashedul; Nigar, Meher; Mamun, Nursadul; Paul, Sayan

arXiv:2501.12682 (eess)

[Submitted on 22 Jan 2025]

Authors:Rashedul Hasan, Meher Nigar, Nursadul Mamun, Sayan Paul

Abstract: Speech Emotion Recognition is a crucial area of research in human-computer interaction. While significant work has been done in this field, many state-of-the-art networks struggle to accurately recognize emotions in speech when the data is both speech- and speaker-independent. To address this limitation, this study proposes EmoFormer, a hybrid model combining convolutional neural networks (CNNs) with Transformer encoders to capture emotion patterns in speech data for such independent datasets. The EmoFormer network was trained and tested using the Expressive Anechoic Recordings of Speech (EARS) dataset, recently released by Meta. We experimented with two feature extraction techniques: Mel-frequency cepstral coefficients (MFCCs) and x-vectors. The model was evaluated on emotion sets comprising 5, 7, 10, and 23 distinct categories. The results demonstrate that the model achieved its best performance with five emotions, attaining an accuracy of 90%, a precision of 0.92, and a recall and F1-score of 0.91. However, performance decreased as the number of emotions increased, with an accuracy of 83% for seven emotions, compared to 70% for the baseline network. This study highlights the effectiveness of combining CNNs and Transformer-based architectures for emotion recognition from speech, particularly when using MFCC features.
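The abstract reports accuracy, precision, recall, and F1-score without stating how the per-class scores are aggregated. As a minimal sketch, assuming macro averaging (an assumption, not something the abstract confirms), the four metrics could be computed from predicted and true labels as follows. The function name `macro_metrics` and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption here; the paper's abstract does not
    say which averaging scheme was used.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical toy labels for illustration only (not results from the paper).
y_true = ["anger", "anger", "joy", "joy", "sad", "sad"]
y_pred = ["anger", "joy", "joy", "joy", "sad", "anger"]
metrics = macro_metrics(y_true, y_pred)
```

With macro averaging every emotion class contributes equally, so the scores can diverge from accuracy when classes are imbalanced, which becomes more likely as the label set grows from 5 to 23 categories.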

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2501.12682 [eess.AS]
	(or arXiv:2501.12682v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2501.12682

Submission history

From: Dr. Nursadul Mamun
[v1] Wed, 22 Jan 2025 07:00:15 UTC (3,120 KB)
