Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Tamm, Bastiaan; Balabin, Helena; Vandenberghe, Rik; Van hamme, Hugo

doi:10.21437/Interspeech.2022-10147

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2210.00259 (eess)

[Submitted on 1 Oct 2022]

Title: Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications

Authors: Bastiaan Tamm, Helena Balabin, Rik Vandenberghe, Hugo Van hamme


Abstract: Speech quality in online conferencing applications is typically assessed through human judgements in the form of the mean opinion score (MOS) metric. Since such a labor-intensive approach is not feasible for large-scale speech quality assessments in most settings, the focus has shifted towards automated MOS prediction through end-to-end training of deep neural networks (DNNs). Instead of training a network from scratch, we propose to leverage the speech representations from the pre-trained wav2vec-based XLS-R model. However, the number of parameters of such a model exceeds that of task-specific DNNs by several orders of magnitude, which poses a challenge for the resulting fine-tuning procedures on smaller datasets. Therefore, we opt to use pre-trained speech representations from XLS-R in a feature extraction rather than a fine-tuning setting, thereby significantly reducing the number of trainable model parameters. We compare our proposed XLS-R-based feature extractor to a Mel-frequency cepstral coefficient (MFCC)-based one, and experiment with various combinations of bidirectional long short-term memory (Bi-LSTM) and attention pooling feedforward (AttPoolFF) networks trained on the output of the feature extractors. We demonstrate the increased performance of pre-trained XLS-R embeddings in terms of a reduced root mean squared error (RMSE) on the ConferencingSpeech 2022 MOS prediction task.
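
As a rough illustration of two ingredients named in the abstract, the sketch below shows attention pooling over frame-level features and the RMSE metric in plain Python. It is not the authors' implementation: the function names `attention_pool` and `rmse` are hypothetical, and the attention scores are passed in directly, whereas in a trained model they would presumably come from a learned layer applied to the frozen feature-extractor output.

```python
import math


def attention_pool(frames, scores):
    """Collapse T frame-level feature vectors into one utterance-level vector.

    frames: list of T feature vectors (each a list of D floats)
    scores: list of T unnormalized attention scores, one per frame
            (assumption: supplied by the caller, not learned here)
    """
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the frames, one feature dimension at a time.
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


def rmse(predicted, target):
    """Root mean squared error between predicted and reference MOS values."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Two frames with equal scores are simply averaged.
    print(attention_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
    # One exact prediction and one that is off by 2 MOS points.
    print(rmse([3.0, 4.0], [3.0, 2.0]))
```

With equal scores the pooling reduces to a plain mean over frames; unequal scores shift the utterance-level vector towards the frames with higher weight.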

Comments:	5 pages, submitted to INTERSPEECH 2022
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2210.00259 [eess.AS]
	(or arXiv:2210.00259v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2210.00259
Journal reference:	Proc. Interspeech 2022, 4083-4087
Related DOI:	https://doi.org/10.21437/Interspeech.2022-10147

Submission history

From: Bastiaan Tamm
[v1] Sat, 1 Oct 2022 11:51:06 UTC (199 KB)
