Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Li, Guinan; Deng, Jiajun; Geng, Mengzhe; Jin, Zengrui; Wang, Tianzi; Hu, Shujie; Cui, Mingyu; Meng, Helen; Liu, Xunying

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2307.02909 (eess)

[Submitted on 6 Jul 2023]

Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains highly challenging. Motivated by the invariance of the visual modality to acoustic signal corruption, this paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach that incorporates visual information into all system components. The efficacy of the video input is consistently demonstrated in the mask-based MVDR speech separation front-end, the DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end, and the Conformer ASR back-end. Audio-visual integrated front-end architectures that perform speech separation and dereverberation either in a pipelined fashion or jointly via mask-based WPD are investigated. The error cost mismatch between the speech enhancement front-end and the ASR back-end is minimized by end-to-end joint fine-tuning using either the ASR cost function alone or its interpolation with the speech enhancement loss. Experiments were conducted on overlapped and reverberant mixture speech data constructed using simulation or replay of the Oxford LRS2 dataset. The proposed audio-visual multi-channel speech separation, dereverberation and recognition systems consistently outperformed the comparable audio-only baseline, with word error rate (WER) reductions of 9.1% and 6.2% absolute (41.7% and 36.0% relative). Consistent speech enhancement improvements were also obtained on PESQ, STOI and SRMR scores.
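
For readers unfamiliar with the mask-based MVDR separation mentioned in the abstract, the following is a minimal NumPy sketch of one common formulation (the reference-channel MVDR solution, w = Φ_n⁻¹Φ_s u / tr(Φ_n⁻¹Φ_s)). It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify which MVDR variant is used, and in the paper the time-frequency masks would come from a neural network that uses audio and video inputs, whereas here they are simply passed in. The function name `mask_based_mvdr` and its parameters are hypothetical.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0, eps=1e-8):
    """Hypothetical helper: reference-channel MVDR from time-frequency masks.

    Y           : complex STFT of the mixture, shape (C, F, T)
    speech_mask : real mask in [0, 1] for the target speaker, shape (F, T)
    noise_mask  : real mask in [0, 1] for interference/noise, shape (F, T)
    Returns the beamformed STFT (F, T) and the filter weights (F, C).
    """
    num_channels = Y.shape[0]

    def masked_psd(mask):
        # Phi[f] = sum_t mask[f,t] * y[f,t] y[f,t]^H / sum_t mask[f,t]
        outer = np.einsum('ft,cft,dft->fcd', mask, Y, Y.conj())
        return outer / (mask.sum(axis=1)[:, None, None] + eps)

    phi_s = masked_psd(speech_mask)
    # Small diagonal loading keeps the noise PSD invertible.
    phi_n = masked_psd(noise_mask) + eps * np.eye(num_channels)

    # Solve Phi_n X = Phi_s per frequency instead of forming an explicit inverse.
    numerator = np.linalg.solve(phi_n, phi_s)                      # (F, C, C)
    trace = np.trace(numerator, axis1=1, axis2=2)[:, None, None]   # (F, 1, 1)
    w = (numerator / (trace + eps))[:, :, ref_channel]             # (F, C)

    # Apply the filter: x_hat[f,t] = w[f]^H y[f,t]
    x_hat = np.einsum('fc,cft->ft', w.conj(), Y)
    return x_hat, w
```

The better the masks isolate target-dominated and interference-dominated time-frequency bins, the closer the estimated covariance matrices are to the true ones, which is one plausible reading of why a corruption-invariant visual stream could help the mask estimator.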

Comments:	IEEE/ACM Transactions on Audio, Speech, and Language Processing
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Cite as:	arXiv:2307.02909 [eess.AS]
	(or arXiv:2307.02909v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2307.02909

Submission history

From: Guinan Li
[v1] Thu, 6 Jul 2023 10:50:46 UTC (1,813 KB)
