Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

Makishima, Naoki; Ihori, Mana; Takashima, Akihiko; Tanaka, Tomohiro; Orihashi, Shota; Masumura, Ryo

arXiv:2103.01463 (cs)

[Submitted on 2 Mar 2021]

Abstract: We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique for estimating the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model with an audio-only loss, which reflects the distance between the source signals and the separated signals. However, conventional losses do not reflect the characteristics of the speech signals, including the speaker's characteristics and phonetic information, which leads to distortion or residual noise. To address this problem, we propose the cross-modal correspondence (CMC) loss, which is based on the co-occurrence of the speech signal and the visual signal. Since the visual signal is not affected by background noise and contains speaker and phonetic information, using the CMC loss enables the audio-visual speech separation model to remove noise while preserving the speech characteristics. Experimental results demonstrate that the proposed method learns the co-occurrence through the CMC loss, which improves separation performance.
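
The abstract does not give the formula for the CMC loss, so the following is only an illustrative sketch of the general idea: an audio-only distance term combined with a term that rewards agreement between each separated signal and its own speaker's visual signal. Everything here is an assumption made for illustration and not the authors' formulation: the function names, the use of mean squared error as the audio-only loss, the softmax cross-entropy over cosine similarities as the correspondence term, the precomputed embedding vectors, and the weight of 0.1.

```python
import math

def mse(source, separated):
    """Audio-only loss: mean squared error between two equal-length signals."""
    return sum((s - e) ** 2 for s, e in zip(source, separated)) / len(source)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def correspondence_loss(audio_embs, visual_embs):
    """Contrastive stand-in for a cross-modal correspondence term (assumed form).

    audio_embs[i] and visual_embs[i] are hypothetical embeddings of the
    separated speech and the video of speaker i. The loss is small when each
    audio embedding is most similar to its own speaker's visual embedding.
    """
    total = 0.0
    for i, audio in enumerate(audio_embs):
        sims = [cosine(audio, visual) for visual in visual_embs]
        log_norm = math.log(sum(math.exp(s) for s in sims))
        total += log_norm - sims[i]  # cross-entropy with target index i
    return total / len(audio_embs)

def total_loss(sources, separated, audio_embs, visual_embs, weight=0.1):
    """Audio-only loss plus a weighted correspondence term (weight is arbitrary)."""
    audio = sum(mse(s, e) for s, e in zip(sources, separated)) / len(sources)
    return audio + weight * correspondence_loss(audio_embs, visual_embs)

# Toy check: matched audio/visual pairs score lower than swapped pairs.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
matched = correspondence_loss(audio_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped = correspondence_loss(audio_embs, [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

In this toy setup the audio-only term alone cannot tell whether a separated signal still sounds like the right speaker, while the correspondence term penalizes a mismatch with the visual stream. How the embeddings would be computed from waveforms and video frames is left open here.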

Comments:	Accepted to ICASSP 2021
Subjects:	Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2103.01463 [cs.SD]
	(or arXiv:2103.01463v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2103.01463

Submission history

From: Naoki Makishima
[v1] Tue, 2 Mar 2021 04:29:26 UTC (437 KB)
