Audio-Visual Speech Enhancement with Score-Based Generative Models

Richter, Julius; Frintrop, Simone; Gerkmann, Timo

arXiv:2306.01432 (eess)

[Submitted on 2 Jun 2023]

Abstract: This paper introduces an audio-visual speech enhancement system that leverages score-based generative models, also known as diffusion models, conditioned on visual information. In particular, we exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading. The layer-wise features of its transformer-based encoder are aggregated, time-aligned, and incorporated into the noise conditional score network. Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality and reduces generative artifacts such as phonetic confusions with respect to the audio-only equivalent. The latter is supported by the word error rate of a downstream automatic speech recognition model, which decreases noticeably, especially at low input signal-to-noise ratios.
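
The abstract does not say how the layer-wise features are aggregated, time-aligned, or fed to the score network. As a rough illustration only, the sketch below assumes a softmax-weighted sum over encoder layers, linear interpolation from the video frame rate to the audio frame rate, and per-frame concatenation with the noisy audio features. All function names, shapes, and choices here are hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_layers(layer_feats, layer_logits):
    # Assumed aggregation: weighted sum over layers.
    # layer_feats[l][t][d] = feature d of video frame t at encoder layer l.
    weights = softmax(layer_logits)
    num_frames = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    return [
        [sum(w * layer_feats[l][t][d] for l, w in enumerate(weights))
         for d in range(dim)]
        for t in range(num_frames)
    ]

def time_align(feats, num_target_frames):
    # Assumed alignment: linear interpolation along the time axis,
    # from len(feats) video frames to num_target_frames audio frames.
    n = len(feats)
    if n == 1 or num_target_frames == 1:
        return [list(feats[0]) for _ in range(num_target_frames)]
    out = []
    for k in range(num_target_frames):
        pos = k * (n - 1) / (num_target_frames - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b
                    for a, b in zip(feats[lo], feats[hi])])
    return out

def condition(audio_frames, visual_frames):
    # Assumed conditioning: concatenate audio and visual features per frame.
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]

# Toy input: 2 encoder layers, 2 video frames, 1 feature dimension.
layers = [[[0.0], [2.0]], [[2.0], [4.0]]]
visual = time_align(aggregate_layers(layers, [0.0, 0.0]), 3)
score_net_input = condition([[0.1], [0.2], [0.3]], visual)
```

In this toy setup the two video frames are stretched to three audio frames, so each audio frame gets a visual feature before the combined sequence would be passed to the score network.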

Comments:	Submitted to ITG Conference on Speech Communication
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Cite as:	arXiv:2306.01432 [eess.AS]
	(or arXiv:2306.01432v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2306.01432

Submission history

From: Julius Richter
[v1] Fri, 2 Jun 2023 10:43:42 UTC (654 KB)
