Single and Few-step Diffusion for Generative Speech Enhancement

Lay, Bunlong; Lemercier, Jean-Marie; Richter, Julius; Gerkmann, Timo

arXiv:2309.09677 (eess)

[Submitted on 18 Sep 2023 (v1), last revised 15 Jan 2024 (this version, v2)]

Abstract: Diffusion models have shown promising results in speech enhancement, using a task-adapted diffusion process for the conditional generation of clean speech given a noisy mixture. However, at test time, the neural network used for score estimation is called multiple times to solve the iterative reverse process. This results in slow inference and causes discretization errors that accumulate over the sampling trajectory. In this paper, we address these limitations through a two-stage training approach. In the first stage, we train the diffusion model in the usual way using the generative denoising score matching loss. In the second stage, we compute the enhanced signal by solving the reverse process and compare the resulting estimate to the clean speech target using a predictive loss. We show that this second training stage achieves the same performance as the baseline model with only 5 function evaluations instead of 60. While the performance of standard generative diffusion algorithms drops dramatically when the number of function evaluations (NFEs) is lowered to obtain single-step diffusion, our proposed method maintains steady performance; it therefore largely outperforms the diffusion baseline in this setting and also generalizes better than its predictive counterpart.
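
As a rough illustration of the second-stage idea described in the abstract (solve the reverse process in a fixed number of steps, then score the result against the clean target with a predictive loss), the following toy sketch may help. Everything in it is an assumption made for illustration and is not taken from the paper: the linear noise schedule `sigma`, the constant `SIGMA_MAX`, the Euler solver on a probability-flow ODE, and the oracle `score_fn` that stands in for a trained score network. It only runs the forward pass; in an actual second training stage the loss would presumably be backpropagated through the sampler using an automatic differentiation framework.

```python
import numpy as np

SIGMA_MAX = 0.5  # hypothetical noise scale, not taken from the paper


def sigma(t):
    """Toy noise schedule sigma(t) = SIGMA_MAX * t (an assumption)."""
    return SIGMA_MAX * t


def make_oracle_score(x_clean):
    """Stand-in for a trained score network (hypothetical).

    Returns the exact score of N(x_clean, sigma(t)^2 I) and counts calls,
    so the number of function evaluations (NFEs) can be inspected.
    """
    calls = {"nfe": 0}

    def score_fn(x, t):
        calls["nfe"] += 1
        return -(x - x_clean) / sigma(t) ** 2

    return score_fn, calls


def reverse_sample(score_fn, x_start, n_steps):
    """Euler steps on the probability-flow ODE from t=1 down to t=0.

    One score evaluation per step, so NFEs == n_steps.
    """
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = x_start
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        g2 = 2.0 * SIGMA_MAX ** 2 * t_cur       # g(t)^2 = d sigma(t)^2 / dt
        drift = -0.5 * g2 * score_fn(x, t_cur)  # probability-flow ODE drift
        x = x + drift * (t_next - t_cur)        # t_next < t_cur: step backwards
    return x


def predictive_loss(x_hat, x_clean):
    """Mean squared error between the enhanced estimate and the clean target."""
    return float(np.mean((x_hat - x_clean) ** 2))


rng = np.random.default_rng(0)
x_clean = rng.standard_normal(16)                          # toy "clean speech"
x_start = x_clean + sigma(1.0) * rng.standard_normal(16)   # toy noisy start point

score_fn, calls = make_oracle_score(x_clean)
x_hat = reverse_sample(score_fn, x_start, n_steps=5)
loss = predictive_loss(x_hat, x_clean)
print(calls["nfe"], loss)
```

With the oracle score the loss is essentially zero for any step count, so the sketch shows only the structure of the computation (N score calls followed by one predictive loss). A learned, imperfect score is where the discretization and estimation errors mentioned in the abstract would appear.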

Comments:	copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2309.09677 [eess.AS]
	(or arXiv:2309.09677v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2309.09677

Submission history

From: Bunlong Lay
[v1] Mon, 18 Sep 2023 11:30:58 UTC (77 KB)
[v2] Mon, 15 Jan 2024 14:17:47 UTC (77 KB)
