Search | arXiv e-print repository

Universal Speech Enhancement with Regression and Generative Mamba

Authors: Rong Chao, Rauf Nasretdinov, Yu-Chiang Frank Wang, Ante Jukić, Szu-Wei Fu, Yu Tsao

Abstract: The Interspeech 2025 URGENT Challenge aimed to advance universal, robust, and generalizable speech enhancement by unifying speech enhancement tasks across a wide variety of conditions, including seven different distortion types and five languages. We present Universal Speech Enhancement Mamba (USEMamba), a state-space speech enhancement model designed to handle long-range sequence modeling, time-frequency structured processing, and sampling frequency-independent feature extraction. Our approach primarily relies on regression-based modeling, which performs well across most distortions. However, for packet loss and bandwidth extension, where missing content must be inferred, a generative variant of the proposed USEMamba proves more effective. Despite being trained on only a subset of the full training data, USEMamba achieved 2nd place in Track 1 during the blind test phase, demonstrating strong generalization across diverse conditions.

Submitted 27 May, 2025; originally announced May 2025.

Comments: Accepted to Interspeech 2025

arXiv:2505.04237 [pdf, other]

doi 10.1109/ICASSP49660.2025.10890638

Robust Speech Recognition with Schrödinger Bridge-Based Speech Enhancement

Authors: Rauf Nasretdinov, Roman Korostik, Ante Jukić

Abstract: In this work, we investigate application of generative speech enhancement to improve the robustness of ASR models in noisy and reverberant conditions. We employ a recently-proposed speech enhancement model based on Schrödinger bridge, which has been shown to perform well compared to diffusion-based approaches. We analyze the impact of model scaling and different sampling methods on the ASR performance. Furthermore, we compare the considered model with predictive and diffusion-based baselines and analyze the speech recognition performance when using different pre-trained ASR models. The proposed approach significantly reduces the word error rate, reducing it by approximately 40% relative to the unprocessed speech signals and by approximately 8% relative to a similarly sized predictive approach.

Submitted 7 May, 2025; originally announced May 2025.

Comments: 5 pages. Published in ICASSP 2025

Journal ref: ICASSP 2025: IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, April 2025

arXiv:2208.07657 [pdf, other]

Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

Authors: Andrei Andrusenko, Rauf Nasretdinov, Aleksei Romanenko

Abstract: Optimization of modern ASR architectures is among the highest priority tasks since it saves many computational resources for model training and inference. The work proposes a new Uconv-Conformer architecture based on the standard Conformer model. It consistently reduces the input sequence length by 16 times, which results in speeding up the work of the intermediate layers. To solve the convergence issue connected with such a significant reduction of the time dimension, we use upsampling blocks like in the U-Net architecture to ensure the correct CTC loss calculation and stabilize network training. The Uconv-Conformer architecture appears to be not only faster in terms of training and inference speed but also shows better WER compared to the baseline Conformer. Our best Uconv-Conformer model shows 47.8% and 23.5% inference acceleration on the CPU and GPU, respectively. Relative WER reduction is 7.3% and 9.2% on LibriSpeech test_clean and test_other respectively.

Submitted 11 March, 2023; v1 submitted 16 August, 2022; originally announced August 2022.

Comments: 5 pages, 1 figure, accepted by ICASSP 2023

Showing 1–3 of 3 results for author: Nasretdinov, R