-
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
Authors:
Nils L. Westhausen,
Hendrik Kayser,
Theresa Jansen,
Bernd T. Meyer
Abstract:
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorit…
▽ More
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.
△ Less
Submitted 30 October, 2024; v1 submitted 3 May, 2024;
originally announced May 2024.
-
Binaural multichannel blind speaker separation with a causal low-latency and low-complexity approach
Authors:
Nils L. Westhausen,
Bernd T. Meyer
Abstract:
In this paper, we introduce a causal low-latency low-complexity approach for binaural multichannel blind speaker separation in noisy reverberant conditions.
The model, referred to as Group Communication Binaural Filter and Sum Network (GCBFSnet) predicts complex filters for filter-and-sum beamforming in the time-frequency domain.
We apply Group Communication (GC),
i.e., latent model variable…
▽ More
In this paper, we introduce a causal low-latency low-complexity approach for binaural multichannel blind speaker separation in noisy reverberant conditions.
The model, referred to as Group Communication Binaural Filter and Sum Network (GCBFSnet) predicts complex filters for filter-and-sum beamforming in the time-frequency domain.
We apply Group Communication (GC),
i.e., latent model variables are split into groups and processed with a shared sequence model with the aim of reducing the complexity of a simple model only containing one convolutional and one recurrent module.
With GC we are able to reduce the size of the model by up to 83 % and the complexity up to 73 % compared to the model without GC, while mostly retaining performance.
Even for the smallest model configuration, GCBFSnet matches the performance of a low-complexity TasNet baseline in most metrics despite the larger size and higher number of required operations of the baseline.
△ Less
Submitted 8 December, 2023;
originally announced December 2023.
-
Low bit rate binaural link for improved ultra low-latency low-complexity multichannel speech enhancement in Hearing Aids
Authors:
Nils L. Westhausen,
Bernd T. Meyer
Abstract:
Speech enhancement in hearing aids is a challenging task since the hardware limits the number of possible operations and the latency needs to be in the range of only a few milliseconds. We propose a deep-learning model compatible with these limitations, which we refer to as Group-Communication Filter-and-Sum Network (GCFSnet). GCFSnet is a causal multiple-input single output enhancement model usin…
▽ More
Speech enhancement in hearing aids is a challenging task since the hardware limits the number of possible operations and the latency needs to be in the range of only a few milliseconds. We propose a deep-learning model compatible with these limitations, which we refer to as Group-Communication Filter-and-Sum Network (GCFSnet). GCFSnet is a causal multiple-input single output enhancement model using filter-and-sum processing in the time-frequency domain and a multi-frame deep post filter. All filters are complex-valued and are estimated by a deep-learning model using weight-sharing through Group Communication and quantization-aware training for reducing model size and computational footprint. For a further increase in performance, a low bit rate binaural link for delayed binaural features is proposed to use binaural information while retaining a latency of 2ms. The performance of an oracle binaural LCMV beamformer in non-low-latency configuration can be matched even by a unilateral configuration of the GCFSnet in terms of objective metrics.
△ Less
Submitted 17 July, 2023;
originally announced July 2023.
-
Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping
Authors:
Paul M. Reuter,
Christian Rollwage,
Bernd T. Meyer
Abstract:
In this paper, we propose a multilingual query-by-example keyword spotting (KWS) system based on a residual neural network. The model is trained as a classifier on a multilingual keyword dataset extracted from Common Voice sentences and fine-tuned using circle loss. We demonstrate the generalization ability of the model to new languages and report a mean reduction in EER of 59.2 % for previously s…
▽ More
In this paper, we propose a multilingual query-by-example keyword spotting (KWS) system based on a residual neural network. The model is trained as a classifier on a multilingual keyword dataset extracted from Common Voice sentences and fine-tuned using circle loss. We demonstrate the generalization ability of the model to new languages and report a mean reduction in EER of 59.2 % for previously seen and 47.9 % for unseen languages compared to a competitive baseline. We show that the word embeddings learned by the KWS model can be accurately predicted from the phoneme sequences using a simple LSTM model. Our system achieves a promising accuracy for streaming keyword spotting and keyword search on Common Voice audio using just 5 examples per keyword. Experiments on the Hey-Snips dataset show a good performance with a false negative rate of 5.4 % at only 0.1 false alarms per hour.
△ Less
Submitted 19 April, 2023;
originally announced April 2023.
-
tPLCnet: Real-time Deep Packet Loss Concealment in the Time Domain Using a Short Temporal Context
Authors:
Nils L. Westhausen,
Bernd T. Meyer
Abstract:
This paper introduces a real-time time-domain packet loss concealment (PLC) neural-network (tPLCnet). It efficiently predicts lost frames from a short context buffer in a sequence-to-one (seq2one) fashion. Because of its seq2one structure, a continuous inference of the model is not required since it can be triggered when packet loss is actually detected. It is trained on 64h of open-source speech…
▽ More
This paper introduces a real-time time-domain packet loss concealment (PLC) neural-network (tPLCnet). It efficiently predicts lost frames from a short context buffer in a sequence-to-one (seq2one) fashion. Because of its seq2one structure, a continuous inference of the model is not required since it can be triggered when packet loss is actually detected. It is trained on 64h of open-source speech data and packet-loss traces of real calls provided by the Audio PLC Challenge. The model with the lowest complexity described in this paper reaches a robust PLC performance and consistent improvements over the zero-filling baseline for all metrics. A configuration with higher complexity is submitted to the PLC Challenge and shows a performance increase of 1.07 compared to the zero-filling baseline in terms of PLC-MOS on the blind test set and reaches a competitive 3rd place in the challenge ranking.
△ Less
Submitted 4 April, 2022;
originally announced April 2022.
-
Prediction of speech intelligibility with DNN-based performance measures
Authors:
Angel Mario Castro Martinez,
Constantin Spille,
Jana Roßbach,
Birger Kollmeier,
Bernd T. Meyer
Abstract:
This paper presents a speech intelligibility model based on automatic speech recognition (ASR), combining phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities. This model does not require the clean speech reference nor the word labels during testing as the ASR decoding step, which finds the most likely sequence…
▽ More
This paper presents a speech intelligibility model based on automatic speech recognition (ASR), combining phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities. This model does not require the clean speech reference nor the word labels during testing as the ASR decoding step, which finds the most likely sequence of words given phoneme posterior probabilities, is omitted. The model is evaluated via the root-mean-squared error between the predicted and observed speech reception thresholds from eight normal-hearing listeners. The recognition task consists of identifying noisy words from a German matrix sentence test. The speech material was mixed with eight noise maskers covering different modulation types, from speech-shaped stationary noise to a single-talker masker. The prediction performance is compared to five established models and an ASR-model using word labels. Two combinations of features and networks were tested. Both include temporal information either at the feature level (amplitude modulation filterbanks and a feed-forward network) or captured by the architecture (mel-spectrograms and a time-delay deep neural network, TDNN). The TDNN model is on par with the DNN while reducing the number of parameters by a factor of 37; this optimization allows parallel streams on dedicated hearing aid hardware as a forward-pass can be computed within the 10ms of each frame. The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
△ Less
Submitted 17 March, 2022;
originally announced March 2022.
-
Reduction of Subjective Listening Effort for TV Broadcast Signals with Recurrent Neural Networks
Authors:
Nils L. Westhausen,
Rainer Huber,
Hannah Baumgartner,
Ragini Sinha,
Jan Rennies,
Bernd T. Meyer
Abstract:
Listening to the audio of TV broadcast signals can be challenging for hearing-impaired as well as normal-hearing listeners, especially when background sounds are prominent or too loud compared to the speech signal. This can result in a reduced satisfaction and increased listening effort of the listeners. Since the broadcast sound is usually premixed, we perform a subjective evaluation for quantify…
▽ More
Listening to the audio of TV broadcast signals can be challenging for hearing-impaired as well as normal-hearing listeners, especially when background sounds are prominent or too loud compared to the speech signal. This can result in a reduced satisfaction and increased listening effort of the listeners. Since the broadcast sound is usually premixed, we perform a subjective evaluation for quantifying the potential of speech enhancement systems based on audio source separation and recurrent neural networks (RNN). Recently, RNNs have shown promising results in the context of sound source separation and real-time signal processing. In this paper, we separate the speech from the background signals and remix the separated sounds at a higher signal-to-noise ratio. This differs from classic speech enhancement, where usually only the extracted speech signal is exploited. The subjective evaluation with 20 normal-hearing subjects on real TV-broadcast material shows that our proposed enhancement system is able to reduce the listening effort by around 2 points on a 13-point listening effort rating scale and increases the perceived sound quality compared to the original mixture.
△ Less
Submitted 2 November, 2021;
originally announced November 2021.
-
Acoustic echo cancellation with the dual-signal transformation LSTM network
Authors:
Nils L. Westhausen,
Bernd T. Meyer
Abstract:
This paper applies the dual-signal transformation LSTM network (DTLN) to the task of real-time acoustic echo cancellation (AEC). The DTLN combines a short-time Fourier transformation and a learned feature representation in a stacked network approach, which enables robust information processing in the time-frequency and in the time domain, which also includes phase information. The model is only tr…
▽ More
This paper applies the dual-signal transformation LSTM network (DTLN) to the task of real-time acoustic echo cancellation (AEC). The DTLN combines a short-time Fourier transformation and a learned feature representation in a stacked network approach, which enables robust information processing in the time-frequency and in the time domain, which also includes phase information. The model is only trained on 60~h of real and synthetic echo scenarios. The training setup includes multi-lingual speech, data augmentation, additional noise and reverberation to create a model that should generalize well to a large variety of real-world conditions. The DTLN approach produces state-of-the-art performance on clean and noisy echo conditions reducing acoustic echo and additional noise robustly. The method outperforms the AEC-Challenge baseline by 0.30 in terms of Mean Opinion Score (MOS).
△ Less
Submitted 23 November, 2020; v1 submitted 27 October, 2020;
originally announced October 2020.
-
EEG-based Auditory Attention Decoding: Towards Neuro-Steered Hearing Devices
Authors:
Simon Geirnaert,
Servaas Vandecappelle,
Emina Alickovic,
Alain de Cheveigné,
Edmund Lalor,
Bernd T. Meyer,
Sina Miran,
Tom Francart,
Alexander Bertrand
Abstract:
People suffering from hearing impairment often have difficulties participating in conversations in so-called `cocktail party' scenarios with multiple people talking simultaneously. Although advanced algorithms exist to suppress background noise in these situations, a hearing device also needs information on which of these speakers the user actually aims to attend to. The correct (attended) speaker…
▽ More
People suffering from hearing impairment often have difficulties participating in conversations in so-called `cocktail party' scenarios with multiple people talking simultaneously. Although advanced algorithms exist to suppress background noise in these situations, a hearing device also needs information on which of these speakers the user actually aims to attend to. The correct (attended) speaker can then be enhanced using this information, and all other speakers can be treated as background noise. Recent neuroscientific advances have shown that it is possible to determine the focus of auditory attention from non-invasive neurorecording techniques, such as electroencephalography (EEG). Based on these new insights, a multitude of auditory attention decoding (AAD) algorithms have been proposed, which could, combined with the appropriate speaker separation algorithms and miniaturized EEG sensor devices, lead to so-called neuro-steered hearing devices. In this paper, we provide a broad review and a statistically grounded comparative study of EEG-based AAD algorithms and address the main signal processing challenges in this field.
△ Less
Submitted 23 April, 2021; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Dual-Signal Transformation LSTM Network for Real-Time Noise Suppression
Authors:
Nils L. Westhausen,
Bernd T. Meyer
Abstract:
This paper introduces a dual-signal transformation LSTM network (DTLN) for real-time speech enhancement as part of the Deep Noise Suppression Challenge (DNS-Challenge). This approach combines a short-time Fourier transform (STFT) and a learned analysis and synthesis basis in a stacked-network approach with less than one million parameters. The model was trained on 500 h of noisy speech provided by…
▽ More
This paper introduces a dual-signal transformation LSTM network (DTLN) for real-time speech enhancement as part of the Deep Noise Suppression Challenge (DNS-Challenge). This approach combines a short-time Fourier transform (STFT) and a learned analysis and synthesis basis in a stacked-network approach with less than one million parameters. The model was trained on 500 h of noisy speech provided by the challenge organizers. The network is capable of real-time processing (one frame in, one frame out) and reaches competitive results. Combining these two types of signal transformations enables the DTLN to robustly extract information from magnitude spectra and incorporate phase information from the learned feature basis. The method shows state-of-the-art performance and outperforms the DNS-Challenge baseline by 0.24 points absolute in terms of the mean opinion score (MOS).
△ Less
Submitted 22 October, 2020; v1 submitted 15 May, 2020;
originally announced May 2020.
-
DNN-Based Speech Presence Probability Estimation for Multi-Frame Single-Microphone Speech Enhancement
Authors:
Marvin Tammen,
Dörte Fischer,
Bernd T. Meyer,
Simon Doclo
Abstract:
Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-power-distortionless-response (MFMPDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches such as the Wiener gain, it has been shown that multi-frame approaches achieve a substantial noise reduction with hardly any speech distortion, pro…
▽ More
Multi-frame approaches for single-microphone speech enhancement, e.g., the multi-frame minimum-power-distortionless-response (MFMPDR) filter, are able to exploit speech correlations across neighboring time frames. In contrast to single-frame approaches such as the Wiener gain, it has been shown that multi-frame approaches achieve a substantial noise reduction with hardly any speech distortion, provided that an accurate estimate of the correlation matrices and especially the speech interframe correlation (IFC) vector is available. Typical estimation procedures of the IFC vector require an estimate of the speech presence probability (SPP) in each time-frequency (TF) bin. In this paper, we propose to use a bi-directional long short-term memory deep neural network (DNN) to estimate the SPP for each TF bin. Aiming at achieving a robust performance, the DNN is trained for various noise types and within a large signal-to-noise-ratio range. Experimental results show that the MFMPDR in combination with the proposed data-driven SPP estimator yields an increased speech quality compared to a state-of-the-art model-based SPP estimator. Furthermore, it is confirmed that exploiting interframe correlations in the MFMPDR is beneficial when compared to the Wiener gain especially in adverse scenarios.
△ Less
Submitted 14 November, 2022; v1 submitted 21 May, 2019;
originally announced May 2019.