arXiv:2505.19760

Navigating PESQ: Up-to-Date Versions and Open Implementations

Authors: Matteo Torcoli, Mhd Modar Halimeh, Emanuël A. P. Habets

Abstract: Perceptual Evaluation of Speech Quality (PESQ) is an objective quality measure that remains widely used despite its withdrawal by the International Telecommunication Union (ITU). PESQ has evolved over two decades, with multiple versions and publicly available implementations emerging during this time. The numerous versions and their updates can be overwhelming, especially for new PESQ users. This work provides practical guidance on the different versions and implementations of PESQ. We show that differences can be significant, especially between PESQ versions. We stress the importance of specifying the exact version and implementation that is used to compute PESQ, and possibly to detail how multi-channel signals are handled. These practices would facilitate the interpretation of results and allow comparisons of PESQ scores between different studies. We also provide a repository that implements the latest corrections to PESQ, i.e., Corrigendum 2, which is not implemented by any other openly available distribution: https://github.com/audiolabs/PESQ.

Submitted 26 May, 2025; originally announced May 2025.

arXiv:2503.03304

On the Relation Between Speech Quality and Quantized Latent Representations of Neural Codecs

Authors: Mhd Modar Halimeh, Matteo Torcoli, Philipp Grundhuber, Emanuël A. P. Habets

Abstract: Neural audio signal codecs have attracted significant attention in recent years. In essence, the impressive low bitrate achieved by such encoders is enabled by learning an abstract representation that captures the properties of encoded signals, e.g., speech. In this work, we investigate the relation between the latent representation of the input signal learned by a neural codec and the quality of speech signals. To do so, we introduce Latent-representation-to-Quantization error Ratio (LQR) measures, which quantify the distance from the idealized neural codec's speech signal model for a given speech signal. We compare the proposed metrics to intrusive measures as well as data-driven supervised methods using two subjective speech quality datasets. This analysis shows that the proposed LQR correlates strongly (up to 0.9 Pearson's correlation) with the subjective quality of speech. Despite being a non-intrusive metric, this yields a competitive performance with, or even better than, other pre-trained and intrusive measures. These results show that LQR is a promising basis for more sophisticated speech quality measures.

Submitted 5 March, 2025; originally announced March 2025.

arXiv:2409.13502

Neural Directional Filtering: Far-Field Directivity Control With a Small Microphone Array

Authors: Julian Wechsler, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, Emanuël A. P. Habets

Abstract: Capturing audio signals with specific directivity patterns is essential in speech communication. This study presents a deep neural network (DNN)-based approach to directional filtering, alleviating the need for explicit signal models. More specifically, our proposed method uses a DNN to estimate a single-channel complex mask from the signals of a microphone array. This mask is then applied to a reference microphone to render a signal that exhibits a desired directivity pattern. We investigate the training dataset composition and its effect on the directivity realized by the DNN during inference. Using a relatively small DNN, the proposed method is found to approximate the desired directivity pattern closely. Additionally, it allows for the realization of higher-order directivity patterns using a small number of microphones, which is a difficult task for linear and parametric directional filtering.

Submitted 20 September, 2024; originally announced September 2024.

Comments: Presented at the International Workshop on Acoustic Signal Enhancement (IWAENC), 2024

arXiv:2408.08729

ConcateNet: Dialogue Separation Using Local And Global Feature Concatenation

Authors: Mhd Modar Halimeh, Matteo Torcoli, Emanuël Habets

Abstract: Dialogue separation involves isolating a dialogue signal from a mixture, such as a movie or a TV program. This can be a necessary step to enable dialogue enhancement for broadcast-related applications. In this paper, ConcateNet for dialogue separation is proposed, which is based on a novel approach for processing local and global features aimed at better generalization for out-of-domain signals. ConcateNet is trained using a noise reduction-focused, publicly available dataset and evaluated using three datasets: two noise reduction-focused datasets (in-domain), which show competitive performance for ConcateNet, and a broadcast-focused dataset (out-of-domain), which verifies the better generalization performance for the proposed architecture compared to considered state-of-the-art noise-reduction methods.

Submitted 16 August, 2024; originally announced August 2024.

arXiv:2405.17364

Speech Loudness in Broadcasting and Streaming

Authors: Matteo Torcoli, Mhd Modar Halimeh, Thomas Leitz, Yannik Grewe, Michael Kratschmer, Bernhard Neugebauer, Adrian Murtaza, Harald Fuchs, Emanuël A. P. Habets

Abstract: The introduction and regulation of loudness in broadcasting and streaming brought clear benefits to the audience, e.g., a level of uniformity across programs and channels. Yet, speech loudness is frequently reported as being too low in certain passages, which can hinder the full understanding and enjoyment of movies and TV programs. This paper proposes expanding the set of loudness-based measures typically used in the industry. We focus on speech loudness, and we show that, when clean speech is not available, Deep Neural Networks (DNNs) can be used to isolate the speech signal and so to accurately estimate speech loudness, providing a more precise estimate compared to speech-gated loudness. Moreover, we define critical passages, i.e., passages in which speech is likely to be hard to understand. Critical passages are defined based on the local Speech Loudness Deviation (SLD) and the local Speech-to-Background Loudness Difference (SBLD), as SLD and SBLD significantly contribute to intelligibility and listening effort. In contrast to other more comprehensive measures of intelligibility and listening effort, SLD and SBLD can be straightforwardly measured, are intuitive, and, most importantly, can be easily controlled by adjusting the speech level in the mix or by enabling personalization at the user's end. Finally, examples are provided that show how the detection of critical passages can support the evaluation and control of the speech signal during and after content production.

Submitted 27 May, 2024; originally announced May 2024.

Comments: Accepted for presentation at the Audio Engineering Society (AES) 156th Convention, June 2024, Madrid, Spain

arXiv:2401.00197

ODAQ: Open Dataset of Audio Quality

Authors: Matteo Torcoli, Chih-Wei Wu, Sascha Dick, Phillip A. Williams, Mhd Modar Halimeh, William Wolcott, Emanuel A. P. Habets

Abstract: Research into the prediction and analysis of perceived audio quality is hampered by the scarcity of openly available datasets of audio signals accompanied by corresponding subjective quality scores. To address this problem, we present the Open Dataset of Audio Quality (ODAQ), a new dataset containing the results of a MUSHRA listening test conducted with expert listeners from 2 international laboratories. ODAQ contains 240 audio samples and corresponding quality scores. Each audio sample is rated by 26 listeners. The audio samples are stereo audio signals sampled at 44.1 or 48 kHz and are processed by a total of 6 method classes, each operating at different quality levels. The processing method classes are designed to generate quality degradations possibly encountered during audio coding and source separation, and the quality levels for each method class span the entire quality range. The diversity of the processing methods, the large span of quality levels, the high sampling frequency, and the pool of international listeners make ODAQ particularly suited for further research into subjective and objective audio quality. The dataset is released with permissive licenses, and the software used to conduct the listening test is also made publicly available.

Submitted 30 December, 2023; originally announced January 2024.

Comments: Accepted paper. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Seoul, Korea, April 2024

arXiv:2210.15512

doi 10.1109/ICASSP49357.2023.10095196

Exploiting spatial information with the informed complex-valued spatial autoencoder for target speaker extraction

Authors: Annika Briegleb, Mhd Modar Halimeh, Walter Kellermann

Abstract: In conventional multichannel audio signal enhancement, spatial and spectral filtering are often performed sequentially. In contrast, it has been shown that for neural spatial filtering a joint approach of spectro-spatial filtering is more beneficial. In this contribution, we investigate the spatial filtering performed by such a time-varying spectro-spatial filter. We extend the recently proposed complex-valued spatial autoencoder (COSPA) for the task of target speaker extraction by leveraging its interpretable structure and purposefully informing the network of the target speaker's position. We show that the resulting informed COSPA (iCOSPA) effectively and flexibly extracts a target speaker from a mixture of speakers. We also find that the proposed architecture is well capable of learning pronounced spatial selectivity patterns and show that the results depend significantly on the training target and the reference signal when computing various evaluation metrics.

Submitted 14 March, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: Accepted to 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece. 5 pages, 2 figures

arXiv:2108.03130

Complex-valued Spatial Autoencoders for Multichannel Speech Enhancement

Authors: Mhd Modar Halimeh, Walter Kellermann

Abstract: In this contribution, we present a novel online approach to multichannel speech enhancement. The proposed method estimates the enhanced signal through a filter-and-sum framework. More specifically, complex-valued masks are estimated by a deep complex-valued neural network, termed the complex-valued spatial autoencoder. The proposed network is capable of exploiting as well as manipulating both the phase and the amplitude of the microphone signals. As shown by the experimental results, the proposed approach is able to exploit both spatial and spectral characteristics of the desired source signal resulting in a physically plausible spatial selectivity and superior speech quality compared to other baseline methods.

Submitted 6 August, 2021; originally announced August 2021.

arXiv:2012.08867

doi 10.23919/EUSIPCO54536.2021.9616295

A Synergistic Kalman- and Deep Postfiltering Approach to Acoustic Echo Cancellation

Authors: Thomas Haubner, Mhd. Modar Halimeh, Andreas Brendel, Walter Kellermann

Abstract: We introduce a synergistic approach to double-talk robust acoustic echo cancellation combining adaptive Kalman filtering with a deep neural network-based postfilter. The proposed algorithm overcomes the well-known limitations of Kalman filter-based adaptation control in scenarios characterized by abrupt echo path changes. As the key innovation, we suggest to exploit the different statistical properties of the interfering signal components for robustly estimating the adaptation step size. This is achieved by leveraging the postfilter near-end estimate and the estimation error of the Kalman filter. The proposed synergistic scheme allows for rapid reconvergence of the adaptive filter after abrupt echo path changes without compromising the steady state performance achieved by state-of-the-art approaches in static scenarios.

Submitted 4 March, 2022; v1 submitted 16 December, 2020; originally announced December 2020.

Comments: Accepted for European Signal Processing Conference (EUSIPCO), Dublin, Ireland, August 2021

Showing 1–9 of 9 results for author: Halimeh, M M