Showing 1–47 of 47 results for author: Habets, E

  1. arXiv:2506.09521  [pdf]

    eess.AS cs.CL

    You Are What You Say: Exploiting Linguistic Content for VoicePrivacy Attacks

    Authors: Ünal Ege Gaznepoglu, Anna Leschanowsky, Ahmad Aloradi, Prachi Singh, Daniel Tenbrinck, Emanuël A. P. Habets, Nils Peters

    Abstract: Speaker anonymization systems hide the identity of speakers while preserving other information such as linguistic content and emotions. To evaluate their privacy benefits, attacks in the form of automatic speaker verification (ASV) systems are employed. In this study, we assess the impact of intra-speaker linguistic content similarity in the attacker training and evaluation datasets, by adapting B…

    Submitted 11 June, 2025; originally announced June 2025.

    Comments: 5 pages, 6 figures, 1 table, accepted at INTERSPEECH 2025

  2. arXiv:2506.01731  [pdf, other]

    eess.AS

    Benchmarking Neural Speech Codec Intelligibility with SITool

    Authors: Anna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, Emanuël A. P. Habets

    Abstract: Speech intelligibility assessment is essential for evaluating neural speech codecs, yet most evaluation efforts focus on overall quality rather than intelligibility. Only a few publicly available tools exist for conducting standardized intelligibility tests, like the Diagnostic Rhyme Test (DRT) and Modified Rhyme Test (MRT). We introduce the Speech Intelligibility Toolkit for Subjective Evaluation…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: submitted to Interspeech

  3. arXiv:2505.19760  [pdf, ps, other]

    eess.AS

    Navigating PESQ: Up-to-Date Versions and Open Implementations

    Authors: Matteo Torcoli, Mhd Modar Halimeh, Emanuël A. P. Habets

    Abstract: Perceptual Evaluation of Speech Quality (PESQ) is an objective quality measure that remains widely used despite its withdrawal by the International Telecommunication Union (ITU). PESQ has evolved over two decades, with multiple versions and publicly available implementations emerging during this time. The numerous versions and their updates can be overwhelming, especially for new PESQ users. This…

    Submitted 26 May, 2025; originally announced May 2025.

  4. arXiv:2504.00742  [pdf, other]

    eess.AS

    Expanding and Analyzing ODAQ -- the Open Dataset of Audio Quality

    Authors: Sascha Dick, Christoph Thompson, Chih-Wei Wu, Matteo Torcoli, Pablo Delgado, Phillip A. Williams, Emanuel Habets

    Abstract: The Open Dataset of Audio Quality (ODAQ) was recently introduced to address the scarcity of openly available audio datasets with corresponding subjective quality scores. The dataset, released under permissive licenses, comprises audio material processed using six different signal processing methods operating at five quality levels, along with corresponding subjective test results. To expand the da…

    Submitted 1 April, 2025; originally announced April 2025.

    Comments: Accepted for presentation at the Audio Engineering Society (AES) 157th Convention, October 2024, New York, USA

  5. arXiv:2503.03304  [pdf, ps, other]

    eess.AS

    On the Relation Between Speech Quality and Quantized Latent Representations of Neural Codecs

    Authors: Mhd Modar Halimeh, Matteo Torcoli, Philipp Grundhuber, Emanuël A. P. Habets

    Abstract: Neural audio signal codecs have attracted significant attention in recent years. In essence, the impressive low bitrate achieved by such encoders is enabled by learning an abstract representation that captures the properties of encoded signals, e.g., speech. In this work, we investigate the relation between the latent representation of the input signal learned by a neural codec and the quality of…

    Submitted 5 March, 2025; originally announced March 2025.

  6. Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron

    Authors: Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuel A. P. Habets

    Abstract: In recent years, several text-to-speech systems have been proposed to synthesize natural speech in zero-shot, few-shot, and low-resource scenarios. However, these methods typically require training with data from many different speakers. The speech quality across the speaker set typically is diverse and imposes an upper limit on the quality achievable for the low-resource speaker. In the current w…

    Submitted 10 January, 2025; originally announced January 2025.

    Comments: Accepted for publication at the 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025) to be held at Hyderabad, India

  7. arXiv:2410.13620  [pdf, other]

    eess.AS cs.SD eess.SP

    Align-ULCNet: Towards Low-Complexity and Robust Acoustic Echo and Noise Reduction

    Authors: Shrishti Saha Shetu, Naveen Kumar Desiraju, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: The successful deployment of deep learning-based acoustic echo and noise reduction (AENR) methods in consumer devices has spurred interest in developing low-complexity solutions, while emphasizing the need for robust performance in real-life applications. In this work, we propose a hybrid approach to enhance the state-of-the-art (SOTA) ULCNet model by integrating time alignment and parallel encode…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 5 pages, 4 figures

  8. arXiv:2410.13599  [pdf, other]

    eess.AS cs.SD eess.SP

    GAN-Based Speech Enhancement for Low SNR Using Latent Feature Conditioning

    Authors: Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

    Abstract: Enhancing speech quality under adverse SNR conditions remains a significant challenge for discriminative deep neural network (DNN)-based approaches. In this work, we propose DisCoGAN, which is a time-frequency-domain generative adversarial network (GAN) conditioned by the latent features of a discriminative model pre-trained for speech enhancement in low SNR scenarios. Our proposed method achieves…

    Submitted 17 October, 2024; originally announced October 2024.

    Comments: 5 pages, 2 figures

  9. arXiv:2409.13502  [pdf, other]

    eess.AS cs.SD

    Neural Directional Filtering: Far-Field Directivity Control With a Small Microphone Array

    Authors: Julian Wechsler, Srikanth Raj Chetupalli, Mhd Modar Halimeh, Oliver Thiergart, Emanuël A. P. Habets

    Abstract: Capturing audio signals with specific directivity patterns is essential in speech communication. This study presents a deep neural network (DNN)-based approach to directional filtering, alleviating the need for explicit signal models. More specifically, our proposed method uses a DNN to estimate a single-channel complex mask from the signals of a microphone array. This mask is then applied to a re…

    Submitted 20 September, 2024; originally announced September 2024.

    Comments: Presented at the International Workshop on Acoustic Signal Enhancement (IWAENC), 2024

  10. arXiv:2408.15746  [pdf, other]

    eess.AS cs.SD

    A Hybrid Approach for Low-Complexity Joint Acoustic Echo and Noise Reduction

    Authors: Shrishti Saha Shetu, Naveen Kumar Desiraju, Jose Miguel Martinez Aponte, Emanuël A. P. Habets, Edwin Mabande

    Abstract: Deep learning-based methods that jointly perform the task of acoustic echo and noise reduction (AENR) often require high memory and computational resources, making them unsuitable for real-time deployment on low-resource platforms such as embedded devices. We propose a low-complexity hybrid approach for joint AENR by employing a single model to suppress both residual echo and noise components. Spe…

    Submitted 28 August, 2024; originally announced August 2024.

    Comments: 5 pages, 2 figures

  11. arXiv:2408.14582  [pdf, ps, other]

    eess.AS cs.SD

    Comparative Analysis Of Discriminative Deep Learning-Based Noise Reduction Methods In Low SNR Scenarios

    Authors: Shrishti Saha Shetu, Emanuël A. P. Habets, Andreas Brendel

    Abstract: In this study, we conduct a comparative analysis of deep learning-based noise reduction methods in low signal-to-noise ratio (SNR) scenarios. Our investigation primarily focuses on five key aspects: The impact of training data, the influence of various loss functions, the effectiveness of direct and indirect speech estimation techniques, the efficacy of masking, mapping, and deep filtering methodo…

    Submitted 26 August, 2024; originally announced August 2024.

    Comments: 5 pages, 4 figures

  12. arXiv:2408.08729  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    ConcateNet: Dialogue Separation Using Local And Global Feature Concatenation

    Authors: Mhd Modar Halimeh, Matteo Torcoli, Emanuël Habets

    Abstract: Dialogue separation involves isolating a dialogue signal from a mixture, such as a movie or a TV program. This can be a necessary step to enable dialogue enhancement for broadcast-related applications. In this paper, ConcateNet for dialogue separation is proposed, which is based on a novel approach for processing local and global features aimed at better generalization for out-of-domain signals. C…

    Submitted 16 August, 2024; originally announced August 2024.

  13. arXiv:2407.19989  [pdf, other]

    eess.AS

    Blind Acoustic Parameter Estimation Through Task-Agnostic Embeddings Using Latent Approximations

    Authors: Philipp Götz, Cagdas Tuna, Andreas Brendel, Andreas Walther, Emanuël A. P. Habets

    Abstract: We present a method for blind acoustic parameter estimation from single-channel reverberant speech. The method is structured into three stages. In the first stage, a variational auto-encoder is trained to extract latent representations of acoustic impulse responses represented as mel-spectrograms. In the second stage, a separate speech encoder is trained to estimate low-dimensional representations…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted for publication at IWAENC 2024

  14. arXiv:2406.06403  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Meta Learning Text-to-Speech Synthesis in over 7000 Languages

    Authors: Florian Lux, Sarina Meyer, Lyonel Behringer, Frank Zalkow, Phat Do, Matt Coler, Emanuël A. P. Habets, Ngoc Thang Vu

    Abstract: In this work, we take on the challenging task of building a single text-to-speech synthesis system that is capable of generating speech in over 7000 languages, many of which lack sufficient data for traditional TTS development. By leveraging a novel integration of massively multilingual pretraining and meta learning to approximate language representations, our approach enables zero-shot speech syn…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: accepted at Interspeech 2024

  15. arXiv:2405.17364  [pdf, other]

    eess.AS

    Speech Loudness in Broadcasting and Streaming

    Authors: Matteo Torcoli, Mhd Modar Halimeh, Thomas Leitz, Yannik Grewe, Michael Kratschmer, Bernhard Neugebauer, Adrian Murtaza, Harald Fuchs, Emanuël A. P. Habets

    Abstract: The introduction and regulation of loudness in broadcasting and streaming brought clear benefits to the audience, e.g., a level of uniformity across programs and channels. Yet, speech loudness is frequently reported as being too low in certain passages, which can hinder the full understanding and enjoyment of movies and TV programs. This paper proposes expanding the set of loudness-based measures…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: Accepted for presentation at the Audio Engineering Society (AES) 156th Convention, June 2024, Madrid, Spain

  16. arXiv:2402.06246  [pdf, other]

    eess.AS cs.SD

    Data-driven Joint Detection and Localization of Acoustic Reflectors

    Authors: H. Nazim Bicer, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: Room geometry inference algorithms rely on the localization of acoustic reflectors to identify boundary surfaces of an enclosure. Rooms with highly absorptive walls or walls at large distances from the measurement setup pose challenges for such algorithms. As it is not always possible to localize all walls, we present a data-driven method to jointly detect and localize acoustic reflectors that cor…

    Submitted 9 February, 2024; originally announced February 2024.

    Comments: 4+1(bib) Pages. Accepted to ICASSP Satellite Workshop - HSCMA 2024

  17. arXiv:2401.00197  [pdf, other]

    eess.AS

    ODAQ: Open Dataset of Audio Quality

    Authors: Matteo Torcoli, Chih-Wei Wu, Sascha Dick, Phillip A. Williams, Mhd Modar Halimeh, William Wolcott, Emanuel A. P. Habets

    Abstract: Research into the prediction and analysis of perceived audio quality is hampered by the scarcity of openly available datasets of audio signals accompanied by corresponding subjective quality scores. To address this problem, we present the Open Dataset of Audio Quality (ODAQ), a new dataset containing the results of a MUSHRA listening test conducted with expert listeners from 2 international labora…

    Submitted 30 December, 2023; originally announced January 2024.

    Comments: Accepted paper. IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), Seoul, Korea, April 2024

  18. arXiv:2309.03486  [pdf, other]

    eess.AS cs.SD

    Simulating room transfer functions between transducers mounted on audio devices using a modified image source method

    Authors: Zeyu Xu, Adrian Herzog, Alexander Lodermeyer, Emanuël A. P. Habets, Albert G. Prinn

    Abstract: The image source method (ISM) is often used to simulate room acoustics due to its ease of use and computational efficiency. The standard ISM is limited to simulations of room impulse responses between point sources and omnidirectional receivers. In this work, the ISM is extended using spherical harmonic directivity coefficients to include acoustic diffraction effects due to source and receiver tra…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: The following article has been submitted to the Journal of the Acoustical Society of America (JASA). After it is published, it will be found at http://asa.scitation.org/journal/jas

  19. arXiv:2308.14611  [pdf, other]

    eess.AS cs.SD

    Data-driven 3D Room Geometry Inference with a Linear Loudspeaker Array and a Single Microphone

    Authors: Cagdas Tuna, Altan Akat, H. Nazim Bicer, Andreas Walther, Emanuël A. P. Habets

    Abstract: Knowing the room geometry may be very beneficial for many audio applications, including sound reproduction, acoustic scene analysis, and sound source localization. Room geometry inference (RGI) deals with the problem of reflector localization (RL) based on a set of room impulse responses (RIRs). Motivated by the increasing popularity of commercially available soundbars, this article presents a dat…

    Submitted 28 August, 2023; originally announced August 2023.

    Comments: Accepted for publication in Forum Acusticum 2023

  20. arXiv:2306.10152  [pdf, other]

    eess.AS cs.SD

    Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

    Authors: Kishor Kayyar Lakshminarayana, Christian Dittmar, Nicola Pia, Emanuël Habets

    Abstract: Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentatio…

    Submitted 16 June, 2023; originally announced June 2023.

    Comments: Accepted for publication at EUSIPCO-2023, Helsinki

  21. arXiv:2305.19100  [pdf, other]

    eess.AS cs.SD

    Predicting Preferred Dialogue-to-Background Loudness Difference in Dialogue-Separated Audio

    Authors: Luca Resti, Martin Strauss, Matteo Torcoli, Emanuël Habets, Bernd Edler

    Abstract: Dialogue Enhancement (DE) enables the rebalancing of dialogue and background sounds to fit personal preferences and needs in the context of broadcast audio. When individual audio stems are unavailable from production, Dialogue Separation (DS) can be applied to the final audio mixture to obtain estimates of these stems. This work focuses on Preferred Loudness Differences (PLDs) between dialogue and…

    Submitted 31 May, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: Paper accepted at the 15th International Conference on Quality of Multimedia Experience (QoMEX), 4 pages, 2 figures

  22. arXiv:2303.13453  [pdf, other]

    eess.AS cs.SD

    Better Together: Dialogue Separation and Voice Activity Detection for Audio Personalization in TV

    Authors: Matteo Torcoli, Emanuël A. P. Habets

    Abstract: In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This i…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: Paper accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023), Rhodes, Greece

  23. arXiv:2303.08702  [pdf, other]

    eess.AS cs.SD

    Beamformer-Guided Target Speaker Extraction

    Authors: Mohamed Elminshawi, Srikanth Raj Chetupalli, Emanuël A. P. Habets

    Abstract: We propose a Beamformer-guided Target Speaker Extraction (BG-TSE) method to extract a target speaker's voice from a multi-channel recording informed by the direction of arrival of the target. The proposed method employs a front-end beamformer steered towards the target speaker to provide an auxiliary signal to a single-channel TSE system. By allowing for time-varying embeddings in the single-chann…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Submitted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

  24. arXiv:2303.07143  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Microphone Speaker Separation by Spatial Regions

    Authors: Julian Wechsler, Srikanth Raj Chetupalli, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: We consider the task of region-based source separation of reverberant multi-microphone recordings. We assume pre-defined spatial regions with a single active source per region. The objective is to estimate the signals from the individual spatial regions as captured by a reference microphone while retaining a correspondence between signals and spatial regions. We propose a data-driven approach usin…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Submitted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing

  25. arXiv:2302.11205  [pdf, other]

    eess.AS cs.SD

    Contrastive Representation Learning for Acoustic Parameter Estimation

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: A study is presented in which a contrastive learning approach is used to extract low-dimensional representations of the acoustic environment from single-channel, reverberant speech signals. Convolution of room impulse responses (RIRs) with anechoic source signals is leveraged as a data augmentation technique that offers considerable flexibility in the design of the upstream task. We evaluate the e…

    Submitted 13 March, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: Accepted for ICASSP 2023, Camera-ready version

  26. arXiv:2212.13442  [pdf, other]

    eess.IV cs.MM cs.SD eess.AS

    Audiovisual Database with 360 Video and Higher-Order Ambisonics Audio for Perception, Cognition, Behavior, and QoE Evaluation Research

    Authors: Thomas Robotham, Ashutosh Singla, Olli S. Rummukainen, Alexander Raake, Emanuël A. P. Habets

    Abstract: Research into multi-modal perception, human cognition, behavior, and attention can benefit from high-fidelity content that may recreate real-life-like scenes when rendered on head-mounted displays. Moreover, aspects of audiovisual perception, cognitive processes, and behavior may complement questionnaire-based Quality of Experience (QoE) evaluation of interactive virtual environments. Currently, t…

    Submitted 27 December, 2022; originally announced December 2022.

    Comments: 6 pages, 2 figures, accepted and presented at the 2022 14th International Conference on Quality of Multimedia Experience (QoMEX). Database is publicly accessible at https://qoevave.github.io/database/

  27. arXiv:2208.03023  [pdf, other]

    eess.AS cs.SD

    AID: Open-source Anechoic Interferer Dataset

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: A dataset of anechoic recordings of various sound sources encountered in domestic environments is presented. The dataset is intended to be a resource of non-stationary, environmental noise signals that, when convolved with acoustic impulse responses, can be used to simulate complex acoustic scenes. Additionally, a Python library is provided to generate random mixtures of the recordings in the data…

    Submitted 5 August, 2022; originally announced August 2022.

    Comments: Accepted for publication at IWAENC 2022

  28. Dialogue Enhancement and Listening Effort in Broadcast Audio: A Multimodal Evaluation

    Authors: Matteo Torcoli, Thomas Robotham, Emanuël A. P. Habets

    Abstract: Dialogue enhancement (DE) plays a vital role in broadcasting, enabling the personalization of the relative level between foreground speech and background music and effects. DE has been shown to improve the quality of experience, intelligibility, and self-reported listening effort (LE). A physiological indicator of LE known from audiology studies is pupil size. The relation between pupil size and L…

    Submitted 3 August, 2022; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: Paper accepted to 14th International Conference on Quality of Multimedia Experience (QoMEX), Lippstadt, Germany, 2022 - version 2 fixes some typos

  29. arXiv:2206.13808  [pdf, other]

    eess.AS cs.SD

    Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

    Authors: Ahmad Aloradi, Wolfgang Mack, Mohamed Elminshawi, Emanuël A. P. Habets

    Abstract: Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional embedding from a speech utterance that encodes the speaker's voice characteristics. A speaker is verified if his/her voice embedding is sufficiently similar to the emb…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: To be presented at EUSIPCO 2022

  30. arXiv:2206.06184  [pdf, other]

    eess.AS eess.SP

    AmbiSep: Ambisonic-to-Ambisonic Reverberant Speech Separation Using Transformer Networks

    Authors: Adrian Herzog, Srikanth Raj Chetupalli, Emanuël A. P. Habets

    Abstract: Consider a multichannel Ambisonic recording containing a mixture of several reverberant speech signals. Retrieving the reverberant Ambisonic signals corresponding to the individual speech sources blindly from the mixture is a challenging task as it requires estimating multiple signal channels for each source. In this work, we propose AmbiSep, a deep neural network-based plane-wave domain masking…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Preprint submitted to IWAENC 2022 (https://iwaenc2022.org)

  31. arXiv:2205.01897  [pdf, other]

    eess.AS cs.LG cs.SD

    Virtual Analog Modeling of Distortion Circuits Using Neural Ordinary Differential Equations

    Authors: Jan Wilczek, Alec Wright, Vesa Välimäki, Emanuël Habets

    Abstract: Recent research in deep learning has shown that neural networks can learn differential equations governing dynamical systems. In this paper, we adapt this concept to Virtual Analog (VA) modeling to learn the ordinary differential equations (ODEs) governing the first-order and the second-order diode clipper. The proposed models achieve performance comparable to state-of-the-art recurrent neural net…

    Submitted 1 July, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

    Comments: 8 pages, 10 figures, accepted for DAFx 2022 conference, for associated audio examples, see https://thewolfsound.com/publications/dafx2022/

  32. Blind Reverberation Time Estimation in Dynamic Acoustic Conditions

    Authors: Philipp Götz, Cagdas Tuna, Andreas Walther, Emanuël A. P. Habets

    Abstract: The estimation of reverberation time from real-world signals plays a central role in a wide range of applications. In many scenarios, acoustic conditions change over time which in turn requires the estimate to be updated continuously. Previously proposed methods involving deep neural networks were mostly designed and tested under the assumption of static acoustic conditions. In this work, we show…

    Submitted 23 February, 2022; originally announced February 2022.

    Comments: accepted for publication in ICASSP 2022

  33. arXiv:2202.00733  [pdf, other]

    eess.AS cs.SD

    New Insights on Target Speaker Extraction

    Authors: Mohamed Elminshawi, Wolfgang Mack, Srikanth Raj Chetupalli, Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Speaker extraction (SE) aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. Several forms of auxiliary information have been employed in single-channel SE, such as a speech snippet enrolled from the target speaker or visual information corresponding to the spoken utterance. The effectiveness of the auxiliary information in…

    Submitted 15 September, 2023; v1 submitted 1 February, 2022; originally announced February 2022.

  34. Signal-Aware Direction-of-Arrival Estimation Using Attention Mechanisms

    Authors: Wolfgang Mack, Julian Wechsler, Emanuël A. P. Habets

    Abstract: The direction-of-arrival (DOA) of sound sources is an essential acoustic parameter used, e.g., for multi-channel speech enhancement or source tracking. Complex acoustic scenarios consisting of sources-of-interest, interfering sources, reverberation, and noise make the estimation of the DOAs corresponding to the sources-of-interest a challenging task. Recently proposed attention mechanisms allow DO…

    Submitted 3 January, 2022; originally announced January 2022.

  35. arXiv:2011.04569  [pdf, other]

    eess.AS cs.AI cs.SD

    Informed Source Extraction With Application to Acoustic Echo Reduction

    Authors: Mohamed Elminshawi, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference snippet uttered by the target speaker into a single embedding vector that encapsulates the characteristics of the target speaker. However, such modeling delibera…

    Submitted 26 October, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: Published at ITG 2021

    Report number: 978-3-8007-5627-8

  36. Efficient Training Data Generation for Phase-Based DOA Estimation

    Authors: Fabian Hübner, Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Deep learning (DL) based direction of arrival (DOA) estimation is an active research topic and currently represents the state-of-the-art. Usually, DL-based DOA estimators are trained with recorded data or computationally expensive generated data. Both data types require significant storage and excessive time to, respectively, record or generate. We propose a low complexity online data generation m…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021

  37. arXiv:2011.04359  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD eess.IV

    An Empirical Study of Visual Features for DNN based Audio-Visual Speech Enhancement in Multi-talker Environments

    Authors: Shrishti Saha Shetu, Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Audio-visual speech enhancement (AVSE) methods use both audio and visual features for the task of speech enhancement and the use of visual features has been shown to be particularly effective in multi-speaker scenarios. In the majority of deep neural network (DNN) based AVSE methods, the audio and visual data are first processed separately using different sub-networks, and then the learned feature…

    Submitted 9 November, 2020; originally announced November 2020.

  38. arXiv:2005.00145  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Unsupervised Domain Adaptation for Acoustic Scene Classification Using Band-Wise Statistics Matching

    Authors: Alessandro Ilic Mezza, Emanuël A. P. Habets, Meinard Müller, Augusto Sarti

    Abstract: The performance of machine learning algorithms is known to be negatively affected by possible mismatches between training (source) and test (target) data distributions. In fact, this problem emerges whenever an acoustic scene classification system which has been trained on data recorded by a given device is applied to samples acquired under different acoustic conditions or captured by mismatched r…

    Submitted 30 April, 2020; originally announced May 2020.

    Comments: 5 pages, 1 figure, 3 tables, submitted to EUSIPCO 2020

  39. Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters

    Authors: Wolfgang Mack, Emanuël A. P. Habets

    Abstract: Signal extraction from a single-channel mixture with additional undesired signals is most commonly performed using time-frequency (TF) masks. Typically, the mask is estimated with a deep neural network (DNN), and element-wise applied to the complex mixture short-time Fourier transform (STFT) representation to perform the extraction. Ideal mask magnitudes are zero for solely undesired signals in a…

    Submitted 9 December, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

  40. Modal Decomposition of Feedback Delay Networks

    Authors: Sebastian J. Schlecht, Emanuël A. P. Habets

    Abstract: Feedback delay networks (FDNs) belong to a general class of recursive filters which are widely used in sound synthesis and physical modeling applications. We present a numerical technique to compute the modal decomposition of the FDN transfer function. The proposed pole finding algorithm is based on the Ehrlich-Aberth iteration for matrix polynomials and has improved computational performance of u…

    Submitted 25 January, 2019; originally announced January 2019.

  41. Multi-scale aggregation of phase information for reducing computational cost of CNN based DOA estimation

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: In a recent work on direction-of-arrival (DOA) estimation of multiple speakers with convolutional neural networks (CNNs), the phase component of short-time Fourier transform (STFT) coefficients of the microphone signal is given as input and small filters are used to learn the phase relations between neighboring microphones. Due to this chosen filter size, $M-1$ convolution layers are required to a…

    Submitted 20 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1807.11722

  42. arXiv:1810.09708  [pdf, other]

    eess.AS cs.SD eess.SP

    On the difference-to-sum power ratio of speech and wind noise based on the Corcos model

    Authors: Daniele Mirabilii, Emanuël A. P. Habets

    Abstract: The difference-to-sum power ratio was proposed and used to suppress wind noise under specific acoustic conditions. In this contribution, a general formulation of the difference-to-sum power ratio associated with a mixture of speech and wind noise is proposed and analyzed. In particular, it is assumed that the complex coherence of convective turbulence can be modelled by the Corcos model. In contra…

    Submitted 23 October, 2018; originally announced October 2018.

    Comments: 5 pages, 3 figures, IEEE-ICSEE Eilat-Israel conference (special session)

  43. arXiv:1807.11722  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained with Noise Signals

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: Supervised learning based methods for source localization, being data driven, can be adapted to different acoustic conditions via training and have been shown to be robust to adverse acoustic environments. In this paper, a convolutional neural network (CNN) based supervised learning method for estimating the direction-of-arrival (DOA) of multiple speakers is proposed. Multi-speaker DOA estimation…

    Submitted 31 July, 2018; originally announced July 2018.

  44. Simulating Multi-channel Wind Noise Based on the Corcos Model

    Authors: Daniele Mirabilii, Emanuël A. P. Habets

    Abstract: A novel multi-channel artificial wind noise generator based on a fluid dynamics model, namely the Corcos model, is proposed. In particular, the model is used to approximate the complex coherence function of wind noise signals measured with closely-spaced microphones in the free-field and for time-invariant wind stream direction and speed. Preliminary experiments focus on a spatial analysis of reco…

    Submitted 23 July, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

    Comments: 5 pages, 2 figures, IWAENC 2018

  45. Classification vs. Regression in Supervised Learning for Single Channel Speaker Count Estimation

    Authors: Fabian-Robert Stöter, Soumitro Chakrabarty, Bernd Edler, Emanuël A. P. Habets

    Abstract: The task of estimating the maximum number of concurrent speakers from single channel mixtures is important for various audio-based applications, such as blind source separation, speaker diarisation, audio surveillance or auditory scene classification. Building upon powerful machine learning methodology, we develop a Deep Neural Network (DNN) that estimates a speaker count. While DNNs efficiently m…

    Submitted 15 February, 2018; v1 submitted 12 December, 2017; originally announced December 2017.

    Comments: Accepted in ICASSP 2018

  46. arXiv:1712.04276  [pdf, other]

    cs.SD eess.AS stat.ML

    Multi-Speaker Localization Using Convolutional Neural Network Trained with Noise

    Authors: Soumitro Chakrabarty, Emanuël A. P. Habets

    Abstract: The problem of multi-speaker localization is formulated as a multi-class multi-label classification problem, which is solved using a convolutional neural network (CNN) based source localization method. Utilizing the common assumption of disjoint speaker activities, we propose a novel method to train the CNN using synthesized noise signals. The proposed localization method is evaluated for two spea…

    Submitted 12 December, 2017; originally announced December 2017.

    Comments: Presented at Machine Learning for Audio Processing (ML4Audio) Workshop at NIPS 2017

  47. On Lossless Feedback Delay Networks

    Authors: Sebastian J. Schlecht, Emanuel A. P. Habets

    Abstract: Lossless Feedback Delay Networks (FDNs) are commonly used as a design prototype for artificial reverberation algorithms. The lossless property is dependent on the feedback matrix, which connects the output of a set of delays to their inputs, and the lengths of the delays. Both, unitary and triangular feedback matrices are known to constitute lossless FDNs, however, the most general class of lossle…

    Submitted 24 June, 2016; originally announced June 2016.