Search | arXiv e-print repository

arXiv:2506.04495

French Listening Tests for the Assessment of Intelligibility, Quality, and Identity of Body-Conducted Speech Enhancement

Authors: Thomas Joubaud, Julien Hauret, Véronique Zimpfer, Éric Bavu

Abstract: This study evaluates the Extreme Bandwidth Extension Network (EBEN) model on body-conduction sensors through listening tests. Using the Vibravox dataset, we assess intelligibility with a French Modified Rhyme Test, speech quality with a MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) protocol and speaker identity preservation with an A/B identification task. The experiments involved male and female speakers recorded with a forehead accelerometer, rigid in-ear and throat microphones. The results confirm that EBEN enhances both speech quality and intelligibility. It slightly degrades speaker identification performance when applied to female speakers' throat microphone recordings. The findings also demonstrate a correlation between Short-Time Objective Intelligibility (STOI) and perceived quality in body-conducted speech, while speaker verification using ECAPA2-TDNN aligns well with identification performance. No tested metric reliably predicts EBEN's effect on intelligibility.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Submitted to Interspeech 2025 (accepted)

arXiv:2506.04492

Bringing Interpretability to Neural Audio Codecs

Authors: Samir Sadok, Julien Hauret, Éric Bavu

Abstract: Neural audio codecs have grown in popularity due to their potential for efficiently modeling audio with transformers. Such advanced codecs convert a continuous waveform into discrete units at a low sampling rate. In contrast to semantic units, acoustic units may lack interpretability because their training objectives primarily focus on reconstruction performance. This paper proposes a two-step approach to explore the encoding of speech information within the codec tokens. The primary goal of the analysis stage is to gain deeper insight into how speech attributes such as content, identity, and pitch are encoded. The synthesis stage then trains an AnCoGen network for post-hoc explanation of codecs to extract speech attributes from the respective tokens directly.

Submitted 4 June, 2025; originally announced June 2025.

Comments: Submitted to Interspeech 2025 (accepted)

arXiv:2407.11828

Vibravox: A Dataset of French Speech Captured with Body-conduction Audio Sensors

Authors: Julien Hauret, Malo Olivier, Thomas Joubaud, Christophe Langrenne, Sarah Poirée, Véronique Zimpfer, Éric Bavu

Abstract: Vibravox is a dataset compliant with the General Data Protection Regulation (GDPR) containing audio recordings using five different body-conduction audio sensors: two in-ear microphones, two bone conduction vibration pickups, and a laryngophone. The dataset also includes audio data from an airborne microphone used as a reference. The Vibravox corpus contains 45 hours per sensor of speech samples and physiological sounds recorded by 188 participants under different acoustic conditions imposed by a high order ambisonics 3D spatializer. Annotations about the recording conditions and linguistic transcriptions are also included in the corpus. We conducted a series of experiments on various speech-related tasks, including speech recognition, speech enhancement, and speaker verification. These experiments were carried out using state-of-the-art models to evaluate and compare their performances on signals captured by the different audio sensors offered by the Vibravox dataset, with the aim of gaining a better grasp of their individual characteristics.

Submitted 26 March, 2025; v1 submitted 16 July, 2024; originally announced July 2024.

Comments: 23 pages, 42 figures

arXiv:2303.10008

doi: 10.1109/TASLP.2023.3313433

Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture

Authors: Julien Hauret, Thomas Joubaud, Véronique Zimpfer, Éric Bavu

Abstract: This paper presents a configurable version of Extreme Bandwidth Extension Network (EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with body-conduction microphones. We show that although these microphones significantly reduce environmental noise, this insensitivity to ambient noise happens at the expense of the bandwidth of the speech signal acquired by the wearer of the devices. The obtained captured signals therefore require the use of signal enhancement techniques to recover the full-bandwidth speech. EBEN leverages a configurable multiband decomposition of the raw captured signal. This decomposition allows the data time domain dimensions to be reduced and the full band signal to be better controlled. The multiband representation of the captured signal is processed through a U-Net-like model, which combines feature and adversarial losses to generate an enhanced speech signal. We also benefit from this original representation in the proposed configurable discriminators architecture. The configurable EBEN approach can achieve state-of-the-art enhancement results on synthetic data with a lightweight generator that allows real-time processing.

Submitted 12 September, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Comments: Accepted in IEEE/ACM Transactions on Audio, Speech and Language Processing on 14/08/2023

Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing (2023 - Volume: 31) - pp. 3499 - 3512

arXiv:2210.14090

doi: 10.1109/ICASSP49357.2023.10096301

EBEN: Extreme bandwidth extension network applied to speech signals captured with noise-resilient body-conduction microphones

Authors: Julien Hauret, Thomas Joubaud, Véronique Zimpfer, Éric Bavu

Abstract: In this paper, we present Extreme Bandwidth Extension Network (EBEN), a Generative Adversarial Network (GAN) that enhances audio measured with body-conduction microphones. This type of capture equipment suppresses ambient noise at the expense of speech bandwidth, thereby requiring signal enhancement techniques to recover the wideband speech signal. EBEN leverages a multiband decomposition of the raw captured speech to decrease the data time-domain dimensions and give better control over the full-band signal. This multiband representation is fed to a U-Net-like model, which adopts a combination of feature and adversarial losses to recover an enhanced audio signal. We also benefit from this original representation in the proposed discriminator architecture. Our approach can achieve state-of-the-art results with a lightweight generator and real-time compatible operation.

Submitted 3 March, 2023; v1 submitted 25 October, 2022; originally announced October 2022.

Comments: 5 pages, 5 figures, accepted to ICASSP 2023

Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Showing 1–5 of 5 results for author: Hauret, J