Showing 1–26 of 26 results for author: Essid, S

  1. arXiv:2502.17527

    cs.SD cs.AI eess.AS eess.SP

    Perceptual Noise-Masking with Music through Deep Spectral Envelope Shaping

    Authors: Clémentine Berger, Roland Badeau, Slim Essid

    Abstract: People often listen to music in noisy environments, seeking to isolate themselves from ambient sounds. Indeed, a music signal can mask some of the noise's frequency components due to the effect of simultaneous masking. In this article, we propose a neural network based on a psychoacoustic masking model, designed to enhance the music's ability to mask ambient noise by reshaping its spectral envelop…

    Submitted 24 February, 2025; originally announced February 2025.

    Journal ref: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Apr 2025, Hyderabad, India

  2. arXiv:2412.01488

    eess.AS cs.LG eess.IV

    TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization

    Authors: Hugo Malard, Michel Olvera, Stephane Lathuiliere, Slim Essid

    Abstract: Large-scale pre-trained audio and image models demonstrate an unprecedented degree of generalization, making them suitable for a wide range of applications. Here, we tackle the specific task of sound-prompted segmentation, aiming to segment image regions corresponding to objects heard in an audio signal. Most existing approaches tackle this problem by fine-tuning pre-trained models or by training…

    Submitted 26 May, 2025; v1 submitted 2 December, 2024; originally announced December 2024.

  3. arXiv:2411.18497

    cs.SD cs.LG eess.AS stat.ML

    Multiple Choice Learning for Efficient Speech Separation with Many Speakers

    Authors: David Perera, François Derrida, Théo Mariotte, Gaël Richard, Slim Essid

    Abstract: Training speech separation models in the supervised setting raises a permutation problem: finding the best assignment between the model predictions and the ground truth separated signals. This inherently ambiguous task is customarily solved using Permutation Invariant Training (PIT). In this article, we instead consider using the Multiple Choice Learning (MCL) framework, which was originally intr…

    Submitted 27 November, 2024; originally announced November 2024.

  4. arXiv:2411.04152

    eess.AS cs.SD

    A Contrastive Self-Supervised Learning scheme for beat tracking amenable to few-shot learning

    Authors: Antonin Gagnere, Geoffroy Peeters, Slim Essid

    Abstract: In this paper, we propose a novel Self-Supervised-Learning scheme to train rhythm analysis systems and instantiate it for few-shot beat tracking. Taking inspiration from the Contrastive Predictive Coding paradigm, we propose to train a Log-Mel-Spectrogram Transformer encoder to contrast observations at times separated by hypothesized beat intervals from those that are not. We do this without the k…

    Submitted 6 November, 2024; originally announced November 2024.

    Journal ref: ISMIR 2024, Nov 2024, San Francisco, California, United States

  5. arXiv:2410.05997

    eess.AS cs.CV cs.LG cs.SD

    An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

    Authors: Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid

    Abstract: Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for audio captioning, where the joint image-language decoder can be leveraged to describe auditory content associated with image sequences within videos featuring aud…

    Submitted 8 October, 2024; originally announced October 2024.

  6. arXiv:2409.13676

    cs.SD cs.AI eess.AS

    A sound description: Exploring prompt templates and class descriptions to enhance zero-shot audio classification

    Authors: Michel Olvera, Paraskevas Stamatiadis, Slim Essid

    Abstract: Audio-text models trained via contrastive learning offer a practical approach to perform audio classification through natural language prompts, such as "this is a sound of" followed by category names. In this work, we explore alternative prompt templates for zero-shot audio classification, demonstrating the existence of higher-performing options. First, we find that the formatting of the prompts s…

    Submitted 19 September, 2024; originally announced September 2024.

    Comments: DCASE 2024 - 9th Workshop on Detection and Classification of Acoustic Scenes and Events, Oct 2024, Tokyo, Japan

  7. arXiv:2409.11746

    cs.SD eess.AS

    SALT: Standardized Audio event Label Taxonomy

    Authors: Paraskevas Stamatiadis, Michel Olvera, Slim Essid

    Abstract: Machine listening systems often rely on fixed taxonomies to organize and label audio data, key for training and evaluating deep neural networks (DNNs) and other supervised algorithms. However, such taxonomies face significant constraints: they are composed of application-dependent predefined categories, which hinders the integration of new or varied sounds, and exhibit limited cross-dataset compa…

    Submitted 18 September, 2024; originally announced September 2024.

    Journal ref: DCASE, Oct 2024, Tokyo, Japan

  8. arXiv:2407.15580

    cs.LG cs.SD eess.AS math.PR stat.ML

    Annealed Multiple Choice Learning: Overcoming limitations of Winner-takes-all with annealing

    Authors: David Perera, Victor Letzelter, Théo Mariotte, Adrien Cortés, Mickael Chen, Slim Essid, Gaël Richard

    Abstract: We introduce Annealed Multiple Choice Learning (aMCL) which combines simulated annealing with MCL. MCL is a learning framework handling ambiguous tasks by predicting a small set of plausible hypotheses. These hypotheses are trained using the Winner-takes-all (WTA) scheme, which promotes the diversity of the predictions. However, this scheme may converge toward an arbitrarily suboptimal local minim…

    Submitted 17 January, 2025; v1 submitted 22 July, 2024; originally announced July 2024.

    Comments: NeurIPS 2024

  9. arXiv:2407.00756

    eess.AS cs.SD

    Less Forgetting for Better Generalization: Exploring Continual-learning Fine-tuning Methods for Speech Self-supervised Representations

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Despite being trained on massive and diverse datasets, speech self-supervised encoders are generally used for downstream purposes as mere frozen feature extractors or model initializers before fine-tuning. The former severely limits the exploitation of large encoders, while the latter hurts the robustness acquired during pretraining, especially in low-resource scenarios. This work explores middle-…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: 5 Pages

  10. arXiv:2406.04706

    cs.LG cs.NE eess.SP math.PR stat.ML

    Winner-takes-all learners are geometry-aware conditional density estimators

    Authors: Victor Letzelter, David Perera, Cédric Rommel, Mathieu Fontaine, Slim Essid, Gael Richard, Patrick Pérez

    Abstract: Winner-takes-all training is a simple learning paradigm, which handles ambiguous tasks by predicting a set of plausible hypotheses. Recently, a connection was established between Winner-takes-all training and centroidal Voronoi tessellations, showing that, once trained, hypotheses should quantize optimally the shape of the conditional distribution to predict. However, the best use of these hypothe…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: International Conference on Machine Learning, Jul 2024, Vienna, Austria

  11. arXiv:2404.08022

    cs.SD eess.AS

    A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

    Authors: Thomas Serre, Mathieu Fontaine, Éric Benhaim, Geoffroy Dutour, Slim Essid

    Abstract: Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embed…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Accepted at HSCMA24, Satellite workshop of ICASSP24

    Journal ref: ICASSP, Apr 2024, Seoul, South Korea

  12. arXiv:2402.00067

    eess.AS cs.LG cs.SD eess.SP

    Online speaker diarization of meetings guided by speech separation

    Authors: Elio Gruttadauria, Mathieu Fontaine, Slim Essid

    Abstract: Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarizatio…

    Submitted 30 January, 2024; originally announced February 2024.

    Comments: Accepted at ICASSP 2024

    Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr 2024, Seoul, South Korea

  13. arXiv:2312.14005

    cs.SD cs.AI eess.AS

    On the choice of the optimal temporal support for audio classification with Pre-trained embeddings

    Authors: Aurian Quelennec, Michel Olvera, Geoffroy Peeters, Slim Essid

    Abstract: Current state-of-the-art audio analysis systems rely on pre-trained embedding models, often used off-the-shelf as (frozen) feature extractors. Choosing the best one for a set of tasks is the subject of many recent publications. However, one aspect often overlooked in these works is the influence of the duration of audio input considered to extract an embedding, which we refer to as Temporal Suppor…

    Submitted 21 December, 2023; originally announced December 2023.

    Comments: Copyright 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  14. arXiv:2308.14456

    eess.AS cs.LG cs.SD eess.SP

    Speech Self-Supervised Representations Benchmarking: a Case for Larger Probing Heads

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) leverages large datasets of unlabeled speech to reach impressive performance with reduced amounts of annotated data. The high number of proposed approaches fostered the emergence of comprehensive benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. However, while the number of considered tasks has bee…

    Submitted 21 February, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

    Comments: 18 Pages

  15. arXiv:2307.16582

    eess.AS cs.SD

    SAMbA: Speech enhancement with Asynchronous ad-hoc Microphone Arrays

    Authors: Nicolas Furnon, Romain Serizel, Slim Essid, Irina Illina

    Abstract: Speech enhancement in ad-hoc microphone arrays is often hindered by the asynchronization of the devices composing the microphone array. Asynchronization comes from sampling time offset and sampling rate offset which inevitably occur when the microphones are embedded in different hardware components. In this paper, we propose a deep neural network (DNN)-based speech enhancement solution that is sui…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: Submitted to INTERSPEECH 2022

  16. arXiv:2306.00481

    eess.AS cs.LG

    Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models even with small annotated datasets. Despite this, speech SSL representations may fail while facing an acoustic mismatch between the pretraining and target datasets. To address this issue, we propose a novel supervised domain adaptation method, designe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages, INTERSPEECH 2023

  17. arXiv:2306.00452

    eess.AS cs.LG

    Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?

    Authors: Salah Zaiem, Youcef Kemiche, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has recently allowed leveraging large datasets of unlabeled speech signals to reach impressive performance on speech tasks using only small amounts of annotated data. The high number of proposed approaches fostered the need and rise of extended benchmarks that evaluate their performance on a set of downstream tasks exploring various aspects of the speech signal. Howe…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: 6 pages

    Journal ref: INTERSPEECH 2023

  18. arXiv:2303.06740

    eess.AS cs.LG

    Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

    Authors: Salah Zaiem, Robin Algayres, Titouan Parcollet, Slim Essid, Mirco Ravanelli

    Abstract: Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial for achieving lower downstream ASR error rates. Thus, better performance might be sanctioned with longer inferences. This article explores different approaches…

    Submitted 12 March, 2023; originally announced March 2023.

    Comments: Submitted to ICASSP "Self-supervision in Audio, Speech and Beyond" workshop

  19. arXiv:2204.04170

    eess.AS cs.LG cs.SD eess.SP

    Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework various data augmentation techniques are usually exploited to help enforce desired invariances within the learned representations, improving performance on various audio tasks thanks to mo…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  20. arXiv:2107.00594

    eess.AS cs.LG cs.SD stat.ML

    Pretext Tasks selection for multitask self-supervised speech representation learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid, Abdel Heba

    Abstract: Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In audio/speech signal processing, a wide range of features were engineered through decades of research efforts. As it turns out, learning to predict such features (a.k.a. pseudo-labels) has proven to be a particularl…

    Submitted 11 November, 2022; v1 submitted 1 July, 2021; originally announced July 2021.

  21. arXiv:2106.07939

    eess.SP cs.SD eess.AS

    Attention-based distributed speech enhancement for unconstrained microphone arrays with varying number of nodes

    Authors: Nicolas Furnon, Romain Serizel, Slim Essid, Irina Illina

    Abstract: Speech enhancement promises higher efficiency in ad-hoc microphone arrays than in constrained microphone arrays thanks to the wide spatial coverage of the devices in the acoustic scene. However, speech enhancement in ad-hoc microphone arrays still raises many challenges. In particular, the algorithms should be able to handle a variable number of microphones, as some devices in the array might appe…

    Submitted 15 June, 2021; originally announced June 2021.

    Journal ref: European Signal Processing Conference (EUSIPCO), IEEE, Aug 2021, Dublin, Ireland

  22. arXiv:2104.07388

    eess.AS cs.LG cs.SD eess.SP

    Conditional independence for pretext task selection in Self-supervised speech representation learning

    Authors: Salah Zaiem, Titouan Parcollet, Slim Essid

    Abstract: Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists in pretraining an SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data where various meaningful signal processing fea…

    Submitted 1 July, 2021; v1 submitted 15 April, 2021; originally announced April 2021.

    Comments: 5 pages, Accepted for presentation at Interspeech2021

  23. arXiv:2011.01714

    eess.SP

    DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays

    Authors: Nicolas Furnon, Romain Serizel, Irina Illina, Slim Essid

    Abstract: Deep neural network (DNN)-based speech enhancement algorithms in microphone arrays have now proven to be efficient solutions to speech understanding and speech recognition in noisy environments. However, in the context of ad-hoc microphone arrays, many challenges remain and raise the need for distributed processing. In this paper, we propose to extend a previously introduced distributed DNN-based…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Submitted to TASLP

  24. arXiv:2011.00982

    eess.SP

    Distributed speech separation in spatially unconstrained microphone arrays

    Authors: Nicolas Furnon, Romain Serizel, Irina Illina, Slim Essid

    Abstract: Speech separation with several speakers is a challenging task because of the non-stationarity of the speech and the strong signal similarity between interfering sources. Current state-of-the-art solutions can separate the different sources well using sophisticated deep neural networks which are very tedious to train. When several microphones are available, spatial information can be exploited to d…

    Submitted 8 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Journal ref: ICASSP 2021 - 46th International Conference on Acoustics, Speech, and Signal Processing, Jun 2021, Toronto, Canada

  25. arXiv:2002.06016

    cs.SD cs.AI eess.AS

    DNN-Based Distributed Multichannel Mask Estimation for Speech Enhancement in Microphone Arrays

    Authors: Nicolas Furnon, Romain Serizel, Irina Illina, Slim Essid

    Abstract: Multichannel processing is widely used for speech enhancement, but several limitations appear when trying to deploy these solutions in the real world. Distributed sensor arrays that consider several devices with a few microphones are a viable alternative that allows for exploiting the multiple devices equipped with microphones that we are using in our everyday life. In this context, we propose to ex…

    Submitted 16 March, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

    Comments: Submitted to ICASSP2020

    Journal ref: International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, Barcelona, Spain

  26. arXiv:1804.07345

    cs.CV cs.SD eess.AS

    Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events

    Authors: Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Pérez, Gaël Richard

    Abstract: Audio-visual representation learning is an important task from the perspective of designing machines with the ability to understand complex events. To this end, we propose a novel multimodal framework that instantiates multiple instance learning. We show that the learnt representations are useful for classifying events and localizing their characteristic audio-visual elements. The system is traine…

    Submitted 9 July, 2018; v1 submitted 19 April, 2018; originally announced April 2018.