Showing 1–7 of 7 results for author: Elhilali, M

  1. arXiv:2409.10819  [pdf, ps, other]

    eess.AS cs.SD

    EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

    Authors: Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

    Abstract: We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling techni…

    Submitted 19 June, 2025; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: Accepted at Interspeech 2025

  2. arXiv:2409.08425  [pdf, other]

    eess.AS cs.SD

    SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

    Authors: Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

    Abstract: In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for targe…

    Submitted 1 January, 2025; v1 submitted 12 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  3. arXiv:2406.16314  [pdf, other]

    eess.AS

    DreamVoice: Text-Guided Voice Conversion

    Authors: Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, Mounya Elhilali

    Abstract: Generative voice technologies are rapidly evolving, offering opportunities for more personalized and inclusive experiences. Traditional one-shot voice conversion (VC) requires a target recording during inference, limiting ease of usage in generating desired voice timbres. Text-guided generation offers an intuitive solution to convert voices to desired "DreamVoices" according to the users' needs. O…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  4. arXiv:2311.00814  [pdf, other]

    cs.SD eess.AS

    Investigating Self-Supervised Deep Representations for EEG-based Auditory Attention Decoding

    Authors: Karan Thakkar, Jiarui Hai, Mounya Elhilali

    Abstract: Auditory Attention Decoding (AAD) algorithms play a crucial role in isolating desired sound sources within challenging acoustic environments directly from brain activity. Although recent research has shown promise in AAD using shallow representations such as auditory envelope and spectrogram, there has been limited exploration of deep Self-Supervised (SS) representations on a larger scale. In this…

    Submitted 7 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2310.04567  [pdf, other]

    eess.AS cs.SD

    DPM-TSE: A Diffusion Probabilistic Model for Target Sound Extraction

    Authors: Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali

    Abstract: Common target sound extraction (TSE) approaches primarily relied on discriminative approaches in order to separate the target sound while minimizing interference from the unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, a first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, to achieve…

    Submitted 9 October, 2023; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  6. arXiv:2105.13392  [pdf, other]

    cs.SD cs.LG eess.AS

    Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures

    Authors: Sangwook Park, David K. Han, Mounya Elhilali

    Abstract: Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and time boundaries for each sound event in a continuous recording. With advances in deep neural networks, there has been tremendous improvement in the performance of sound event detection systems, although at the expense of costly data collection and labeling ef…

    Submitted 27 May, 2021; originally announced May 2021.

    Journal ref: IEEE Transactions on Multimedia, vol. 25, pp. 4573-4585, 2023

  7. arXiv:1811.04048  [pdf, ps, other]

    eess.AS cs.SD

    Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection

    Authors: Sandeep Kothinti, Keisuke Imoto, Debmalya Chakrabarty, Gregory Sell, Shinji Watanabe, Mounya Elhilali

    Abstract: Sound event detection is a challenging task, especially for scenes with multiple simultaneous events. While event classification methods tend to be fairly accurate, event localization presents additional challenges, especially when large amounts of labeled data are not available. Task4 of the 2018 DCASE challenge presents an event detection task that requires accuracy in both segmentation and reco…

    Submitted 9 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP 2019