Showing 1–10 of 10 results for author: Duret, J

  1. arXiv:2407.18332

    eess.AS cs.CL cs.LG cs.SD

    Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

    Authors: Jarod Duret, Yannick Estève, Titouan Parcollet

    Abstract: Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. Although most state-of-the-art systems adopt a similar architecture to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain an open question. T…

    Submitted 8 July, 2024; originally announced July 2024.

  2. arXiv:2407.05746

    cs.AI cs.SD eess.AS

    MSP-Podcast SER Challenge 2024: L'antenne du Ventoux Multimodal Self-Supervised Learning for Speech Emotion Recognition

    Authors: Jarod Duret, Mickael Rouvier, Yannick Estève

    Abstract: In this work, we detail our submission to the 2024 edition of the MSP-Podcast Speech Emotion Recognition (SER) Challenge. This challenge is divided into two distinct tasks: Categorical Emotion Recognition and Emotional Attribute Prediction. We concentrated our efforts on Task 1, which involves the categorical classification of eight emotional states using data from the MSP-Podcast dataset. Our app…

    Submitted 8 July, 2024; originally announced July 2024.

    Journal ref: Odyssey 2024, Jun 2024, Quebec, Canada

  3. arXiv:2407.00463

    cs.LG cs.AI cs.CL cs.HC eess.AS

    Open-Source Conversational AI with SpeechBrain 1.0

    Authors: Mirco Ravanelli, Titouan Parcollet, Adel Moumen, Sylvain de Langen, Cem Subakan, Peter Plantinga, Yingzhi Wang, Pooneh Mousavi, Luca Della Libera, Artem Ploujnikov, Francesco Paissan, Davide Borra, Salah Zaiem, Zeyu Zhao, Shucong Zhang, Georgios Karakasidis, Sung-Lin Yeh, Pierre Champion, Aku Rouhe, Rudolf Braun, Florian Mai, Juan Zuluaga-Gomez, Seyed Mahed Mousavi, Andreas Nautsch, Ha Nguyen, et al. (8 additional authors not shown)

    Abstract: SpeechBrain is an open-source Conversational AI toolkit based on PyTorch, focused particularly on speech processing tasks such as speech recognition, speech enhancement, speaker recognition, text-to-speech, and much more. It promotes transparency and replicability by releasing both the pre-trained models and the complete "recipes" of code and algorithms required for training them. This paper prese…

    Submitted 16 October, 2024; v1 submitted 29 June, 2024; originally announced July 2024.

    Comments: Accepted to the Journal of Machine Learning Research (JMLR), Machine Learning Open Source Software

  4. arXiv:2406.14294

    cs.SD cs.AI eess.AS

    DASB - Discrete Audio and Speech Benchmark

    Authors: Pooneh Mousavi, Luca Della Libera, Jarod Duret, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently…

    Submitted 21 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

    Comments: 9 pages, 5 tables

  5. arXiv:2406.10735

    cs.SD cs.AI cs.CL eess.AS

    How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

    Authors: Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli

    Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and N…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: 4 pages, 2 figures, 2 tables, Accepted at Interspeech 2024

  6. arXiv:2310.07279

    cs.SD cs.CL eess.AS

    Enhancing expressivity transfer in textless speech-to-speech translation

    Authors: Jarod Duret, Benjamin O'Brien, Yannick Estève, Titouan Parcollet

    Abstract: Textless speech-to-speech translation systems are rapidly advancing, thanks to the integration of self-supervised learning techniques. However, existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages. Expressivity plays a vital role in conveying emotions, nuances, and cultural subtleties, thereby enhancing communic…

    Submitted 11 October, 2023; originally announced October 2023.

    Journal ref: ASRU, Dec 2023, Taipei, Taiwan

  7. arXiv:2309.07478

    cs.CL cs.LG cs.SD eess.AS

    Direct Text to Speech Translation System using Acoustic Units

    Authors: Victoria Mingote, Pablo Gimeno, Luis Vicente, Sameer Khurana, Antoine Laurent, Jarod Duret

    Abstract: This paper proposes a direct text to speech translation system using discrete acoustic units. This framework employs text in different source languages as input to generate speech in the target language without the need for text transcriptions in this language. Motivated by the success of acoustic units in previous works for direct speech to speech translation systems, we use the same pipeline to…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures

  8. arXiv:2306.17199

    eess.AS cs.CL cs.SD

    Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

    Authors: Jarod Duret, Titouan Parcollet, Yannick Estève

    Abstract: We propose a method for speech-to-speech emotion-preserving translation that operates at the level of discrete speech units. Our approach relies on the use of multilingual emotion embedding that can capture affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesiz…

    Submitted 29 June, 2023; originally announced June 2023.

    Journal ref: Speech Synthesis Workshop (SSW), Aug 2023, Grenoble, France

  9. arXiv:2204.00803

    cs.CL cs.SD eess.AS

    End-to-end model for named entity recognition from speech without paired training data

    Authors: Salima Mdhaffar, Jarod Duret, Titouan Parcollet, Yannick Estève

    Abstract: Recent works showed that end-to-end neural approaches tend to become very popular for spoken language understanding (SLU). Through the term end-to-end, one considers the use of a single model optimized to extract semantic information directly from the speech signal. A major issue for such models is the lack of paired audio and textual data with semantic annotation. In this paper, we propose an app…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  10. arXiv:2105.04310

    eess.AS cs.SD

    Study on the temporal pooling used in deep neural networks for speaker verification

    Authors: Mickael Rouvier, Pierre-Michel Bousquet, Jarod Duret

    Abstract: The x-vector architecture has recently achieved state-of-the-art results on the speaker verification task. This architecture incorporates a central layer, referred to as temporal pooling, which stacks statistical parameters of the acoustic frame distribution. This work proposes to highlight the significant effect of the temporal pooling content on the training dynamics and task performance. An eva…

    Submitted 10 May, 2021; originally announced May 2021.