Showing 1–14 of 14 results for author: Obin, N

  1. arXiv:2410.23320  [pdf, other]

    eess.AS cs.AI cs.SD

    Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis

    Authors: Théodor Lemerle, Harrison Vanderbyl, Vaibhav Srivastav, Nicolas Obin, Axel Roebel

    Abstract: Neural codec language models have achieved state-of-the-art performance in text-to-speech (TTS) synthesis, leveraging scalable architectures like autoregressive transformers and large-scale speech datasets. By framing voice cloning as a prompt continuation task, these models excel at cloning voices from short audio samples. However, this approach is limited in its ability to handle numerous or len…

    Submitted 30 October, 2024; originally announced October 2024.

    Comments: Preprint

  2. arXiv:2409.10357  [pdf, other]

    cs.CV cs.CL cs.LG cs.SD eess.AS

    2D or not 2D: How Does the Dimensionality of Gesture Representation Affect 3D Co-Speech Gesture Generation?

    Authors: Téo Guichoux, Laure Soulier, Nicolas Obin, Catherine Pelachaud

    Abstract: Co-speech gestures are fundamental for communication. The advent of recent deep learning techniques has facilitated the creation of lifelike, synchronous co-speech gestures for Embodied Conversational Agents. "In-the-wild" datasets, aggregating video content from platforms like YouTube via human pose detection technologies, provide a feasible solution by offering 2D skeletal sequences aligned with…

    Submitted 27 September, 2024; v1 submitted 16 September, 2024; originally announced September 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2406.15111

  3. arXiv:2406.15111  [pdf, other]

    cs.AI cs.CL cs.CV

    Investigating the impact of 2D gesture representation on co-speech gesture generation

    Authors: Teo Guichoux, Laure Soulier, Nicolas Obin, Catherine Pelachaud

    Abstract: Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection mod…

    Submitted 24 June, 2024; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: 8 pages. Paper accepted at WACAI 2024

  4. arXiv:2406.04467  [pdf, other]

    eess.AS cs.CL cs.SD

    Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis

    Authors: Théodor Lemerle, Nicolas Obin, Axel Roebel

    Abstract: Recent advancements in text-to-speech (TTS) powered by language models have showcased remarkable capabilities in achieving naturalness and zero-shot voice cloning. Notably, the decoder-only transformer is the prominent architecture in this domain. However, transformers face challenges stemming from their quadratic complexity in sequence length, impeding training on lengthy sequences and resource-c…

    Submitted 11 June, 2024; v1 submitted 6 June, 2024; originally announced June 2024.

    Comments: Interspeech

  5. arXiv:2311.05481  [pdf, other]

    cs.AI

    META4: Semantically-Aligned Generation of Metaphoric Gestures Using Self-Supervised Text and Speech Representation

    Authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin

    Abstract: Image Schemas are repetitive cognitive patterns that influence the way we conceptualize and reason about various concepts present in speech. These patterns are deeply embedded within our cognitive processes and are reflected in our bodily expressions including gestures. Particularly, metaphoric gestures possess essential characteristics and semantic meanings that align with Image Schemas, to visua…

    Submitted 21 November, 2023; v1 submitted 9 November, 2023; originally announced November 2023.

  6. arXiv:2309.02592  [pdf, other]

    eess.AS cs.SD

    BWSNet: Automatic Perceptual Assessment of Audio Signals

    Authors: Clément Le Moine Veillon, Victor Rosi, Pablo Arias Sarah, Léane Salais, Nicolas Obin

    Abstract: This paper introduces BWSNet, a model that can be trained from raw human judgements obtained through a Best-Worst scaling (BWS) experiment. It maps sound samples into an embedded space that represents the perception of a studied attribute. To this end, we propose a set of cost functions and constraints, interpreting trial-wise ordinal relations as distance comparisons in a metric learning task. We…

    Submitted 21 January, 2024; v1 submitted 5 September, 2023; originally announced September 2023.

  7. arXiv:2308.10843  [pdf, other]

    cs.MM cs.CV cs.LG cs.SD eess.AS

    TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation

    Authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin

    Abstract: This paper addresses the challenge of transferring the behavior expressivity style of one virtual agent to another while preserving the shape of the behaviors, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with…

    Submitted 8 August, 2023; originally announced August 2023.

  8. arXiv:2305.12887  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    ZS-MSTM: Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding

    Authors: Mireille Fares, Catherine Pelachaud, Nicolas Obin

    Abstract: In this study, we address the importance of modeling behavior style in virtual agents for personalized human-agent interaction. We propose a machine learning approach to synthesize gestures, driven by prosodic features and text, in the style of different speakers, even those unseen during training. Our model incorporates zero-shot multimodal style transfer using multimodal data from the PATS datab…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2208.01917

  9. arXiv:2208.01917  [pdf, other]

    cs.SD cs.HC cs.LG eess.AS

    Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding

    Authors: Mireille Fares, Michele Grimaldi, Catherine Pelachaud, Nicolas Obin

    Abstract: Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS datab…

    Submitted 3 August, 2022; originally announced August 2022.

  10. arXiv:2110.03744  [pdf, other]

    cs.SD eess.AS

    Voice Reenactment with F0 and timing constraints and adversarial learning of conversions

    Authors: Frederik Bous, Laurent Benaroya, Nicolas Obin, Axel Roebel

    Abstract: This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conv…

    Submitted 31 May, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: text overlap with arXiv:2107.12346

  11. arXiv:2107.12346  [pdf, other]

    cs.SD cs.LG eess.AS

    Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

    Authors: Laurent Benaroya, Nicolas Obin, Axel Roebel

    Abstract: Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity and prese…

    Submitted 27 July, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

  12. arXiv:2104.07288  [pdf, other]

    eess.AS cs.LG cs.SD

    Speaker Attentive Speech Emotion Recognition

    Authors: Clément Le Moine, Nicolas Obin, Axel Roebel

    Abstract: The Speech Emotion Recognition (SER) task has seen significant improvements in recent years with the advent of Deep Neural Networks (DNNs). However, even the most successful methods still tend to fail when adaptation to specific speakers and scenarios is needed, inevitably leading to poorer performance when compared to humans. In this paper, we present novel work based on the idea of teach…

    Submitted 15 April, 2021; originally announced April 2021.

  13. arXiv:2104.07283  [pdf, other]

    eess.AS cs.LG cs.SD

    Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels

    Authors: Clément Le Moine Veillon, Nicolas Obin, Axel Roebel

    Abstract: This paper presents an end-to-end framework for the F0 transformation in the context of expressive voice conversion. A single neural network is proposed, in which a first module is used to learn an F0 representation over different temporal scales and a second adversarial module is used to learn the transformation from one emotion to another. The first module is composed of a convolution layer with wav…

    Submitted 15 April, 2021; originally announced April 2021.

  14. arXiv:1910.12614  [pdf, other]

    eess.AS cs.LG cs.SD

    CycleGAN Voice Conversion of Spectral Envelopes using Adversarial Weights

    Authors: Rafael Ferro, Nicolas Obin, Axel Roebel

    Abstract: This paper tackles GAN optimization and stability issues in the context of voice conversion. First, to simplify the conversion task, we propose to use spectral envelopes as inputs. Second, we propose two adversarial weight training paradigms, the generalized weighted GAN and the generator impact GAN, both of which aim at reducing the impact of the generator on the discriminator, so both can learn more gradu…

    Submitted 11 July, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: 5 pages, 1 figure