Showing 1–29 of 29 results for author: Moinet, A

  1. arXiv:2402.08093  [pdf, other]

    cs.LG cs.CL eess.AS

    BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

    Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

    Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts ra…

    Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: v1.1 (fixed typos)

  2. arXiv:2309.01576  [pdf, other]

    cs.CL cs.SD eess.AS

    A Comparative Analysis of Pretrained Language Models for Text-to-Speech

    Authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS t…

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 2023

  3. arXiv:2307.07062  [pdf, other]

    eess.AS cs.LG cs.SD

    Controllable Emphasis with zero data for text-to-speech

    Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

    Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: In proceeding of 12th Speech Synthesis Workshop (SSW) 2023

  4. arXiv:2306.11327  [pdf, other]

    eess.AS cs.SD

    eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

    Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion…

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2023

  5. arXiv:2206.14643  [pdf, other]

    eess.AS cs.CL

    Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

    Authors: Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

    Abstract: Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on m…

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  6. arXiv:2206.14165  [pdf, other]

    eess.AS cs.SD

    Expressive, Variable, and Controllable Duration Modelling in TTS

    Authors: Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

    Abstract: Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve…

    Submitted 28 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  7. arXiv:2206.13443  [pdf, other]

    eess.AS cs.SD

    CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

    Authors: Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

    Abstract: In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel appro…

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  8. arXiv:2202.06409  [pdf, other]

    eess.AS cs.CL cs.LG

    Distribution augmentation for low-resource expressive text-to-speech

    Authors: Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

    Abstract: This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a w…

    Submitted 19 February, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: ICASSP 2022: camera-ready

  9. arXiv:2106.15649  [pdf, other]

    eess.AS cs.LG cs.SD

    Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

    Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

    Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale me…

    Submitted 29 June, 2021; originally announced June 2021.

    Comments: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)

  10. arXiv:2106.10229  [pdf, other]

    eess.AS cs.LG cs.SD

    A learned conditional prior for the VAE acoustic space of a TTS system

    Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

    Abstract: Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for…

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: in Proceedings of Interspeech 2021

  11. arXiv:2012.09703  [pdf, other]

    eess.AS cs.SD

    Parallel WaveNet conditioned on VAE latent vectors

    Authors: Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote

    Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with…

    Submitted 17 December, 2020; originally announced December 2020.

  12. arXiv:2011.02252  [pdf, other]

    eess.AS cs.CL cs.SD

    Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

    Authors: Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

    Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information ava…

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: 5 pages and 3 figures

  13. arXiv:2011.01175  [pdf, other]

    eess.AS

    CAMP: a Two-Stage Approach to Modelling Prosody in Context

    Authors: Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

    Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In th…

    Submitted 12 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 5 pages. Published in the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

  14. arXiv:2005.11682  [pdf, other]

    eess.AS cs.CL cs.SD

    Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques

    Authors: Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D'Alessandro, Thierry Dutoit

    Abstract: This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decompositi…

    Submitted 24 May, 2020; originally announced May 2020.

  15. CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

    Authors: Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

    Abstract: Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained…

    Submitted 30 April, 2020; originally announced April 2020.

    Journal ref: INTERSPEECH 2020: 4387-4391

  16. arXiv:1912.12887  [pdf, other]

    cs.SD cs.CL eess.AS

    Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

    Authors: Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

    Abstract: This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The…

    Submitted 30 December, 2019; originally announced December 2019.

  17. arXiv:1912.05881  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Singing Synthesis: with a little help from my attention

    Authors: Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

    Abstract: We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis…

    Submitted 6 May, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

    Comments: Submitted to Interspeech 2020

  18. arXiv:1912.05289  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Voice Conversion for Whispered Speech Synthesis

    Authors: Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

    Abstract: We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speak…

    Submitted 17 January, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    Comments: Submitted to IEEE Signal Processing Letters

  19. arXiv:1904.10749  [pdf, other]

    cond-mat.stat-mech physics.soc-ph

    Random walks in non-Poissonian activity driven temporal networks

    Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

    Abstract: The interest in non-Markovian dynamics within the complex systems community has recently blossomed, due to a new wealth of time-resolved data pointing out the bursty dynamics of many natural and human interactions, manifested in an inter-event time between consecutive interactions showing a heavy-tailed distribution. In particular, empirical data has shown that the bursty dynamics of temporal netw…

    Submitted 24 April, 2019; originally announced April 2019.

    Comments: 10 pages, 4 figures

  20. arXiv:1903.01290  [pdf, other]

    cs.SD cs.CL eess.AS

    Traditional Machine Learning for Pitch Detection

    Authors: Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet

    Abstract: Pitch detection is a fundamental problem in speech processing as F0 is used in a large number of applications. Recent articles have proposed deep learning for robust pitch tracking. In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem. For both tasks, acoustic features from multiple domains and traditional machine learning metho…

    Submitted 4 March, 2019; originally announced March 2019.

    Journal ref: IEEE Signal Processing Letters, Vol. 25, Issue 11, pp. 1745-1749, 2018

  21. arXiv:1811.06296  [pdf, other]

    eess.AS cs.SD

    Comprehensive evaluation of statistical speech waveform synthesis

    Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote

    Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the pro…

    Submitted 11 December, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

  22. arXiv:1811.06292  [pdf, other]

    eess.AS cs.SD

    Towards achieving robust universal neural vocoding

    Authors: Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

    Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-d…

    Submitted 4 July, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 extra for references. Accepted on Interspeech 2019

  23. Generalized Voter-like model on activity driven networks with attractiveness

    Authors: Antoine Moinet, Alain Barrat, Romualdo Pastor Satorras

    Abstract: We study the behavior of a generalized consensus dynamics on a temporal network of interactions, the activity driven network with attractiveness. In this temporal network model, agents are endowed with an intrinsic activity $a$, ruling the rate at which they generate connections, and an intrinsic attractiveness $b$, modulating the rate at which they receive connections. The consensus dynamics cons…

    Submitted 20 April, 2018; originally announced April 2018.

    Journal ref: Phys. Rev. E 98, 022303 (2018)

  24. arXiv:1804.00610  [pdf, other]

    cs.CR

    BATMAN : plate-forme blockchain pour l'authentification et la confiance dans les WSNs

    Authors: Axel Moinet, Benoît Darties, Jean-Luc Baril

    Abstract: Wireless Sensor networks (WSN) today suffer from a lack of security adapted to their multiple constraints, to which authentication and trust management solutions such as PGP only partially responds. On the one hand, the constraints of autonomy and co-operation of the nodes necessary to guarantee the coherence of the network do not require a distributed solution. On the other hand, the constraints…

    Submitted 2 April, 2018; originally announced April 2018.

    Comments: 4 pages, in French, 2 figures

  25. arXiv:1801.06349  [pdf]

    cs.HC cs.AI cs.CV

    Proceedings of eNTERFACE 2015 Workshop on Intelligent Interfaces

    Authors: Matei Mancas, Christian Frisson, Joëlle Tilmanne, Nicolas d'Alessandro, Petr Barborka, Furkan Bayansar, Francisco Bernard, Rebecca Fiebrink, Alexis Heloir, Edgar Hemery, Sohaib Laraba, Alexis Moinet, Fabrizio Nunnari, Thierry Ravet, Loïc Reboursière, Alvaro Sarasua, Mickaël Tits, Noé Tits, François Zajéga, Paolo Alborno, Ksenia Kolykhalova, Emma Frid, Damiano Malafronte, Lisanne Huis in't Veld, Hüseyin Cakmak, et al. (49 additional authors not shown)

    Abstract: The 11th Summer Workshop on Multimodal Interfaces eNTERFACE 2015 was hosted by the Numediart Institute of Creative Technologies of the University of Mons from August 10th to September 2015. During the four weeks, students and researchers from all over the world came together in the Numediart Institute of the University of Mons to work on eight selected projects structured around intelligent interf…

    Submitted 19 January, 2018; originally announced January 2018.

    Comments: 159 pages

  26. Effect of risk perception on epidemic spreading in temporal networks

    Authors: Antoine Moinet, Alain Barrat, Romualdo Pastor Satorras

    Abstract: Many progresses in the understanding of epidemic spreading models have been obtained thanks to numerous modeling efforts and analytical and numerical studies, considering host populations with very different structures and properties, including complex and temporal interaction networks. Moreover, a number of recent studies have started to go beyond the assumption of an absence of coupling between…

    Submitted 16 October, 2017; originally announced October 2017.

    Journal ref: Phys. Rev. E 97, 012313 (2018)

  27. arXiv:1706.01730  [pdf, other]

    cs.CR cs.DC

    Blockchain based trust & authentication for decentralized sensor networks

    Authors: Axel Moinet, Benoît Darties, Jean-Luc Baril

    Abstract: Sensor networks and Wireless Sensor Networks (WSN) are key components for the development of the Internet of Things. These networks are subject of two kinds of constraints. Adaptability by the mean of mutability and evolutivity, and constrained node resources such as energy consumption, computational complexity or memory usage. In this context, none of the existing protocols and models allows reli…

    Submitted 6 June, 2017; originally announced June 2017.

    Comments: 6 pages, double-column. Preprint version submitted to IEEE Security & Privacy, Special Issue on Blockchain

  28. arXiv:1606.00593  [pdf, other]

    cond-mat.dis-nn physics.soc-ph

    Aging and percolation dynamics in a Non-Poissonian temporal network model

    Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

    Abstract: We present an exhaustive mathematical analysis of the recently proposed Non-Poissonian Activity Driven (NoPAD) model [Moinet et al. Phys. Rev. Lett., 114 (2015)], a temporal network model incorporating the empirically observed bursty nature of social interactions. We focus on the aging effects emerging from the Non-Poissonian dynamics of link activation, and on their effects on the topological p…

    Submitted 2 June, 2016; originally announced June 2016.

  29. arXiv:1412.0587  [pdf, other]

    physics.soc-ph cond-mat.stat-mech cs.SI

    Burstiness and aging in social temporal networks

    Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

    Abstract: The presence of burstiness in temporal social networks, revealed by a power law form of the waiting time distribution of consecutive interactions, is expected to produce aging effects in the corresponding time-integrated network. Here we propose an analytically tractable model, in which interactions among the agents are ruled by a renewal process, and that is able to reproduce this aging behavior.…

    Submitted 27 November, 2014; originally announced December 2014.

    Journal ref: Phys. Rev. Lett. 114, 108701 (2015)