BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Authors: Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Submitted 15 February, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

Comments: v1.1 (fixed typos)

arXiv:2309.01576

A Comparative Analysis of Pretrained Language Models for Text-to-Speech

Authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

Abstract: State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS tasks: prosody prediction and pause prediction. Firstly, we trained a prosody prediction model using 15 different PLMs. Our findings revealed a logarithmic relationship between model size and quality, as well as significant performance differences between neutral and expressive prosody. Secondly, we employed PLMs for pause prediction and found that the task was less sensitive to small models. We also identified a strong correlation between our empirical results and the GLUE scores obtained for these language models. To the best of our knowledge, this is the first study of its kind to investigate the impact of different PLMs on TTS.

Submitted 4 September, 2023; originally announced September 2023.

Comments: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 2023

arXiv:2307.07062

Controllable Emphasis with zero data for text-to-speech

Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques improving naturalness by $7.3\%$ and correct testers' identification of the emphasized word in a sentence by $40\%$ on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.

Submitted 13 July, 2023; originally announced July 2023.

Comments: In proceeding of 12th Speech Synthesis Workshop (SSW) 2023

arXiv:2306.11327

eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion from speech. In Stage II, we learn to predict the prosody representations using the contextual information available in text. We compare eCat to CopyCat2, a model capable of both fine-grained prosody transfer (FPT) and multi-speaker TTS. We show that eCat statistically significantly reduces the gap in naturalness between CopyCat2 and human recordings by an average of 46.7% across 2 languages, 3 locales, and 7 speakers, along with better target-speaker similarity in FPT. We also compare eCat to VITS, and show a statistically significant preference.

Submitted 20 June, 2023; originally announced June 2023.

Comments: Accepted to be published in the Proceedings of InterSpeech 2023

arXiv:2206.14643

Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

Authors: Peter Makarov, Ammar Abbas, Mateusz Łajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

Abstract: Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on multi-speaker data all improve prosody. More interestingly, they result in synergies. Long context disambiguates prosody, improves coherence, and plays to the strengths of Transformers. Fine-tuning word-level features from a powerful language model, such as BERT, appears to profit from more training data, readily available in a multi-speaker setting. We look into objective metrics on pausing and pacing and perform thorough subjective evaluations for speech naturalness. Our main system, which incorporates all the extensions, achieves consistently strong results, including statistically significant improvements in speech naturalness over all its competitors.

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2206.14165

Expressive, Variable, and Controllable Duration Modelling in TTS

Authors: Ammar Abbas, Thomas Merritt, Alexis Moinet, Sri Karlapati, Ewa Muszynska, Simon Slangen, Elia Gatti, Thomas Drugman

Abstract: Duration modelling has become an important research problem once more with the rise of non-attention neural text-to-speech systems. The current approaches largely fall back to relying on previous statistical parametric speech synthesis technology for duration prediction, which poorly models the expressiveness and variability in speech. In this paper, we propose two alternate approaches to improve duration modelling. First, we propose a duration model conditioned on phrasing that improves the predicted durations and provides better modelling of pauses. We show that the duration model conditioned on phrasing improves the naturalness of speech over our baseline duration model. Second, we also propose a multi-speaker duration model called Cauliflow, that uses normalising flows to predict durations that better match the complex target duration distribution. Cauliflow performs on par with our other proposed duration model in terms of naturalness, whilst providing variable durations for the same prompt and variable levels of expressiveness. Lastly, we propose to condition Cauliflow on parameters that provide an intuitive control of the pacing and pausing in the synthesised speech in a novel way.

Submitted 28 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2206.13443

CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

Authors: Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

Abstract: In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel approach to two-stage training. In Stage I, the model learns speaker-independent word-level prosody representations from speech which it uses for many-to-many fine-grained prosody transfer. In Stage II, we learn to predict these prosody representations using the contextual information available in text, thereby, enabling multi-speaker TTS with contextually appropriate prosody. We compare CC2 to two strong baselines, one in TTS with contextually appropriate prosody, and one in fine-grained prosody transfer. CC2 reduces the gap in naturalness between our baseline and copy-synthesised speech by $22.79\%$. In fine-grained prosody transfer evaluations, it obtains a relative improvement of $33.15\%$ in target speaker similarity.

Submitted 27 June, 2022; originally announced June 2022.

Comments: Accepted to be published in the Proceedings of InterSpeech 2022

arXiv:2202.06409

Distribution augmentation for low-resource expressive text-to-speech

Authors: Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

Abstract: This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a way that preserves syntactical correctness. We take additional measures to ensure that synthesized speech does not contain artifacts caused by combining inconsistent audio samples. The perceptual evaluations show that our method improves speech quality over a number of datasets, speakers, and TTS architectures. We also demonstrate that it greatly improves robustness of attention-based TTS models.

Submitted 19 February, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

Comments: ICASSP 2022: camera-ready

arXiv:2106.15649

Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition. Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.

Submitted 29 June, 2021; originally announced June 2021.

Comments: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)

arXiv:2106.10229

A learned conditional prior for the VAE acoustic space of a TTS system

Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

Abstract: Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking explicitly the conditioning into account and resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates significant preference of the proposed approach over standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.

Submitted 14 June, 2021; originally announced June 2021.

Comments: in Proceedings of Interspeech 2021

arXiv:2012.09703

Parallel WaveNet conditioned on VAE latent vectors

Authors: Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote

Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with natural speech recordings. However, the inference speed of neural vocoder approaches represents a major obstacle for deploying this technology for commercial applications. Parallel WaveNet is one approach which has been developed to address this issue, trading off some synthesis quality for significantly faster inference speed. In this paper we investigate the use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder. We condition the neural vocoder with the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model. With this, we are able to significantly improve the quality of vocoded speech.

Submitted 17 December, 2020; originally announced December 2020.

arXiv:2011.02252

Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

Authors: Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text. To do this, we use BERT on text, and graph-attention networks on parse trees extracted from text. We show a statistically significant relative improvement of $13.2\%$ in naturalness over a strong baseline when compared to recordings. We also conduct an ablation study on variations of our sampling technique, and show a statistically significant improvement over the baseline in each case.

Submitted 4 November, 2020; originally announced November 2020.

Comments: 5 pages and 3 figures

arXiv:2011.01175

CAMP: a Two-Stage Approach to Modelling Prosody in Context

Authors: Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our Context-Aware Model of Prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.

Submitted 12 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

Comments: 5 pages. Published in the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

arXiv:2005.11682

Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques

Authors: Thomas Drugman, Thomas Dubuisson, Alexis Moinet, Nicolas D'Alessandro, Thierry Dutoit

Abstract: This paper addresses the problem of estimating the voice source directly from speech waveforms. A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase. This technique is compared to two other state-of-the-art well-known methods, namely the Zeros of the Z-Transform (ZZT) and the Iterative Adaptive Inverse Filtering (IAIF) algorithms. Decomposition quality is assessed on synthetic signals through two objective measures: the spectral distortion and a glottal formant determination rate. Technique robustness is tested by analyzing the influence of noise and Glottal Closure Instant (GCI) location errors. Besides impacts of the fundamental frequency and the first formant on the performance are evaluated. Our proposed approach shows significant improvement in robustness, which could be of a great interest when decomposing real speech.

Submitted 24 May, 2020; originally announced May 2020.

arXiv:2004.14617

doi 10.21437/Interspeech.2020-1251

CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

Authors: Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

Abstract: Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained PT suffer from source speaker leakage, where the synthesised speech has the voice identity of the source speaker as opposed to the target speaker. In order to mitigate this issue, they compromise on the quality of PT. In this paper, we propose CopyCat, a novel, many-to-many PT system that is robust to source speaker leakage, without using parallel data. We achieve this through a novel reference encoder architecture capable of capturing temporal prosodic representations which are robust to source speaker leakage. We compare CopyCat against a state-of-the-art fine-grained PT model through various subjective evaluations, where we show a relative improvement of $47\%$ in the quality of prosody transfer and $14\%$ in preserving the target speaker identity, while still maintaining the same naturalness.

Submitted 30 April, 2020; originally announced April 2020.

Journal ref: INTERSPEECH 2020: 4387-4391

arXiv:1912.12887

Using a Pitch-Synchronous Residual Codebook for Hybrid HMM/Frame Selection Speech Synthesis

Authors: Thomas Drugman, Alexis Moinet, Thierry Dutoit, Geoffrey Wilfart

Abstract: This paper proposes a method to improve the quality delivered by statistical parametric speech synthesizers. For this, we use a codebook of pitch-synchronous residual frames, so as to construct a more realistic source signal. First a limited codebook of typical excitations is built from some training database. During the synthesis part, HMMs are used to generate filter and source coefficients. The latter coefficients contain both the pitch and a compact representation of target residual frames. The source signal is obtained by concatenating excitation frames picked up from the codebook, based on a selection criterion and taking target residual coefficients as input. Subjective results show a relevant improvement compared to the basic technique.

Submitted 30 December, 2019; originally announced December 2019.

arXiv:1912.05881

Singing Synthesis: with a little help from my attention

Authors: Orazio Angelini, Alexis Moinet, Kayoko Yanagisawa, Thomas Drugman

Abstract: We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing synthesis field and improves naturalness over the state of the art. The system requires considerably less explicit modelling of voice features such as F0 patterns, vibratos, and note and phoneme durations, than previous models in the literature. Despite this, it shows a strong improvement in naturalness with respect to previous neural singing synthesis models. The model does not require any durations or pitch patterns as inputs, and learns to insert vibrato autonomously according to the musical context. However, we observe that, by completely dispensing with any explicit duration modelling it becomes harder to obtain the fine control of timing needed to exactly match the tempo of a song.

Submitted 6 May, 2020; v1 submitted 12 December, 2019; originally announced December 2019.

Comments: Submitted to Interspeech 2020

arXiv:1912.05289

doi 10.1109/LSP.2019.2961213

Voice Conversion for Whispered Speech Synthesis

Authors: Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

Abstract: We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speaker similarity of the converted whisper on an internal corpus and on the publicly available wTIMIT corpus. We show that applying VC techniques is significantly better than using rule-based signal processing methods and it achieves results that are indistinguishable from copy-synthesis of natural whisper recordings. We investigate the ability of the DNN model to generalize on unseen speakers, when trained with data from multiple speakers. We show that excluding the target speaker from the training set has little or no impact on the perceived naturalness and speaker similarity of the converted whisper. The proposed DNN method is used in the newly released Whisper Mode of Amazon Alexa.

Submitted 17 January, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

Comments: Submitted to IEEE Signal Processing Letters

arXiv:1904.10749 [pdf, other]

Random walks in non-Poissoinan activity driven temporal networks

Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

Abstract: The interest in non-Markovian dynamics within the complex systems community has recently blossomed, due to a new wealth of time-resolved data pointing out the bursty dynamics of many natural and human interactions, manifested in an inter-event time between consecutive interactions showing a heavy-tailed distribution. In particular, empirical data has shown that the bursty dynamics of temporal networks can have deep consequences on the behavior of the dynamical processes running on top of them. Here, we study the case of random walks, as a paradigm of diffusive processes, unfolding on temporal networks generated by a non-Poissonian activity driven dynamics. We derive analytic expressions for the steady state occupation probability and first passage time distribution in the infinite network size and strong aging limits, showing that the random walk dynamics on non-Markovian networks are fundamentally different from what is observed in Markovian networks. We found a particularly surprising behavior in the limit of diverging average inter-event time, in which the random walker feels the network as homogeneous, even though the activation probability of nodes is heterogeneously distributed. Our results are supported by extensive numerical simulations. We anticipate that our findings may be of interest among the researchers studying non-Markovian dynamics of time-evolving complex topologies.

Submitted 24 April, 2019; originally announced April 2019.

Comments: 10 pages, 4 figures

arXiv:1903.01290 [pdf, other]

doi 10.1109/LSP.2018.2874155

Traditional Machine Learning for Pitch Detection

Authors: Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet

Abstract: Pitch detection is a fundamental problem in speech processing as F0 is used in a large number of applications. Recent articles have proposed deep learning for robust pitch tracking. In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem. For both tasks, acoustic features from multiple domains and traditional machine learning methods are used. The discrimination power of existing and proposed features is assessed through mutual information. Multiple supervised and unsupervised approaches are compared. A significant relative reduction of voicing errors over the best baseline is obtained: 20% with the best clustering method (K-means) and 45% with a Multi-Layer Perceptron. For F0 contour estimation, the benefits of regression techniques are limited though. We investigate whether those objective gains translate in a parametric synthesis task. Clear perceptual preferences are observed for the proposed approach over two widely-used baselines (RAPT and DIO).

Submitted 4 March, 2019; originally announced March 2019.

Journal ref: IEEE Signal Processing Letters, Vol. 25, Issue 11, pp. 1745-1749, 2018

arXiv:1811.06296 [pdf, other]

Comprehensive evaluation of statistical speech waveform synthesis

Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote

Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the procedure on a separate group of testers. Finally, an analysis of the nature of speech errors of SSWS compared to hybrid unit selection synthesis is conducted to identify the strengths and weaknesses of SSWS. Having a deeper insight into SSWS allows us to better define the focus of future work to improve this new technology.

Submitted 11 December, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

arXiv:1811.06292 [pdf, other]

Towards achieving robust universal neural vocoding

Authors: Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-domain scenario when the recording conditions are studio-quality. When the recordings show significant changes in quality, or when moving towards non-speech vocalizations or singing, the vocoder still significantly outperforms speaker-dependent vocoders, but operates at a lower average relative MUSHRA of 75%. These results are shown to be consistent across languages, regardless of them being seen during training (e.g. English or Japanese) or unseen (e.g. Wolof, Swahili, Amharic).

Submitted 4 July, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

Comments: 4 pages, 1 extra for references. Accepted on Interspeech 2019

arXiv:1804.07476 [pdf, other]

doi 10.1103/PhysRevE.98.022303

Generalized Voter-like model on activity driven networks with attractiveness

Authors: Antoine Moinet, Alain Barrat, Romualdo Pastor Satorras

Abstract: We study the behavior of a generalized consensus dynamics on a temporal network of interactions, the activity driven network with attractiveness. In this temporal network model, agents are endowed with an intrinsic activity $a$, ruling the rate at which they generate connections, and an intrinsic attractiveness $b$, modulating the rate at which they receive connections. The consensus dynamics considered is a mixed voter/Moran dynamics. Each agent, either in state $0$ or $1$, modifies his/her state when connecting with a peer. Thus, an active agent copies his/her state from the peer (with probability $p$) or imposes his/her state to him/her (with the complementary probability $1-p$). Applying a heterogeneous mean-field approach, we derive a differential equation for the average density of voters with activity $a$ and attractiveness $b$ in state $1$, that we use to evaluate the average time to reach consensus and the exit probability, defined as the probability that a single agent with activity $a$ and attractiveness $b$ eventually imposes his/her state to a pool of initially unanimous population in the opposite state. We study a number of particular cases, finding an excellent agreement with numerical simulations of the model. Interestingly, we observe a symmetry between voter and Moran dynamics in pure activity driven networks and their static integrated counterparts that exemplifies the strong differences that a time-varying network can impose on dynamical processes.

Submitted 20 April, 2018; originally announced April 2018.

Journal ref: Phys. Rev. E 98, 022303 (2018)

arXiv:1804.00610 [pdf, other]

BATMAN : plate-forme blockchain pour l'authentification et la confiance dans les WSNs

Authors: Axel Moinet, Benoît Darties, Jean-Luc Baril

Abstract: Wireless Sensor networks (WSN) today suffer from a lack of security adapted to their multiple constraints, to which authentication and trust management solutions such as PGP only partially respond. On the one hand, the constraints of autonomy and co-operation of the nodes necessary to guarantee the coherence of the network do not require a distributed solution. On the other hand, the constraints of energy consumption and the low computing power of the nodes require the use of algorithms of low complexity (Zhang2014). To our knowledge, no solution can answer both these problems at the same time. We are proposing a new solution for securing WSNs named BATMAN (Blockchain Authentication and Trust Module in Ad-hoc Networks) to reply to these challenges. We present a model of centralized management for authentication and trust, implementable on the Tezos blockchain, and evaluate through simulation the confidence estimators proposed here.

Submitted 2 April, 2018; originally announced April 2018.

Comments: 4 pages, in French, 2 figures

arXiv:1801.06349 [pdf]

Proceedings of eNTERFACE 2015 Workshop on Intelligent Interfaces

Authors: Matei Mancas, Christian Frisson, Joëlle Tilmanne, Nicolas d'Alessandro, Petr Barborka, Furkan Bayansar, Francisco Bernard, Rebecca Fiebrink, Alexis Heloir, Edgar Hemery, Sohaib Laraba, Alexis Moinet, Fabrizio Nunnari, Thierry Ravet, Loïc Reboursière, Alvaro Sarasua, Mickaël Tits, Noé Tits, François Zajéga, Paolo Alborno, Ksenia Kolykhalova, Emma Frid, Damiano Malafronte, Lisanne Huis in't Veld, Hüseyin Cakmak, et al. (49 additional authors not shown)

Abstract: The 11th Summer Workshop on Multimodal Interfaces eNTERFACE 2015 was hosted by the Numediart Institute of Creative Technologies of the University of Mons from August 10th to September 2015. During the four weeks, students and researchers from all over the world came together in the Numediart Institute of the University of Mons to work on eight selected projects structured around intelligent interfaces. Eight projects were selected and their reports are shown here.

Submitted 19 January, 2018; originally announced January 2018.

Comments: 159 pages

arXiv:1710.05589 [pdf, other]

doi 10.1103/PhysRevE.97.012313

Effect of risk perception on epidemic spreading in temporal networks

Authors: Antoine Moinet, Alain Barrat, Romualdo Pastor Satorras

Abstract: Much progress in the understanding of epidemic spreading models has been made thanks to numerous modeling efforts and analytical and numerical studies, considering host populations with very different structures and properties, including complex and temporal interaction networks. Moreover, a number of recent studies have started to go beyond the assumption of an absence of coupling between the spread of a disease and the structure of the contacts on which it unfolds. Models including awareness of the spread have been proposed, to mimic possible precautionary measures taken by individuals that decrease their risk of infection, but have mostly considered static networks. Here, we adapt such a framework to the more realistic case of temporal networks of interactions between individuals. We study the resulting model by analytical and numerical means on both simple models of temporal networks and empirical time-resolved contact data. Analytical results show that the epidemic threshold is not affected by the awareness but that the prevalence can be significantly decreased. Numerical studies highlight however the presence of very strong finite-size effects, in particular for the more realistic synthetic temporal networks, resulting in a significant shift of the effective epidemic threshold in the presence of risk awareness. For empirical contact networks, the awareness mechanism leads as well to a shift in the effective threshold and to a strong reduction of the epidemic prevalence.

Submitted 16 October, 2017; originally announced October 2017.

Journal ref: Phys. Rev. E 97, 012313 (2018)

arXiv:1706.01730 [pdf, other]

Blockchain based trust & authentication for decentralized sensor networks

Authors: Axel Moinet, Benoît Darties, Jean-Luc Baril

Abstract: Sensor networks and Wireless Sensor Networks (WSN) are key components for the development of the Internet of Things. These networks are subject to two kinds of constraints: adaptability by means of mutability and evolutivity, and constrained node resources such as energy consumption, computational complexity or memory usage. In this context, none of the existing protocols and models allows reliable peer authentication and trust level management. In the field of virtual economic transactions, Bitcoin has proposed a new decentralized and evolutive way to model and acknowledge trust and data validity in a peer network by means of the blockchain. We propose a new security model and its protocol based on the blockchain technology to ensure validity and integrity of cryptographic authentication data and associated peer trust level, from the beginning to the end of the sensor network lifetime.

Submitted 6 June, 2017; originally announced June 2017.

Comments: 6 pages, double-column. Preprint version submitted to IEEE Security & Privacy, Special Issue on Blockchain

arXiv:1606.00593 [pdf, other]

doi 10.1103/PhysRevE.94.022316

Aging and percolation dynamics in a Non-Poissonian temporal network model

Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

Abstract: We present an exhaustive mathematical analysis of the recently proposed Non-Poissonian Activity Driven (NoPAD) model [Moinet et al. Phys. Rev. Lett., 114 (2015)], a temporal network model incorporating the empirically observed bursty nature of social interactions. We focus on the aging effects emerging from the Non-Poissonian dynamics of link activation, and on their effects on the topological properties of time-integrated networks, such as the degree distribution. Analytic expressions for the degree distribution of integrated networks as a function of time are derived, exploring both limits of vanishing and strong aging. We also address the percolation process occurring on these temporal networks, by computing the threshold for the emergence of a giant connected component, highlighting the aging dependence. Our analytic predictions are checked by means of extensive numerical simulations of the NoPAD model.

Submitted 2 June, 2016; originally announced June 2016.

arXiv:1412.0587 [pdf, other]

doi 10.1103/PhysRevLett.114.108701

Burstiness and aging in social temporal networks

Authors: Antoine Moinet, Michele Starnini, Romualdo Pastor-Satorras

Abstract: The presence of burstiness in temporal social networks, revealed by a power law form of the waiting time distribution of consecutive interactions, is expected to produce aging effects in the corresponding time-integrated network. Here we propose an analytically tractable model, in which interactions among the agents are ruled by a renewal process, and that is able to reproduce this aging behavior. We develop an analytic solution for the topological properties of the integrated network produced by the model, finding that the time translation invariance of the degree distribution is broken. We validate our predictions against numerical simulations, and we check for the presence of aging effects in an empirical temporal network, ruled by bursty social interactions.

Submitted 27 November, 2014; originally announced December 2014.

Journal ref: Phys. Rev. Lett. 114, 108701 (2015)

Showing 1–29 of 29 results for author: Moinet, A