Showing 1–7 of 7 results for author: Raitio, T

Searching in archive cs.
  1. arXiv:2203.10637  [pdf, other]

    eess.AS cs.SD

    Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise

    Authors: Tuomo Raitio, Petko Petkov, Jiangchuan Li, Muhammed Shifas, Andrea Davis, Yannis Stylianou

    Abstract: We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model with normalized spectral tilt among other prosodic factors. Changing the spectral tilt paramete…

    Submitted 28 March, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

    Comments: 5 pages, 5 figures. Submitted to Interspeech 2022, revision includes more data in results and improved text

  2. arXiv:2110.03012  [pdf, other]

    eess.AS cs.CL

    Emphasis control for parallel neural TTS

    Authors: Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li

    Abstract: Recent parallel neural text-to-speech (TTS) synthesis methods are able to generate speech with high fidelity while maintaining high performance. However, these systems often lack control over the output prosody, thus restricting the semantic information conveyable for a given text. This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent spac…

    Submitted 29 March, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: 5 pages, 5 figures, submitted to Interspeech 2022

  3. arXiv:2110.02952  [pdf, other]

    eess.AS cs.CL

    Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS

    Authors: Tuomo Raitio, Jiangchuan Li, Shreyas Seshadri

    Abstract: Neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the synthetic speech often represents the average prosodic style of the database instead of having more versatile prosodic variation. Moreover, many models lack the ability to control the output prosody, which does not allow for different styles for the same text input. In this work, w…

    Submitted 22 March, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: 5 pages, 5 figures, preprint accepted to ICASSP 2022. arXiv admin note: text overlap with arXiv:2009.06775

  4. arXiv:2109.08710  [pdf]

    eess.AS cs.CL cs.PF cs.SD

    On-device neural speech synthesis

    Authors: Sivanand Achanta, Albert Antony, Ladan Golipour, Jiangchuan Li, Tuomo Raitio, Ramya Rasipuram, Francesco Rossi, Jennifer Shi, Jaimin Upadhyay, David Winarsky, Hepeng Zhang

    Abstract: Recent advances in text-to-speech (TTS) synthesis, such as Tacotron and WaveRNN, have made it possible to construct a fully neural network based TTS system, by coupling the two components together. Such a system is conceptually simple as it only takes grapheme or phoneme input, uses Mel-spectrogram as an intermediate feature, and directly generates speech samples. The system achieves quality equal…

    Submitted 17 September, 2021; originally announced September 2021.

    Comments: 7 pages, 2 figures, accepted to ASRU 2021

  5. arXiv:2101.05313  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Whispered and Lombard Neural Speech Synthesis

    Authors: Qiong Hu, Tobias Bleisch, Petko Petkov, Tuomo Raitio, Erik Marchi, Varun Lakshminarasimhan

    Abstract: It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1)…

    Submitted 13 January, 2021; originally announced January 2021.

    Comments: To appear in SLT 2021

  6. arXiv:2009.06775  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Controllable neural text-to-speech synthesis using intuitive prosodic features

    Authors: Tuomo Raitio, Ramya Rasipuram, Dan Castellani

    Abstract: Modern neural text-to-speech (TTS) synthesis can generate speech that is indistinguishable from natural speech. However, the prosody of generated utterances often represents the average prosodic style of the database instead of having wide prosodic variation. Moreover, the generated prosody is solely defined by the input text, which does not allow for different styles for the same sentence. In thi…

    Submitted 14 September, 2020; originally announced September 2020.

    Comments: Accepted for publication in Interspeech 2020

  7. arXiv:2006.04142  [pdf, other]

    eess.AS cs.CL cs.SD

    Parametric Representation for Singing Voice Synthesis: a Comparative Evaluation

    Authors: Onur Babacan, Thomas Drugman, Tuomo Raitio, Daniel Erro, Thierry Dutoit

    Abstract: Various parametric representations have been proposed to model the speech signal. While the performance of such vocoders is well-known in the context of speech processing, their extrapolation to singing voice synthesis might not be straightforward. The goal of this paper is twofold. First, a comparative subjective evaluation is performed across four existing techniques suitable for statistical par…

    Submitted 7 June, 2020; originally announced June 2020.