Showing 1–9 of 9 results for author: Klimkov, V

  1. arXiv:2306.11662

    eess.AS

    Expressive Machine Dubbing Through Phrase-level Cross-lingual Prosody Transfer

    Authors: Jakub Swiatkowski, Duo Wang, Mikolaj Babianski, Giuseppe Coccia, Patrick Lumban Tobing, Ravichander Vipperla, Viacheslav Klimkov, Vincent Pollet

    Abstract: Speech generation for machine dubbing adds complexity to conventional Text-To-Speech solutions as the generated output is required to match the expressiveness, emotion and speaking rate of the source content. Capturing and transferring details and variations in prosody is a challenge. We introduce phrase-level cross-lingual prosody transfer for expressive multi-lingual machine dubbing. The propose…

    Submitted 21 June, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  2. On granularity of prosodic representations in expressive text-to-speech

    Authors: Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with variability of the data during training. Same text may correspond to various acoustic realizations, which is known as a one-to-many mapping problem in text-to-speech. Utterance, word, or phoneme-level representations are extracted from target signal in an auto-encoding setup, to complement phonet…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Accepted to IEEE SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 892-899

  3. arXiv:2108.06270

    eess.AS cs.AI

    Enhancing audio quality for expressive Neural Text-to-Speech

    Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

    Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio a…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 6 pages, 4 figures, 2 tables, SSW 2021

  4. Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

    Authors: Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

    Abstract: This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021, 5 pages, 3 figures

  5. arXiv:2102.01106

    eess.AS cs.CL cs.SD

    Universal Neural Vocoding with Parallel WaveNet

    Authors: Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We…

    Submitted 15 February, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

    Comments: 5 pages, 2 figures. Accepted to ICASSP 2021

  6. CopyCat: Many-to-Many Fine-Grained Prosody Transfer for Neural Text-to-Speech

    Authors: Sri Karlapati, Alexis Moinet, Arnaud Joly, Viacheslav Klimkov, Daniel Sáez-Trigueros, Thomas Drugman

    Abstract: Prosody Transfer (PT) is a technique that aims to use the prosody from a source audio as a reference while synthesising speech. Fine-grained PT aims at capturing prosodic aspects like rhythm, emphasis, melody, duration, and loudness, from a source audio at a very granular level and transferring them when synthesising speech in a different target speaker's voice. Current approaches for fine-grained…

    Submitted 30 April, 2020; originally announced April 2020.

    Journal ref: INTERSPEECH 2020: 4387-4391

  7. arXiv:1907.02479

    eess.AS cs.CL

    Fine-grained robust prosody transfer for single-speaker neural text-to-speech

    Authors: Viacheslav Klimkov, Srikanth Ronanki, Jonas Rohnke, Thomas Drugman

    Abstract: We present a neural text-to-speech system for fine-grained prosody transfer from one speaker to another. Conventional approaches for end-to-end prosody transfer typically use either fixed-dimensional or variable-length prosody embedding via a secondary attention to encode the reference signal. However, when trained on a single-speaker dataset, the conventional prosody transfer systems are not robu…

    Submitted 4 July, 2019; originally announced July 2019.

    Comments: 5 pages, 7 figures, Accepted for Interspeech 2019

  8. arXiv:1903.01290

    cs.SD cs.CL eess.AS

    Traditional Machine Learning for Pitch Detection

    Authors: Thomas Drugman, Goeric Huybrechts, Viacheslav Klimkov, Alexis Moinet

    Abstract: Pitch detection is a fundamental problem in speech processing as F0 is used in a large number of applications. Recent articles have proposed deep learning for robust pitch tracking. In this paper, we consider voicing detection as a classification problem and F0 contour estimation as a regression problem. For both tasks, acoustic features from multiple domains and traditional machine learning metho…

    Submitted 4 March, 2019; originally announced March 2019.

    Journal ref: IEEE Signal Processing Letters, Vol. 25, Issue 11, pp. 1745-1749, 2018

  9. arXiv:1811.06296

    eess.AS cs.SD

    Comprehensive evaluation of statistical speech waveform synthesis

    Authors: Thomas Merritt, Bartosz Putrycz, Adam Nadolski, Tianjun Ye, Daniel Korzekwa, Wiktor Dolecki, Thomas Drugman, Viacheslav Klimkov, Alexis Moinet, Andrew Breen, Rafal Kuklinski, Nikko Strom, Roberto Barra-Chicote

    Abstract: Statistical TTS systems that directly predict the speech waveform have recently reported improvements in synthesis quality. This investigation evaluates Amazon's statistical speech waveform synthesis (SSWS) system. An in-depth evaluation of SSWS is conducted across a number of domains to better understand the consistency in quality. The results of this evaluation are validated by repeating the pro…

    Submitted 11 December, 2018; v1 submitted 15 November, 2018; originally announced November 2018.