Showing 1–7 of 7 results for author: Pokora, K

  1. Creating New Voices using Normalizing Flows

    Authors: Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa

    Abstract: Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. First…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: Interspeech 2022

    Journal ref: Interspeech 2022, 2958-2962

  2. arXiv:2309.08255  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech

    Authors: Dariusz Piotrowski, Renard Korzeniowski, Alessio Falai, Sebastian Cygert, Kamil Pokora, Georgi Tinchev, Ziyao Zhang, Kayoko Yanagisawa

    Abstract: In this work, we introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model. The proposed framework consists of 4 stages. In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker. In the third stage, the converted data is combined with th…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICONIP 2023

  3. arXiv:2307.16679  [pdf, other]

    eess.AS cs.CL cs.LG

    Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

    Authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba

    Abstract: Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosod…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, 5 tables. Interspeech 2023

  4. On granularity of prosodic representations in expressive text-to-speech

    Authors: Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafal Sienkiewicz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with variability of the data during training. Same text may correspond to various acoustic realizations, which is known as a one-to-many mapping problem in text-to-speech. Utterance, word, or phoneme-level representations are extracted from target signal in an auto-encoding setup, to complement phonet…

    Submitted 26 January, 2023; originally announced January 2023.

    Comments: Accepted to IEEE SLT 2022

    Journal ref: 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 892-899

  5. arXiv:2203.08009  [pdf, other]

    eess.AS cs.SD

    Text-free non-parallel many-to-many voice conversion using normalising flows

    Authors: Thomas Merritt, Abdelhamid Ezzerg, Piotr Biliński, Magdalena Proszewska, Kamil Pokora, Roberto Barra-Chicote, Daniel Korzekwa

    Abstract: Non-parallel voice conversion (VC) is typically achieved using lossy representations of the source speech. However, ensuring only speaker identity information is dropped whilst all other information from the source speech is retained is a large challenge. This is particularly challenging in the scenario where at inference-time we have no knowledge of the text being read, i.e., text-free VC. To mit…

    Submitted 15 March, 2022; originally announced March 2022.

  6. arXiv:2108.06270  [pdf, other]

    eess.AS cs.AI

    Enhancing audio quality for expressive Neural Text-to-Speech

    Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

    Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio a…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 6 pages, 4 figures, 2 tables, SSW 2021

  7. arXiv:2106.12896  [pdf, other]

    cs.SD cs.AI cs.LG

    Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

    Authors: Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa, Thomas Merritt

    Abstract: Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly express…

    Submitted 25 June, 2021; v1 submitted 24 June, 2021; originally announced June 2021.

    Comments: 6 pages, 5 figures. Accepted to Speech Synthesis Workshop (SSW) 2021