Skip to main content

Showing 1–10 of 10 results for author: Karanasou, P

Searching in archive cs. Search in all archives.
.
  1. arXiv:2412.11985  [pdf, other

    cs.CL

    Speak & Improve Challenge 2025: Tasks and Baseline Systems

    Authors: Mengjie Qian, Kate Knill, Stefano Banno, Siyuan Tang, Penny Karanasou, Mark J. F. Gales, Diane Nicholls

    Abstract: This paper presents the "Speak & Improve Challenge 2025: Spoken Language Assessment and Feedback" -- a challenge associated with the ISCA SLaTE 2025 Workshop. The goal of the challenge is to advance research on spoken language assessment and feedback, with tasks associated with both the underlying technology and language learning feedback. Linked with the challenge, the Speak & Improve (S&I) Corpu… ▽ More

    Submitted 17 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

  2. arXiv:2309.01576  [pdf, other

    cs.CL cs.SD eess.AS

    A Comparative Analysis of Pretrained Language Models for Text-to-Speech

    Authors: Marcel Granero-Moya, Penny Karanasou, Sri Karlapati, Bastian Schnell, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: State-of-the-art text-to-speech (TTS) systems have utilized pretrained language models (PLMs) to enhance prosody and create more natural-sounding speech. However, while PLMs have been extensively researched for natural language understanding (NLU), their impact on TTS has been overlooked. In this study, we aim to address this gap by conducting a comparative analysis of different PLMs for two TTS t… ▽ More

    Submitted 4 September, 2023; originally announced September 2023.

    Comments: Accepted for presentation at the 12th ISCA Speech Synthesis Workshop (SSW) in Grenoble, France, from 26th to 28th August 2023

  3. arXiv:2307.07062  [pdf, other

    eess.AS cs.LG cs.SD

    Controllable Emphasis with zero data for text-to-speech

    Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

    Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im… ▽ More

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: In proceeding of 12th Speech Synthesis Workshop (SSW) 2023

  4. arXiv:2306.11327  [pdf, other

    eess.AS cs.SD

    eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer

    Authors: Ammar Abbas, Sri Karlapati, Bastian Schnell, Penny Karanasou, Marcel Granero Moya, Amith Nagaraj, Ayman Boustati, Nicole Peinelt, Alexis Moinet, Thomas Drugman

    Abstract: We present eCat, a novel end-to-end multispeaker model capable of: a) generating long-context speech with expressive and contextually appropriate prosody, and b) performing fine-grained prosody transfer between any pair of seen speakers. eCat is trained using a two-stage training approach. In Stage I, the model learns speaker-independent word-level prosody representations in an end-to-end fashion… ▽ More

    Submitted 20 June, 2023; originally announced June 2023.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2023

  5. arXiv:2206.14643  [pdf, other

    eess.AS cs.CL

    Simple and Effective Multi-sentence TTS with Expressive and Coherent Prosody

    Authors: Peter Makarov, Ammar Abbas, Mateusz Ɓajszczak, Arnaud Joly, Sri Karlapati, Alexis Moinet, Thomas Drugman, Penny Karanasou

    Abstract: Generating expressive and contextually appropriate prosody remains a challenge for modern text-to-speech (TTS) systems. This is particularly evident for long, multi-sentence inputs. In this paper, we examine simple extensions to a Transformer-based FastSpeech-like system, with the goal of improving prosody for multi-sentence TTS. We find that long context, powerful text features, and training on m… ▽ More

    Submitted 29 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  6. arXiv:2206.13443  [pdf, other

    eess.AS cs.SD

    CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer

    Authors: Sri Karlapati, Penny Karanasou, Mateusz Lajszczak, Ammar Abbas, Alexis Moinet, Peter Makarov, Ray Li, Arent van Korlaar, Simon Slangen, Thomas Drugman

    Abstract: In this paper, we present CopyCat2 (CC2), a novel model capable of: a) synthesizing speech with different speaker identities, b) generating speech with expressive and contextually appropriate prosody, and c) transferring prosody at fine-grained level between any pair of seen speakers. We do this by activating distinct parts of the network for different tasks. We train our model using a novel appro… ▽ More

    Submitted 27 June, 2022; originally announced June 2022.

    Comments: Accepted to be published in the Proceedings of InterSpeech 2022

  7. arXiv:2106.15649  [pdf, other

    eess.AS cs.LG cs.SD

    Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

    Authors: Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati, Thomas Drugman

    Abstract: We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale me… ▽ More

    Submitted 29 June, 2021; originally announced June 2021.

    Comments: Accepted for the 11th ISCA Speech Synthesis Workshop (SSW11)

  8. arXiv:2106.10229  [pdf, other

    eess.AS cs.LG cs.SD

    A learned conditional prior for the VAE acoustic space of a TTS system

    Authors: Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo Trueba, Thomas Drugman

    Abstract: Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for… ▽ More

    Submitted 14 June, 2021; originally announced June 2021.

    Comments: in Proceedings of Interspeech 2021

  9. arXiv:2011.02252  [pdf, other

    eess.AS cs.CL cs.SD

    Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech

    Authors: Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly, Penny Karanasou, Thomas Drugman

    Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis with contextually appropriate prosody. In Stage I, we learn a prosodic distribution at the sentence level from mel-spectrograms available during training. In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information ava… ▽ More

    Submitted 4 November, 2020; originally announced November 2020.

    Comments: 5 pages and 3 figures

  10. arXiv:1805.09119  [pdf, ps, other

    cs.CL

    Selecting Machine-Translated Data for Quick Bootstrapping of a Natural Language Understanding System

    Authors: Judith Gaspers, Penny Karanasou, Rajen Chatterjee

    Abstract: This paper investigates the use of Machine Translation (MT) to bootstrap a Natural Language Understanding (NLU) system for a new language for the use case of a large-scale voice-controlled device. The goal is to decrease the cost and time needed to get an annotated corpus for the new language, while still having a large enough coverage of user requests. Different methods of filtering MT data in or… ▽ More

    Submitted 23 May, 2018; originally announced May 2018.