Showing 1–6 of 6 results for author: Beringer, G

  1. Investigating self-supervised features for expressive, multilingual voice conversion

    Authors: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Grzegorz Beringer, Iván Vallés-Pérez, Roberto Barra-Chicote, Biel Tura-Vecino, Adam Gabryś, Piotr Bilinski, Thomas Merritt, Jaime Lorenzo-Trueba

    Abstract: Voice conversion (VC) systems are widely used for several applications, from speaker anonymisation to personalised speech synthesis. Supervised approaches learn a mapping between different speakers using parallel data, which is expensive to produce. Unsupervised approaches are typically trained to reconstruct the input signal, which is composed of the content and the speaker information. Disentang…

    Submitted 13 May, 2025; originally announced May 2025.

    Comments: Published as a conference paper at ICASSP 2024

    Journal ref: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

  2. arXiv:2402.03407

    eess.AS cs.CL cs.LG

    Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

    Authors: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba

    Abstract: Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping or speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) archite…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 10 pages, 1 figure, 3 tables

  3. arXiv:2307.12445

    cs.SD cs.AI eess.AS

    SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces

    Authors: Ivan Vallés-Pérez, Grzegorz Beringer, Piotr Bilinski, Gary Cook, Roberto Barra-Chicote

    Abstract: Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where th…

    Submitted 30 January, 2024; v1 submitted 23 July, 2023; originally announced July 2023.

    Comments: In proceedings of the 26th European Conference on Artificial Intelligence ECAI 2023. 8 pages + 1 appendix page

  4. arXiv:2207.01454

    eess.AS cs.CL cs.LG

    GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion

    Authors: Magdalena Proszewska, Grzegorz Beringer, Daniel Sáez-Trigueros, Thomas Merritt, Abdelhamid Ezzerg, Roberto Barra-Chicote

    Abstract: In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion. We build on Glow-TTS, which provides an architecture that enables use of linguistic features during training without the necessity of using them for VC inference. We consider two versions of our model: GlowVC-conditional and GlowVC-explicit. GlowVC-conditional models…

    Submitted 4 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  5. arXiv:2106.05762

    cs.SD cs.CL eess.AS

    Improving multi-speaker TTS prosody variance with a residual encoder and normalizing flows

    Authors: Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo

    Abstract: Text-to-speech systems recently achieved almost indistinguishable quality from human speech. However, the prosody of those systems is generally flatter than natural speech, producing samples with low expressiveness. Disentanglement of speaker id and prosody is crucial in text-to-speech systems to improve on naturalness and produce more variable syntheses. This paper proposes a new neural text-to-s…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: in Proceedings of Interspeech 2021 conference

  6. arXiv:2012.14788

    eess.AS cs.SD

    Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

    Authors: Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

    Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learni…

    Submitted 7 June, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021