Showing 1–5 of 5 results for author: Ekstedt, E

  1. arXiv:2403.06487

    cs.CL cs.SD eess.AS

    Multilingual Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The re…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work

  2. arXiv:2401.04868

    cs.CL cs.HC cs.SD eess.AS

    Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input contex…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  3. arXiv:2305.17971

    eess.AS cs.SD

    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

    Authors: Erik Ekstedt, Siyang Wang, Éva Székely, Joakim Gustafson, Gabriel Skantze

    Abstract: Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial, and two open-source, Text-To…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023, 5 pages, 2 figures, 4 tables

  4. arXiv:2209.05161

    eess.AS

    How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models

    Authors: Erik Ekstedt, Gabriel Skantze

    Abstract: Turn-taking is a fundamental aspect of human communication and can be described as the ability to take turns, project upcoming turn shifts, and supply backchannels at appropriate locations throughout a conversation. In this work, we investigate the role of prosody in turn-taking using the recently proposed Voice Activity Projection model, which incrementally models the upcoming speech activity of…

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: SIGDIAL 2022 Best Paper Award Winner

  5. arXiv:2205.09812

    eess.AS cs.SD

    Voice Activity Projection: Self-supervised Learning of Turn-taking Events

    Authors: Erik Ekstedt, Gabriel Skantze

    Abstract: The modeling of turn-taking in dialog can be viewed as the modeling of the dynamics of voice activity of the interlocutors. We extend prior work and define the predictive task of Voice Activity Projection, a general, self-supervised objective, as a way to train turn-taking models without the need of labeled data. We highlight a theoretical weakness with prior approaches, arguing for the need of mo…

    Submitted 19 May, 2022; originally announced May 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 4 figures