Search | arXiv e-print repository

Sounding Like a Winner? Prosodic Differences in Post-Match Interviews

Abstract: This study examines the prosodic characteristics associated with winning and losing in post-match tennis interviews. Additionally, this research explores the potential to classify match outcomes solely based on post-match interview recordings using prosodic features and self-supervised learning (SSL) representations. By analyzing prosodic elements such as pitch and intensity, alongside SSL models like Wav2Vec 2.0 and HuBERT, the aim is to determine whether an athlete has won or lost their match. Traditional acoustic features and deep speech representations are extracted from the data, and machine learning classifiers are employed to distinguish between winning and losing players. Results indicate that SSL representations effectively differentiate between winning and losing outcomes, capturing subtle speech patterns linked to emotional states. At the same time, prosodic cues -- such as pitch variability -- remain strong indicators of victory.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Accepted to Interspeech 2025

arXiv:2506.02239

Investigating the Impact of Word Informativeness on Speech Emotion Recognition

Authors: Sofoklis Kakouros

Abstract: In emotion recognition from speech, a key challenge lies in identifying speech signal segments that carry the most relevant acoustic variations for discerning specific emotions. Traditional approaches compute functionals for features such as energy and F0 over entire sentences or longer speech portions, potentially missing essential fine-grained variation in the long-form statistics. This research investigates the use of word informativeness, derived from a pre-trained language model, to identify semantically important segments. Acoustic features are then computed exclusively for these identified segments, enhancing emotion recognition accuracy. The methodology utilizes standard acoustic prosodic features, their functionals, and self-supervised representations. Results indicate a notable improvement in recognition performance when features are computed on segments selected based on word informativeness, underscoring the effectiveness of this approach.

Submitted 2 June, 2025; originally announced June 2025.

Comments: Accepted to Interspeech 2025

arXiv:2306.09814

Investigating the Utility of Surprisal from Large Language Models for Speech Synthesis Prosody

Authors: Sofoklis Kakouros, Juraj Šimko, Martti Vainio, Antti Suni

Abstract: This paper investigates the use of word surprisal, a measure of the predictability of a word in a given context, as a feature to aid speech synthesis prosody. We explore how word surprisal extracted from large language models (LLMs) correlates with word prominence, a signal-based measure of the salience of a word in a given discourse. We also examine how context length and LLM size affect the results, and how a speech synthesizer conditioned with surprisal values compares with a baseline system. To evaluate these factors, we conducted experiments using a large corpus of English text and LLMs of varying sizes. Our results show that word surprisal and word prominence are moderately correlated, suggesting that they capture related but distinct aspects of language use. We find that length of context and size of the LLM impact the correlations, but not in the direction anticipated, with longer contexts and larger LLMs generally underpredicting prominent words in a nearly linear manner. We demonstrate that, in line with these findings, a speech synthesizer conditioned with surprisal values provides a minimal improvement over the baseline with the results suggesting a limited effect of using surprisal values for eliciting appropriate prominence patterns.

Submitted 16 June, 2023; originally announced June 2023.

Comments: Accepted at SSW 2023

arXiv:2305.16040

The Power of Prosody and Prosody of Power: An Acoustic Analysis of Finnish Parliamentary Speech

Authors: Martti Vainio, Antti Suni, Juraj Šimko, Sofoklis Kakouros

Abstract: Parliamentary recordings provide a rich source of data for studying how politicians use speech to convey their messages and influence their audience. This provides a unique context for studying how politicians use speech, especially prosody, to achieve their goals. Here we analyzed a corpus of parliamentary speeches in the Finnish parliament between the years 2008-2020 and highlight methodological considerations related to the robustness of signal-based features with respect to varying recording conditions and corpus design. We also present results of long-term changes pertaining to speakers' status with respect to their party being in government or in opposition. Looking at large-scale averages of fundamental frequency - a robust prosodic feature - we found systematic changes in speech prosody with respect to opposition status and the election term. Reflecting a different level of urgency, members of the parliament have higher f0 at the beginning of the term or when they are in opposition.

Submitted 25 May, 2023; originally announced May 2023.

arXiv:2305.11864

North Sámi Dialect Identification with Self-supervised Speech Models

Authors: Sofoklis Kakouros, Katri Hiovain-Asikainen

Abstract: The North Sámi (NS) language encapsulates four primary dialectal variants that are related but that also have differences in their phonology, morphology, and vocabulary. The unique geopolitical location of NS speakers means that in many cases they are bilingual in Sámi as well as in the dominant state language: Norwegian, Swedish, or Finnish. This enables us to study the NS variants both with respect to the spoken state language and their acoustic characteristics. In this paper, we investigate an extensive set of acoustic features, including MFCCs and prosodic features, as well as state-of-the-art self-supervised representations, namely, XLS-R, WavLM, and HuBERT, for the automatic detection of the four NS variants. In addition, we examine how the majority state language is reflected in the dialects. Our results show that NS dialects are influenced by the state language and that the four dialects are separable, reaching high classification accuracy, especially with the XLS-R model.

Submitted 19 May, 2023; originally announced May 2023.

Comments: Accepted at Interspeech 2023

arXiv:2211.01756

Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

Authors: Sofoklis Kakouros, Themos Stafylakis, Ladislav Mosner, Lukas Burget

Abstract: When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech enabling state-of-the-art results in many downstream tasks including emotion recognition. However, better ways of aggregating the information across time need to be considered as the relevant emotion information is likely to appear piecewise and not uniformly across the signal. For the labels, we need to take into account that there is a substantial degree of noise that comes from the subjective human annotations. In this paper, we propose a novel approach to attentive pooling based on correlations between the representations' coefficients combined with label smoothing, a method aiming to reduce the confidence of the classifier on the training labels. We evaluate our proposed approach on the benchmark dataset IEMOCAP, and demonstrate high performance surpassing that in the literature. The code to reproduce the results is available at github.com/skakouros/s3prl_attentive_correlation.

Submitted 3 November, 2022; originally announced November 2022.

Comments: Submitted to IEEE-ICASSP 2023

arXiv:2210.09513

Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Authors: Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alternative way of extracting speaker and emotion information from self-supervised trained models, based on the correlations between the coefficients of the representations - correlation pooling. We show improvements over mean pooling and further gains when the pooling methods are combined via fusion. The code is available at github.com/Lamomal/s3prl_correlation.

Submitted 15 October, 2022; originally announced October 2022.

Comments: Accepted at IEEE-SLT 2022

arXiv:2006.15967

doi 10.21437/SpeechProsody.2020-192

Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis

Authors: Antti Suni, Sofoklis Kakouros, Martti Vainio, Juraj Šimko

Abstract: Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible for end-to-end systems with a single sentence as their synthesis domain. One of the possible solutions might be conditioning the synthesized speech by explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels capturing word-level prominence and phrasal boundary strength can result in more accurate realization of sentence prosody. We use an automatic wavelet-based technique to extract such labels from speech material, and use them as an input to a tacotron-like synthesis system alongside textual information. The results of objective evaluation of synthesized speech show that using the prosodic labels significantly improves the output in terms of faithfulness of f0 and energy contours, in comparison with state-of-the-art implementations.

Submitted 29 June, 2020; originally announced June 2020.

arXiv:2003.10183

doi 10.21437/SpeechProsody.2020-128

Dialect Identification of Spoken North Sámi Language Varieties Using Prosodic Features

Authors: Sofoklis Kakouros, Katri Hiovain, Martti Vainio, Juraj Šimko

Abstract: This work explores the application of various supervised classification approaches using prosodic information for the identification of spoken North Sámi language varieties. Dialects are language varieties that enclose characteristics specific for a given region or community. These characteristics reflect segmental and suprasegmental (prosodic) differences but also high-level properties such as lexical and morphosyntactic. One aspect that is of particular interest and that has not been studied extensively is how the differences in prosody may underpin the potential differences among different dialects. To address this, this work focuses on investigating the standard acoustic prosodic features of energy, fundamental frequency, spectral tilt, duration, and their combinations, using sequential and context-independent supervised classification methods, and evaluated separately over two different units in speech: words and syllables. The primary aim of this work is to gain a better understanding of the role of prosody in distinguishing among the different language varieties. Our results show that prosodic information holds an important role in distinguishing between the five areal varieties of North Sámi where the inclusion of contextual information for all acoustic prosodic features is critical for the identification of dialects for words and syllables.

Submitted 23 March, 2020; originally announced March 2020.

Showing 1–9 of 9 results for author: Kakouros, S