Showing 1–50 of 254 results for author: Schuller, B

  1. arXiv:2506.13127  [pdf, ps, other]

    cs.SD eess.AS

    I$^2$S-TFCKD: Intra-Inter Set Knowledge Distillation with Time-Frequency Calibration for Speech Enhancement

    Authors: Jiaming Cheng, Ruiyu Liang, Chao Xu, Ye Ni, Wei Zhou, Björn W. Schuller, Xiaoshuai Hao

    Abstract: In recent years, complexity compression of neural network (NN)-based speech enhancement (SE) models has gradually attracted the attention of researchers, especially in scenarios with limited hardware resources or strict latency requirements. The main difficulties and challenges lie in achieving a balance between complexity and performance according to the characteristics of the task. In this paper…

    Submitted 16 June, 2025; originally announced June 2025.

    Comments: submitted to IEEE Transactions on Neural Networks and Learning Systems

  2. arXiv:2505.24493  [pdf, ps, other]

    cs.AI cs.SD eess.AS

    MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

    Authors: Xin Jing, Jiadong Wang, Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

    Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scala…

    Submitted 30 May, 2025; originally announced May 2025.

  3. arXiv:2505.22863  [pdf, other]

    cs.HC cs.CL

    Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge

    Authors: Yupei Li, Shuaijie Shao, Manuel Milling, Björn W. Schuller

    Abstract: Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rath…

    Submitted 28 May, 2025; originally announced May 2025.

  4. arXiv:2505.13814  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    Articulatory Feature Prediction from Surface EMG during Speech Production

    Authors: Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn Schuller, Louis Goldstein, Shrikanth Narayanan

    Abstract: We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that t…

    Submitted 28 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted for Interspeech2025

  5. arXiv:2505.10034  [pdf, ps, other]

    cs.AI

    The First MPDD Challenge: Multimodal Personality-aware Depression Detection

    Authors: Changzeng Fu, Zelin Fu, Qi Zhang, Xinhe Kuang, Jiacheng Dong, Kaifeng Su, Yikai Su, Wenbo Shi, Junfeng Yao, Yuliang Zhao, Shiqi Zhao, Jiadong Wang, Siyang Song, Chaoran Liu, Yuichiro Yoshikawa, Björn Schuller, Hiroshi Ishiguro

    Abstract: Depression is a widespread mental health issue affecting diverse age groups, with notable prevalence among college students and the elderly. However, existing datasets and detection methods primarily focus on young adults, neglecting the broader age spectrum and individual differences that influence depression manifestation. Current approaches often establish a direct mapping between multimodal da…

    Submitted 28 May, 2025; v1 submitted 15 May, 2025; originally announced May 2025.

    Comments: This paper has been accepted as part of the MPDD Challenge in the ACMMM 2025 Grand Challenge

    MSC Class: 68T07 ACM Class: I.2.0; H.5.1

  6. arXiv:2504.20776  [pdf]

    cs.SD cs.AI eess.AS

    ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North, Central and temperate Western Europe

    Authors: David Funosas, Elodie Massol, Yves Bas, Svenja Schmidt, Dominik Arend, Alexander Gebhard, Luc Barbaro, Sebastian König, Rafael Carbonell Font, David Sannier, Fernand Deroussen, Jérôme Sueur, Christian Roesti, Tomi Trilar, Wolfgang Forstmeier, Lucas Roger, Eloïsa Matheu, Piotr Guzik, Julien Barataud, Laurent Pelozuelo, Stéphane Puissant, Sandra Mueller, Björn Schuller, Jose M. Montoya, Andreas Triantafyllopoulos, et al. (1 additional author not shown)

    Abstract: Currently available tools for the automated acoustic recognition of European insects in natural soundscapes are limited in scope. Large and ecologically heterogeneous acoustic datasets are currently needed for these algorithms to cross-contextually recognize the subtle and complex acoustic signatures produced by each species, thus making the availability of such datasets a key requisite for their…

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 3 Figures + 2 Supplementary Figures, 2 Tables + 3 Supplementary Tables

  7. arXiv:2504.19423  [pdf, other]

    cs.HC

    MER 2025: When Affective Computing Meets Large Language Models

    Authors: Zheng Lian, Rui Liu, Kele Xu, Bin Liu, Xuefei Liu, Yazhou Zhang, Xin Liu, Yong Li, Zebang Cheng, Haolin Zuo, Ziyang Ma, Xiaojiang Peng, Xie Chen, Ya Li, Erik Cambria, Guoying Zhao, Björn W. Schuller, Jianhua Tao

    Abstract: MER2025 is the third year of our MER series of challenges, aiming to bring together researchers in the affective computing community to explore emerging trends and future directions in the field. Previously, MER2023 focused on multi-label learning, noise robustness, and semi-supervised learning, while MER2024 introduced a new track dedicated to open-vocabulary emotion recognition. This year, MER20…

    Submitted 29 April, 2025; v1 submitted 27 April, 2025; originally announced April 2025.

  8. arXiv:2503.21419  [pdf, other]

    cs.AI

    Neuroplasticity in Artificial Intelligence -- An Overview and Inspirations on Drop In & Out Learning

    Authors: Yupei Li, Manuel Milling, Björn W. Schuller

    Abstract: Artificial Intelligence (AI) has achieved new levels of performance and spread in public usage with the rise of deep neural networks (DNNs). Initially inspired by human neurons and their connections, NNs have become the foundation of AI models for many advanced architectures. However, some of the most integral processes in the human brain, particularly neurogenesis and neuroplasticity in addition…

    Submitted 25 April, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

  9. arXiv:2503.20919  [pdf, other]

    cs.CL cs.SD eess.AS

    GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

    Authors: Yupei Li, Qiyang Sun, Sunil Munthumoduku Krishna Murthy, Emran Alturki, Björn W. Schuller

    Abstract: Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverage…

    Submitted 26 March, 2025; originally announced March 2025.

  10. arXiv:2503.09368  [pdf, other]

    cs.CV eess.IV

    PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling

    Authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller

    Abstract: We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we…

    Submitted 12 March, 2025; originally announced March 2025.

  11. arXiv:2501.12122  [pdf, other]

    cs.SD eess.AS

    DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

    Authors: Yupei Li, Zifan Wei, Heng Yu, Huichi Zhou, Björn W. Schuller

    Abstract: Code-switching, the alternation between two or more languages within communication, poses great challenges for Automatic Speech Recognition (ASR) systems. Existing models and datasets are limited in their ability to effectively handle these challenges. To address this gap and foster progress in code-switching ASR research, we introduce the DOTA-ME-CS: Daily oriented text audio Mandarin-English cod…

    Submitted 21 January, 2025; originally announced January 2025.

  12. arXiv:2501.12050  [pdf, other]

    cs.LG cs.SD eess.AS

    Representation Learning with Parameterised Quantum Circuits for Advancing Speech Emotion Recognition

    Authors: Thejan Rajapakshe, Rajib Rana, Farina Riaz, Sara Khalifa, Björn W. Schuller

    Abstract: Speech Emotion Recognition (SER) is a complex and challenging task in human-computer interaction due to the intricate dependencies of features and the overlapping nature of emotional expressions conveyed through speech. Although traditional deep learning methods have shown effectiveness, they often struggle to capture subtle emotional variations and overlapping states. This paper introduces a hybr…

    Submitted 28 January, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

  13. arXiv:2501.10525  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids

    Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Michael Müller, Hendrik Schröter, Björn W. Schuller

    Abstract: The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a `one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper it…

    Submitted 23 January, 2025; v1 submitted 17 January, 2025; originally announced January 2025.

    Comments: Accepted at ICASSP 2025. 5 pages, 3 figures

    ACM Class: I.2.6; H.5.5; I.5.1; I.4.8

  14. arXiv:2501.04292  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

    Authors: Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto

    Abstract: The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple…

    Submitted 31 May, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: 5 pages, 1 figure and 2 tables. Submitted to INTERSPEECH 2025. For MADUV Challenge 2025

  15. arXiv:2501.01987  [pdf]

    cs.CV cs.AI cs.CY cs.LG

    Gender Bias in Text-to-Video Generation Models: A case study of Sora

    Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain

    Abstract: The advent of text-to-video generation models has revolutionized content creation as it produces high-quality videos from textual prompts. However, concerns regarding inherent biases in such models have prompted scrutiny, particularly regarding gender representation. Our study investigates the presence of gender bias in OpenAI's Sora, a state-of-the-art text-to-video generation model. We uncover s…

    Submitted 10 January, 2025; v1 submitted 30 December, 2024; originally announced January 2025.

    Comments: 7 pages, 3 figures

  16. arXiv:2412.15114  [pdf, other]

    cs.AI cs.CY

    Towards Friendly AI: A Comprehensive Review and New Perspectives on Human-AI Alignment

    Authors: Qiyang Sun, Yupei Li, Emran Alturki, Sunil Munthumoduku Krishna Murthy, Björn W. Schuller

    Abstract: As Artificial Intelligence (AI) continues to advance rapidly, Friendly AI (FAI) has been proposed to advocate for more equitable and fair development of AI. Despite its importance, there is a lack of comprehensive reviews examining FAI from an ethical perspective, as well as limited discussion on its potential applications and future directions. This paper addresses these gaps by providing a thoro…

    Submitted 19 December, 2024; originally announced December 2024.

    ACM Class: I.2.0; K.4.0

  17. arXiv:2412.13421  [pdf, other]

    cs.SD eess.AS

    Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

    Authors: Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller

    Abstract: Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions.…

    Submitted 17 December, 2024; originally announced December 2024.

  18. arXiv:2412.12679  [pdf, other]

    cs.CL

    Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features

    Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller

    Abstract: The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-lev…

    Submitted 17 December, 2024; originally announced December 2024.

  19. arXiv:2412.11943  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks

    Authors: Simon Rampp, Andreas Triantafyllopoulos, Manuel Milling, Björn W. Schuller

    Abstract: This work introduces the key operating principles for autrainer, our new deep learning training framework for computer audition tasks. autrainer is a PyTorch-based toolkit that allows for rapid, reproducible, and easily extensible training on a variety of different computer audition tasks. Concretely, autrainer offers low-code training and supports a wide range of neural networks as well as prepro…

    Submitted 10 April, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

  20. arXiv:2412.11795  [pdf, other]

    cs.CL cs.SD eess.AS

    ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

    Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller

    Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a f…

    Submitted 19 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  21. arXiv:2412.06001  [pdf, other]

    cs.SD cs.MM eess.AS

    M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases

    Authors: Yupei Li, Hanqian Li, Lucia Specia, Björn W. Schuller

    Abstract: Machine-generated music (MGM) has emerged as a powerful tool with applications in music therapy, personalised editing, and creative inspiration for the music community. However, its unregulated use threatens the entertainment, education, and arts sectors by diminishing the value of high-quality human compositions. Detecting machine-generated music (MGMD) is, therefore, critical to safeguarding the…

    Submitted 8 December, 2024; originally announced December 2024.

  22. arXiv:2412.01829  [pdf, other]

    cs.LG cs.CV

    Explainable Artificial Intelligence for Medical Applications: A Review

    Authors: Qiyang Sun, Alican Akman, Björn W. Schuller

    Abstract: The continuous development of artificial intelligence (AI) theory has propelled this field to unprecedented heights, owing to the relentless efforts of scholars and researchers. In the medical realm, AI takes a pivotal role, leveraging robust machine learning (ML) algorithms. AI technology in medical imaging aids physicians in X-ray, computed tomography (CT) scans, and magnetic resonance imaging (…

    Submitted 15 November, 2024; originally announced December 2024.

  23. arXiv:2412.00571  [pdf, other]

    cs.SD eess.AS

    From Audio Deepfake Detection to AI-Generated Music Detection -- A Pathway and Overview

    Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller

    Abstract: As Artificial Intelligence (AI) technologies continue to evolve, their use in generating realistic, contextually appropriate content has expanded into various domains. Music, an art form and medium for entertainment, deeply rooted into human culture, is seeing an increased involvement of AI into its production. However, despite the effective application of AI music generation (AIGM) tools, the unr…

    Submitted 10 December, 2024; v1 submitted 30 November, 2024; originally announced December 2024.

  24. arXiv:2412.00312  [pdf, other]

    cs.SD cs.AI eess.AS

    Raw Audio Classification with Cosine Convolutional Neural Network (CosCovNN)

    Authors: Kazi Nazmul Haque, Rajib Rana, Tasnim Jarin, Bjorn W. Schuller Jr

    Abstract: This study explores the field of audio classification from raw waveform using Convolutional Neural Networks (CNNs), a method that eliminates the need for extracting specialised features in the pre-processing step. Unlike recent trends in literature, which often focuses on designing frontends or filters for only the initial layers of CNNs, our research introduces the Cosine Convolutional Neural Net…

    Submitted 29 November, 2024; originally announced December 2024.

  25. arXiv:2411.11880  [pdf, other]

    cs.CY

    Large language models for mental health

    Authors: Andreas Triantafyllopoulos, Yannik Terhorst, Iosif Tsangko, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Lennart Seizer, Ayal Klein, Jenny Chim, Dana Atzil-Slonim, Maria Liakata, Markus Bühner, Johanna Löchner, Björn Schuller

    Abstract: Digital technologies have long been explored as a complement to standard procedure in mental health research and practice, ranging from the management of electronic health records to app-based interventions. The recent emergence of large language models (LLMs), both proprietary and open-source ones, represents a major new opportunity on that front. Yet there is still a divide between the community…

    Submitted 4 November, 2024; originally announced November 2024.

  26. arXiv:2411.11541  [pdf, other]

    cs.SD eess.AS

    Using voice analysis as an early indicator of risk for depression in young adults

    Authors: Klaus R. Scherer, Felix Burkhardt, Uwe D. Reichel, Florian Eyben, Björn W. Schuller

    Abstract: Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based pr…

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: Submitted to ToaC

  27. arXiv:2411.00973  [pdf, other]

    cs.LG

    Does the Definition of Difficulty Matter? Scoring Functions and their Role for Curriculum Learning

    Authors: Simon Rampp, Manuel Milling, Andreas Triantafyllopoulos, Björn W. Schuller

    Abstract: Curriculum learning (CL) describes a machine learning training strategy in which samples are gradually introduced into the training process based on their difficulty. Despite a partially contradictory body of evidence in the literature, CL finds popularity in deep learning research due to its promise of leveraging human-inspired curricula to achieve higher model performance. Yet, the subjectivity…

    Submitted 1 November, 2024; originally announced November 2024.

  28. arXiv:2410.11120  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio-based Kinship Verification Using Age Domain Conversion

    Authors: Qiyang Sun, Alican Akman, Xin Jing, Manuel Milling, Björn W. Schuller

    Abstract: Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we design the notion of an "age-standar…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 4 pages, 2 figures, submitted to IEEE Signal Processing Letters

    MSC Class: 68T10 ACM Class: I.5.4; I.2.6

  29. arXiv:2410.07530  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Explanation Synthesis with Generative Foundation Models

    Authors: Alican Akman, Qiyang Sun, Björn W. Schuller

    Abstract: The increasing success of audio foundation models across various tasks has led to a growing need for improved interpretability to understand their intricate decision-making processes better. Existing methods primarily focus on explaining these models by attributing importance to elements within the input space based on their influence on the final decision. In this paper, we introduce a novel audi…

    Submitted 9 October, 2024; originally announced October 2024.

  30. arXiv:2409.20255  [pdf, other]

    cs.CV

    PerCo (SD): Open Perceptual Compression

    Authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller

    Abstract: We introduce PerCo (SD), a perceptual image compression method based on Stable Diffusion v2.1, targeting the ultra-low bit range. PerCo (SD) serves as an open and competitive alternative to the state-of-the-art method PerCo, which relies on a proprietary variant of GLIDE and remains closed to the public. In this work, we review the theoretical foundations, discuss key engineering decisions in adap…

    Submitted 30 September, 2024; originally announced September 2024.

  31. arXiv:2409.17392  [pdf, other]

    cs.LG q-fin.TR

    Trading through Earnings Seasons using Self-Supervised Contrastive Representation Learning

    Authors: Zhengxin Joseph Ye, Bjoern Schuller

    Abstract: Earnings release is a key economic event in the financial markets and crucial for predicting stock movements. Earnings data gives a glimpse into how a company is doing financially and can hint at where its stock might go next. However, the irregularity of its release cycle makes it a challenge to incorporate this data in a medium-frequency algorithmic trading model and the usefulness of this data…

    Submitted 25 September, 2024; originally announced September 2024.

  32. arXiv:2409.08907  [pdf, other]

    cs.AI cs.CL cs.CY

    Affective Computing Has Changed: The Foundation Model Disruption

    Authors: Björn Schuller, Adria Mallol-Ragolta, Alejandro Peña Almansa, Iosif Tsangko, Mostafa M. Amin, Anastasia Semertzidou, Lukas Christ, Shahin Amiriparian

    Abstract: The dawn of Foundation Models has on the one hand revolutionised a wide range of research problems, and, on the other hand, democratised the access and use of AI-based tools by the general public. We even observe an incursion of these models into disciplines related to human psychology, such as the Affective Computing domain, suggesting their affective, emerging capabilities. In this work, we aim…

    Submitted 13 September, 2024; originally announced September 2024.

  33. arXiv:2409.06451  [pdf, other]

    cs.SD eess.AS

    Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

    Authors: Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller

    Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a…

    Submitted 10 September, 2024; originally announced September 2024.

  34. arXiv:2409.00105  [pdf]

    cs.CL cs.AI cs.LG

    Negation Blindness in Large Language Models: Unveiling the NO Syndrome in Image Generation

    Authors: Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Björn W. Schuller, Amir Hussain

    Abstract: Foundational Large Language Models (LLMs) have changed the way we perceive technology. They have been shown to excel in tasks ranging from poem writing and coding to essay generation and puzzle solving. With the incorporation of image generation capability, they have become more comprehensive and versatile AI tools. At the same time, researchers are striving to identify the limitations of these to…

    Submitted 4 September, 2024; v1 submitted 27 August, 2024; originally announced September 2024.

    Comments: 15 pages, 7 figures

  35. arXiv:2408.13920  [pdf, other]

    cs.SD eess.AS

    Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

    Authors: Dionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Björn Schuller

    Abstract: Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to non converging consensus of annotator opinions. However, Concordance Correlati…

    Submitted 22 November, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

    Comments: apply review

  36. arXiv:2408.06264  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

    Authors: Manuel Milling, Shuo Liu, Andreas Triantafyllopoulos, Ilhan Aslan, Björn W. Schuller

    Abstract: Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solu…

    Submitted 12 August, 2024; originally announced August 2024.

  37. Abusive Speech Detection in Indic Languages Using Acoustic Features

    Authors: Anika A. Spiesberger, Andreas Triantafyllopoulos, Iosif Tsangko, Björn W. Schuller

    Abstract: Abusive content in online social networks is a well-known problem that can cause serious psychological harm and incite hatred. The ability to upload audio data increases the importance of developing methods to detect abusive content in speech recordings. However, simply transferring the mechanisms from written abuse detection would ignore relevant information such as emotion and tone. In addition,…

    Submitted 30 July, 2024; originally announced July 2024.

    Journal ref: Proc. INTERSPEECH 2023, 2683-2687

  38. arXiv:2407.15672  [pdf, other]

    cs.SD eess.AS

    Computer Audition: From Task-Specific Machine Learning to Foundation Models

    Authors: Andreas Triantafyllopoulos, Iosif Tsangko, Alexander Gebhard, Annamaria Mesaros, Tuomas Virtanen, Björn Schuller

    Abstract: Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-availab…

    Submitted 22 July, 2024; originally announced July 2024.

  39. arXiv:2407.11012  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment

    Authors: Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller

    Abstract: In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing inter…

    Submitted 26 June, 2024; originally announced July 2024.

    Comments: accepted at INTERSPEECH 2024

    MSC Class: 68T10 ACM Class: J.3

  40. arXiv:2407.02751  [pdf, other]

    cs.CL cs.AI

    Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset

    Authors: Rui Liu, Haolin Zuo, Zheng Lian, Xiaofen Xing, Björn W. Schuller, Haizhou Li

    Abstract: Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history, while inferring the emotions and intents simultaneously for the current utterance. MC-EIU is enabling technology for many human-computer interfaces. However, there is a lack of available datasets in terms of annotation, modality, lang…

    Submitted 4 July, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: 26 pages, 8 figures, 12 tables, NeurIPS 2024 Dataset and Benchmark Track

  41. arXiv:2407.01143  [pdf, other]

    cs.SD cs.AI eess.AS

    Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition

    Authors: Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn Schuller

    Abstract: Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Rel… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: accepted for Interspeech 2024, 5 pages

  42. arXiv:2406.17667  [pdf, other

    cs.SD cs.CL eess.AS

    This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach

    Authors: Lukas Christ, Shahin Amiriparian, Friederike Hawighorst, Ann-Kathrin Schill, Angelo Boutalikakis, Lorenz Graf-Vlachy, Andreas König, Björn W. Schuller

    Abstract: Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio textual dataset comprising 20 hours of… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  43. arXiv:2406.15119  [pdf, ps, other

    cs.SD cs.AI eess.AS

    Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation

    Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

    Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment… ▽ More

    Submitted 29 May, 2025; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2025

  44. arXiv:2406.10275  [pdf, other

    cs.CL

    ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

    Authors: Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

    Abstract: Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: accepted at INTERSPEECH 2024

    MSC Class: 68T10 ACM Class: I.2

  45. arXiv:2406.08517  [pdf, other

    eess.AS cs.SD

    DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition

    Authors: Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Bjoern Schuller

    Abstract: In ornithology, it's widely acknowledged that bird species display diverse dialects in their calls across different regions. Consequently, computational methods to identify bird species solely through their calls face significant challenges. There is growing interest in understanding the impact of species-specific dialects on the effectiveness of bird sp… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: accepted by Interspeech 2024

  46. arXiv:2406.07753  [pdf, ps, other

    cs.AI cs.CL

    The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition

    Authors: Shahin Amiriparian, Lukas Christ, Alexander Kathan, Maurice Gerczuk, Niklas Müller, Steffen Klug, Lukas Stappen, Andreas König, Erik Cambria, Björn Schuller, Simone Eulitz

    Abstract: The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems: In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals such as assertiveness, dominance, likability, and sincerity based on the provided audio-visual data. The Cross-Cultural Humor Detection… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    MSC Class: 68T10 ACM Class: I.2

  47. arXiv:2406.07203  [pdf, other

    cs.SD eess.AS

    ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks

    Authors: Xin Jing, Andreas Triantafyllopoulos, Björn Schuller

    Abstract: Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for ge… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  48. Enrolment-based personalisation for improving individual-level fairness in speech emotion recognition

    Authors: Andreas Triantafyllopoulos, Björn Schuller

    Abstract: The expression of emotion is highly individualistic. However, contemporary speech emotion recognition (SER) systems typically rely on population-level models that adopt a 'one-size-fits-all' approach for predicting emotion. Moreover, standard evaluation practices measure performance also on the population level, thus failing to characterise how models work across different speakers. In the present… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  49. INTERSPEECH 2009 Emotion Challenge Revisited: Benchmarking 15 Years of Progress in Speech Emotion Recognition

    Authors: Andreas Triantafyllopoulos, Anton Batliner, Simon Rampp, Manuel Milling, Björn Schuller

    Abstract: We revisit the INTERSPEECH 2009 Emotion Challenge -- the first ever speech emotion recognition (SER) challenge -- and evaluate a series of deep learning models that are representative of the major advances in SER research in the time since then. We start by training each model using a fixed set of hyperparameters, and further fine-tune the best-performing models of that initial setup with a grid s… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  50. Sustained Vowels for Pre- vs Post-Treatment COPD Classification

    Authors: Andreas Triantafyllopoulos, Anton Batliner, Wolfgang Mayr, Markus Fendler, Florian Pokorny, Maurice Gerczuk, Shahin Amiriparian, Thomas Berghaus, Björn Schuller

    Abstract: Chronic obstructive pulmonary disease (COPD) is a serious inflammatory lung disease affecting millions of people around the world. Due to an obstructed airflow from the lungs, it also becomes manifest in patients' vocal behaviour. Of particular importance is the detection of an exacerbation episode, which marks an acute phase and often requires hospitalisation and treatment. Previous work has show… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024