Showing 1–50 of 148 results for author: Schuller, B

Searching in archive eess. Search in all archives.

  1. arXiv:2505.24493  [pdf, ps, other]

    cs.AI cs.SD eess.AS

    MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

    Authors: Xin Jing, Jiadong Wang, Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

    Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, Large Language Models (LLMs) have emerged as a scala… ▽ More

    Submitted 30 May, 2025; originally announced May 2025.

  2. arXiv:2505.13814  [pdf, ps, other]

    eess.AS cs.AI cs.SD

    Articulatory Feature Prediction from Surface EMG during Speech Production

    Authors: Jihwan Lee, Kevin Huang, Kleanthis Avramidis, Simon Pistrosch, Monica Gonzalez-Machorro, Yoonjeong Lee, Björn Schuller, Louis Goldstein, Shrikanth Narayanan

    Abstract: We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that t… ▽ More

    Submitted 28 May, 2025; v1 submitted 19 May, 2025; originally announced May 2025.

    Comments: Accepted for Interspeech2025

  3. arXiv:2504.20776  [pdf]

    cs.SD cs.AI eess.AS

    ECOSoundSet: a finely annotated dataset for the automated acoustic identification of Orthoptera and Cicadidae in North, Central and temperate Western Europe

    Authors: David Funosas, Elodie Massol, Yves Bas, Svenja Schmidt, Dominik Arend, Alexander Gebhard, Luc Barbaro, Sebastian König, Rafael Carbonell Font, David Sannier, Fernand Deroussen, Jérôme Sueur, Christian Roesti, Tomi Trilar, Wolfgang Forstmeier, Lucas Roger, Eloïsa Matheu, Piotr Guzik, Julien Barataud, Laurent Pelozuelo, Stéphane Puissant, Sandra Mueller, Björn Schuller, Jose M. Montoya, Andreas Triantafyllopoulos, et al. (1 additional author not shown)

    Abstract: Currently available tools for the automated acoustic recognition of European insects in natural soundscapes are limited in scope. Large and ecologically heterogeneous acoustic datasets are currently needed for these algorithms to cross-contextually recognize the subtle and complex acoustic signatures produced by each species, thus making the availability of such datasets a key requisite for their… ▽ More

    Submitted 29 April, 2025; originally announced April 2025.

    Comments: 3 Figures + 2 Supplementary Figures, 2 Tables + 3 Supplementary Tables

  4. arXiv:2503.20919  [pdf, other]

    cs.CL cs.SD eess.AS

    GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

    Authors: Yupei Li, Qiyang Sun, Sunil Munthumoduku Krishna Murthy, Emran Alturki, Björn W. Schuller

    Abstract: Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverage… ▽ More

    Submitted 26 March, 2025; originally announced March 2025.

  5. arXiv:2503.09368  [pdf, other]

    cs.CV eess.IV

    PerCoV2: Improved Ultra-Low Bit-Rate Perceptual Image Compression with Implicit Hierarchical Masked Image Modeling

    Authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller

    Abstract: We introduce PerCoV2, a novel and open ultra-low bit-rate perceptual image compression system designed for bandwidth- and storage-constrained applications. Building upon prior work by Careil et al., PerCoV2 extends the original formulation to the Stable Diffusion 3 ecosystem and enhances entropy coding efficiency by explicitly modeling the discrete hyper-latent image distribution. To this end, we… ▽ More

    Submitted 12 March, 2025; originally announced March 2025.

  6. arXiv:2501.12122  [pdf, other]

    cs.SD eess.AS

    DOTA-ME-CS: Daily Oriented Text Audio-Mandarin English-Code Switching Dataset

    Authors: Yupei Li, Zifan Wei, Heng Yu, Huichi Zhou, Björn W. Schuller

    Abstract: Code-switching, the alternation between two or more languages within communication, poses great challenges for Automatic Speech Recognition (ASR) systems. Existing models and datasets are limited in their ability to effectively handle these challenges. To address this gap and foster progress in code-switching ASR research, we introduce the DOTA-ME-CS: Daily oriented text audio Mandarin-English cod… ▽ More

    Submitted 21 January, 2025; originally announced January 2025.

  7. arXiv:2501.12050  [pdf, other]

    cs.LG cs.SD eess.AS

    Representation Learning with Parameterised Quantum Circuits for Advancing Speech Emotion Recognition

    Authors: Thejan Rajapakshe, Rajib Rana, Farina Riaz, Sara Khalifa, Björn W. Schuller

    Abstract: Speech Emotion Recognition (SER) is a complex and challenging task in human-computer interaction due to the intricate dependencies of features and the overlapping nature of emotional expressions conveyed through speech. Although traditional deep learning methods have shown effectiveness, they often struggle to capture subtle emotional variations and overlapping states. This paper introduces a hybr… ▽ More

    Submitted 28 January, 2025; v1 submitted 21 January, 2025; originally announced January 2025.

  8. arXiv:2501.10525  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids

    Authors: Iosif Tsangko, Andreas Triantafyllopoulos, Michael Müller, Hendrik Schröter, Björn W. Schuller

    Abstract: The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a `one-size-fits-all' approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper it… ▽ More

    Submitted 23 January, 2025; v1 submitted 17 January, 2025; originally announced January 2025.

    Comments: Accepted at ICASSP 2025. 5 pages, 3 figures

    ACM Class: I.2.6; H.5.5; I.5.1; I.4.8

  9. arXiv:2501.04292  [pdf, ps, other]

    cs.SD cs.AI cs.LG eess.AS

    MADUV: The 1st INTERSPEECH Mice Autism Detection via Ultrasound Vocalization Challenge

    Authors: Zijiang Yang, Meishu Song, Xin Jing, Haojie Zhang, Kun Qian, Bin Hu, Kota Tamada, Toru Takumi, Björn W. Schuller, Yoshiharu Yamamoto

    Abstract: The Mice Autism Detection via Ultrasound Vocalization (MADUV) Challenge introduces the first INTERSPEECH challenge focused on detecting autism spectrum disorder (ASD) in mice through their vocalizations. Participants are tasked with developing models to automatically classify mice as either wild-type or ASD models based on recordings with a high sampling rate. Our baseline system employs a simple… ▽ More

    Submitted 31 May, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

    Comments: 5 pages, 1 figure and 2 tables. Submitted to INTERSPEECH 2025. For MADUV Challenge 2025

  10. arXiv:2412.13421  [pdf, other]

    cs.SD eess.AS

    Detecting Machine-Generated Music with Explainability -- A Challenge and Early Benchmarks

    Authors: Yupei Li, Qiyang Sun, Hanqian Li, Lucia Specia, Björn W. Schuller

    Abstract: Machine-generated music (MGM) has become a groundbreaking innovation with wide-ranging applications, such as music therapy, personalised editing, and creative inspiration within the music industry. However, the unregulated proliferation of MGM presents considerable challenges to the entertainment, education, and arts sectors by potentially undermining the value of high-quality human compositions.… ▽ More

    Submitted 17 December, 2024; originally announced December 2024.

  11. arXiv:2412.11943  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    autrainer: A Modular and Extensible Deep Learning Toolkit for Computer Audition Tasks

    Authors: Simon Rampp, Andreas Triantafyllopoulos, Manuel Milling, Björn W. Schuller

    Abstract: This work introduces the key operating principles for autrainer, our new deep learning training framework for computer audition tasks. autrainer is a PyTorch-based toolkit that allows for rapid, reproducible, and easily extensible training on a variety of different computer audition tasks. Concretely, autrainer offers low-code training and supports a wide range of neural networks as well as prepro… ▽ More

    Submitted 10 April, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

  12. arXiv:2412.11795  [pdf, other]

    cs.CL cs.SD eess.AS

    ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis

    Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller

    Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a f… ▽ More

    Submitted 19 December, 2024; v1 submitted 16 December, 2024; originally announced December 2024.

    Comments: Accepted by AAAI 2025

  13. arXiv:2412.06001  [pdf, other]

    cs.SD cs.MM eess.AS

    M6: Multi-generator, Multi-domain, Multi-lingual and cultural, Multi-genres, Multi-instrument Machine-Generated Music Detection Databases

    Authors: Yupei Li, Hanqian Li, Lucia Specia, Björn W. Schuller

    Abstract: Machine-generated music (MGM) has emerged as a powerful tool with applications in music therapy, personalised editing, and creative inspiration for the music community. However, its unregulated use threatens the entertainment, education, and arts sectors by diminishing the value of high-quality human compositions. Detecting machine-generated music (MGMD) is, therefore, critical to safeguarding the… ▽ More

    Submitted 8 December, 2024; originally announced December 2024.

  14. arXiv:2412.00571  [pdf, other]

    cs.SD eess.AS

    From Audio Deepfake Detection to AI-Generated Music Detection -- A Pathway and Overview

    Authors: Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller

    Abstract: As Artificial Intelligence (AI) technologies continue to evolve, their use in generating realistic, contextually appropriate content has expanded into various domains. Music, an art form and medium for entertainment, deeply rooted into human culture, is seeing an increased involvement of AI into its production. However, despite the effective application of AI music generation (AIGM) tools, the unr… ▽ More

    Submitted 10 December, 2024; v1 submitted 30 November, 2024; originally announced December 2024.

  15. arXiv:2412.00312  [pdf, other]

    cs.SD cs.AI eess.AS

    Raw Audio Classification with Cosine Convolutional Neural Network (CosCovNN)

    Authors: Kazi Nazmul Haque, Rajib Rana, Tasnim Jarin, Bjorn W. Schuller Jr

    Abstract: This study explores the field of audio classification from raw waveform using Convolutional Neural Networks (CNNs), a method that eliminates the need for extracting specialised features in the pre-processing step. Unlike recent trends in literature, which often focus on designing frontends or filters for only the initial layers of CNNs, our research introduces the Cosine Convolutional Neural Net… ▽ More

    Submitted 29 November, 2024; originally announced December 2024.

  16. arXiv:2411.11541  [pdf, other]

    cs.SD eess.AS

    Using voice analysis as an early indicator of risk for depression in young adults

    Authors: Klaus R. Scherer, Felix Burkhardt, Uwe D. Reichel, Florian Eyben, Björn W. Schuller

    Abstract: Increasingly frequent publications in the literature report voice quality differences between depressed patients and controls. Here, we examine the possibility of using voice analysis as an early warning signal for the development of emotion disturbances in young adults. As part of a major interdisciplinary European research project in four countries (ECoWeB), examining the effects of web-based pr… ▽ More

    Submitted 18 November, 2024; originally announced November 2024.

    Comments: Submitted to ToaC

  17. arXiv:2410.11120  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio-based Kinship Verification Using Age Domain Conversion

    Authors: Qiyang Sun, Alican Akman, Xin Jing, Manuel Milling, Björn W. Schuller

    Abstract: Audio-based kinship verification (AKV) is important in many domains, such as home security monitoring, forensic identification, and social network analysis. A key challenge in the task arises from differences in age across samples from different individuals, which can be interpreted as a domain bias in a cross-domain verification task. To address this issue, we design the notion of an "age-standar… ▽ More

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: 4 pages, 2 figures, submitted to IEEE Signal Processing Letters

    MSC Class: 68T10 ACM Class: I.5.4; I.2.6

  18. arXiv:2410.07530  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Explanation Synthesis with Generative Foundation Models

    Authors: Alican Akman, Qiyang Sun, Björn W. Schuller

    Abstract: The increasing success of audio foundation models across various tasks has led to a growing need for improved interpretability to understand their intricate decision-making processes better. Existing methods primarily focus on explaining these models by attributing importance to elements within the input space based on their influence on the final decision. In this paper, we introduce a novel audi… ▽ More

    Submitted 9 October, 2024; originally announced October 2024.

  19. arXiv:2409.06451  [pdf, other]

    cs.SD eess.AS

    Enhancing Emotional Text-to-Speech Controllability with Natural Language Guidance through Contrastive Learning and Diffusion Models

    Authors: Xin Jing, Kun Zhou, Andreas Triantafyllopoulos, Björn W. Schuller

    Abstract: While current emotional text-to-speech (TTS) systems can generate highly intelligible emotional speech, achieving fine control over emotion rendering of the output speech still remains a significant challenge. In this paper, we introduce ParaEVITS, a novel emotional TTS framework that leverages the compositionality of natural language to enhance control over emotional rendering. By incorporating a… ▽ More

    Submitted 10 September, 2024; originally announced September 2024.

  20. arXiv:2408.13920  [pdf, other]

    cs.SD eess.AS

    Wav2Small: Distilling Wav2Vec2 to 72K parameters for Low-Resource Speech emotion recognition

    Authors: Dionyssos Kounadis-Bastian, Oliver Schrüfer, Anna Derington, Hagen Wierstorf, Florian Eyben, Felix Burkhardt, Björn Schuller

    Abstract: Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to non converging consensus of annotator opinions. However, Concordance Correlati… ▽ More

    Submitted 22 November, 2024; v1 submitted 25 August, 2024; originally announced August 2024.

    Comments: apply review

  21. arXiv:2408.06264  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Enhancement for Computer Audition -- An Iterative Training Paradigm Using Sample Importance

    Authors: Manuel Milling, Shuo Liu, Andreas Triantafyllopoulos, Ilhan Aslan, Björn W. Schuller

    Abstract: Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solu… ▽ More

    Submitted 12 August, 2024; originally announced August 2024.

  22. Abusive Speech Detection in Indic Languages Using Acoustic Features

    Authors: Anika A. Spiesberger, Andreas Triantafyllopoulos, Iosif Tsangko, Björn W. Schuller

    Abstract: Abusive content in online social networks is a well-known problem that can cause serious psychological harm and incite hatred. The ability to upload audio data increases the importance of developing methods to detect abusive content in speech recordings. However, simply transferring the mechanisms from written abuse detection would ignore relevant information such as emotion and tone. In addition,… ▽ More

    Submitted 30 July, 2024; originally announced July 2024.

    Journal ref: Proc. INTERSPEECH 2023, 2683-2687

  23. arXiv:2407.15672  [pdf, other]

    cs.SD eess.AS

    Computer Audition: From Task-Specific Machine Learning to Foundation Models

    Authors: Andreas Triantafyllopoulos, Iosif Tsangko, Alexander Gebhard, Annamaria Mesaros, Tuomas Virtanen, Björn Schuller

    Abstract: Foundation models (FMs) are increasingly spearheading recent advances on a variety of tasks that fall under the purview of computer audition -- the use of machines to understand sounds. They feature several advantages over traditional pipelines: among others, the ability to consolidate multiple tasks in a single model, the option to leverage knowledge from other modalities, and the readily-availab… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

  24. arXiv:2407.11012  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Exploring Gender-Specific Speech Patterns in Automatic Suicide Risk Assessment

    Authors: Maurice Gerczuk, Shahin Amiriparian, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Björn W. Schuller

    Abstract: In emergency medicine, timely intervention for patients at risk of suicide is often hindered by delayed access to specialised psychiatric care. To bridge this gap, we introduce a speech-based approach for automatic suicide risk assessment. Our study involves a novel dataset comprising speech recordings of 20 patients who read neutral texts. We extract four speech representations encompassing inter… ▽ More

    Submitted 26 June, 2024; originally announced July 2024.

    Comments: accepted at INTERSPEECH 2024

    MSC Class: 68T10 ACM Class: J.3

  25. arXiv:2407.01143  [pdf, other]

    cs.SD cs.AI eess.AS

    Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition

    Authors: Oliver Schrüfer, Manuel Milling, Felix Burkhardt, Florian Eyben, Björn Schuller

    Abstract: Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Rel… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: accepted for Interspeech 2024, 5 pages

  26. arXiv:2406.17667  [pdf, other]

    cs.SD cs.CL eess.AS

    This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach

    Authors: Lukas Christ, Shahin Amiriparian, Friederike Hawighorst, Ann-Kathrin Schill, Angelo Boutalikakis, Lorenz Graf-Vlachy, Andreas König, Björn W. Schuller

    Abstract: Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio textual dataset comprising 20 hours of… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  27. arXiv:2406.15119  [pdf, ps, other]

    cs.SD cs.AI eess.AS

    Breaking Resource Barriers in Speech Emotion Recognition via Data Distillation

    Authors: Yi Chang, Zhao Ren, Zhonghao Zhao, Thanh Tam Nguyen, Kun Qian, Tanja Schultz, Björn W. Schuller

    Abstract: Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment… ▽ More

    Submitted 29 May, 2025; v1 submitted 21 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2025

  28. arXiv:2406.08517  [pdf, other]

    eess.AS cs.SD

    DB3V: A Dialect Dominated Dataset of Bird Vocalisation for Cross-corpus Bird Species Recognition

    Authors: Xin Jing, Luyang Zhang, Jiangjian Xie, Alexander Gebhard, Alice Baird, Bjoern Schuller

    Abstract: In ornithology, it's widely acknowledged that bird species display diverse dialects in their calls across different regions. Consequently, computational methods to identify bird species solely through their calls face significant challenges. There is growing interest in understanding the impact of species-specific dialects on the effectiveness of bird sp… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: accepted by Interspeech 2024

  29. arXiv:2406.07203  [pdf, other]

    cs.SD eess.AS

    ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks

    Authors: Xin Jing, Andreas Triantafyllopoulos, Björn Schuller

    Abstract: Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to `answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for ge… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  30. Audio-based Step-count Estimation for Running -- Windowing and Neural Network Baselines

    Authors: Philipp Wagner, Andreas Triantafyllopoulos, Alexander Gebhard, Björn Schuller

    Abstract: In recent decades, running has become an increasingly popular pastime activity due to its accessibility, ease of practice, and anticipated health benefits. However, the risk of running-related injuries is substantial for runners of different experience levels. Several common forms of injuries result from overuse -- extending beyond the recommended running time and intensity. Recently, audio-based… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at EUSIPCO 2024

  31. An automatic analysis of ultrasound vocalisations for the prediction of interaction context in captive Egyptian fruit bats

    Authors: Andreas Triantafyllopoulos, Alexander Gebhard, Manuel Milling, Simon Rampp, Björn Schuller

    Abstract: Prior work in computational bioacoustics has mostly focused on the detection of animal presence in a particular habitat. However, animal sounds contain much richer information than mere presence; among others, they encapsulate the interactions of those animals with other members of their species. Studying these interactions is almost impossible in a naturalistic setting, as the ground truth is oft… ▽ More

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: Accepted at EUSIPCO 2024

  32. arXiv:2405.03953  [pdf, other]

    cs.SD eess.AS

    Intelligent Cardiac Auscultation for Murmur Detection via Parallel-Attentive Models with Uncertainty Estimation

    Authors: Zixing Zhang, Tao Pang, Jing Han, Björn W. Schuller

    Abstract: Heart murmurs are a common manifestation of cardiovascular diseases and can provide crucial clues to early cardiac abnormalities. While most current research methods primarily focus on the accuracy of models, they often overlook other important aspects such as the interpretability of machine learning algorithms and the uncertainty of predictions. This paper introduces a heart murmur detection meth… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

    Journal ref: published at ICASSP 2024

  33. arXiv:2405.03952  [pdf, other]

    cs.SD cs.CL eess.AS

    HAFFormer: A Hierarchical Attention-Free Framework for Alzheimer's Disease Detection From Spontaneous Speech

    Authors: Zhongren Dong, Zixing Zhang, Weixiang Xu, Jing Han, Jianjun Ou, Björn W. Schuller

    Abstract: Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches rely heavily on Transformer architectures due to their efficiency in modelling long-range context dependencies. However, the quadratic increase in computational complexity associated with self-attention and the length of audio poses a challenge when deploying… ▽ More

    Submitted 6 May, 2024; originally announced May 2024.

    Journal ref: published at ICASSP 2024

  34. arXiv:2404.12132  [pdf, other]

    cs.SD cs.CL eess.AS

    Non-Invasive Suicide Risk Prediction Through Speech Analysis

    Authors: Shahin Amiriparian, Maurice Gerczuk, Justina Lutz, Wolfgang Strube, Irina Papazova, Alkomiet Hasan, Alexander Kathan, Björn W. Schuller

    Abstract: The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we collected… ▽ More

    Submitted 30 October, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

    ACM Class: I.2

  35. arXiv:2403.14083  [pdf, other]

    cs.SD cs.LG eess.AS

    emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

    Authors: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso

    Abstract: Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a pot… ▽ More

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE Transactions on Affective Computing on February 19, 2024. arXiv admin note: text overlap with arXiv:2305.14402

  36. arXiv:2402.01227  [pdf, other]

    cs.SD cs.AI cs.HC eess.AS

    STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition

    Authors: Yi Chang, Zhao Ren, Zixing Zhang, Xin Jing, Kun Qian, Xi Shao, Bin Hu, Tanja Schultz, Björn W. Schuller

    Abstract: Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has… ▽ More

    Submitted 2 February, 2024; originally announced February 2024.

  37. arXiv:2401.12925  [pdf, other]

    cs.SD eess.AS

    Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition

    Authors: Yan Zhao, Jincen Wang, Cheng Lu, Sunan Li, Björn Schuller, Yuan Zong, Wenming Zheng

    Abstract: Cross-corpus speech emotion recognition (SER) aims to transfer emotional knowledge from a labeled source corpus to an unlabeled corpus. However, prior methods require access to source data during adaptation, which is unattainable in real-life scenarios due to data privacy protection concerns. This paper tackles a more practical task, namely source-free cross-corpus SER, where a pre-trained source… ▽ More

    Submitted 23 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  38. arXiv:2401.09752  [pdf, other]

    cs.SD cs.LG eess.AS

    Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation

    Authors: Cheng Lu, Yuan Zong, Hailun Lian, Yan Zhao, Björn Schuller, Wenming Zheng

    Abstract: In speaker-independent speech emotion recognition, the training and testing samples are collected from diverse speakers, leading to a multi-domain shift challenge across the feature distributions of data from different speakers. Consequently, when the trained model is confronted with data from new speakers, its performance tends to degrade. To address the issue, we propose a Dynamic Joint Distribu… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: Accepted by ICASSP 2024

  39. arXiv:2312.06270  [pdf, other]

    eess.AS cs.SD

    Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models

    Authors: Anna Derington, Hagen Wierstorf, Ali Özkil, Florian Eyben, Felix Burkhardt, Björn W. Schuller

    Abstract: Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themse… ▽ More

    Submitted 12 February, 2025; v1 submitted 11 December, 2023; originally announced December 2023.

  40. arXiv:2309.16369  [pdf, other]

    cs.SD cs.LG eess.AS

    Bringing the Discussion of Minima Sharpness to the Audio Domain: a Filter-Normalised Evaluation for Acoustic Scene Classification

    Authors: Manuel Milling, Andreas Triantafyllopoulos, Iosif Tsangko, Simon David Noel Rampp, Björn Wolfgang Schuller

    Abstract: The correlation between the sharpness of loss minima and generalisation in the context of deep neural networks has been subject to discussion for a long time. Whilst mostly investigated in the context of selected benchmark data sets in the area of computer vision, we explore this aspect for the acoustic scene classification task of the DCASE2020 challenge data. Our analysis is based on two-dimensi… ▽ More

    Submitted 15 January, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

    Comments: This work has been submitted to the IEEE for possible publication

  41. arXiv:2309.15024  [pdf, other]

    cs.SD cs.LG eess.AS

    Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio

    Authors: Chia-Hsin Lin, Charles Jones, Björn W. Schuller, Harry Coppock

    Abstract: Despite significant advancements in deep learning for vision and natural language, unsupervised domain adaptation in audio remains relatively unexplored. We, in part, attribute this to the lack of an appropriate benchmark dataset. To address this gap, we present Synthia's melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies with user-specif…

    Submitted 26 September, 2023; originally announced September 2023.

  42. Exploring Meta Information for Audio-based Zero-shot Bird Classification

    Authors: Alexander Gebhard, Andreas Triantafyllopoulos, Teresa Bez, Lukas Christ, Alexander Kathan, Björn W. Schuller

    Abstract: Advances in passive acoustic monitoring and machine learning have led to the procurement of vast datasets for computational bioacoustic research. Nevertheless, data scarcity is still an issue for rare and underrepresented species. This study investigates how meta-information can improve zero-shot audio classification, utilising bird species as an example case study due to the availability of rich…

    Submitted 11 June, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  43. arXiv:2309.03244  [pdf, other]

    eess.IV cs.CV cs.LG

    EGIC: Enhanced Low-Bit-Rate Generative Image Compression Guided by Semantic Segmentation

    Authors: Nikolai Körber, Eduard Kromer, Andreas Siebert, Sascha Hauke, Daniel Mueller-Gritschneder, Björn Schuller

    Abstract: We introduce EGIC, an enhanced generative image compression method that allows traversing the distortion-perception curve efficiently from a single model. EGIC is based on two novel building blocks: i) OASIS-C, a conditional pre-trained semantic segmentation-guided discriminator, which provides both spatially and semantically-aware gradient feedback to the generator, conditioned on the latent imag…

    Submitted 16 July, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: ECCV 2024 Camera Ready

  44. arXiv:2308.12792  [pdf, other]

    cs.SD eess.AS

    Sparks of Large Audio Models: A Survey and Outlook

    Authors: Siddique Latif, Moazzam Shoukat, Fahad Shamshad, Muhammad Usama, Yi Ren, Heriberto Cuayáhuitl, Wenwu Wang, Xulong Zhang, Roberto Togneri, Erik Cambria, Björn W. Schuller

    Abstract: This survey paper provides a comprehensive overview of the recent advancements and challenges in applying large language models to the field of audio signal processing. Audio processing, with its diverse signal representations and a wide range of sources--from human voices to musical instruments and environmental sounds--poses challenges distinct from those found in traditional Natural Language Pr…

    Submitted 21 September, 2023; v1 submitted 24 August, 2023; originally announced August 2023.

    Comments: Under review, Repo URL: https://github.com/EmulationAI/awesome-large-audio-models

  45. arXiv:2308.11773  [pdf]

    cs.CL cs.CY cs.SD eess.AS q-bio.QM

    Identifying depression-related topics in smartphone-collected free-response speech recordings using an automatic speech recognition system and a deep learning topic model

    Authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf , et al. (3 additional authors not shown)

    Abstract: Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. So, natural language processing has been employed on social media to predict depression, but limitations remain: lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordi…

    Submitted 5 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

  46. arXiv:2307.06090  [pdf, other]

    cs.SD eess.AS

    Can Large Language Models Aid in Annotating Speech Emotional Data? Uncovering New Frontiers

    Authors: Siddique Latif, Muhammad Usama, Mohammad Ibrahim Malik, Björn W. Schuller

    Abstract: Despite recent advancements in speech emotion recognition (SER) models, state-of-the-art deep learning (DL) approaches face the challenge of the limited availability of annotated data. Large language models (LLMs) have revolutionised our understanding of natural language, introducing emergent properties that broaden comprehension in language, speech, and vision. This paper examines the potential o…

    Submitted 19 June, 2024; v1 submitted 12 July, 2023; originally announced July 2023.

    Comments: Accepted in IEEE Computational Intelligence Magazine

  47. arXiv:2307.02132  [pdf, other]

    cs.SD eess.AS

    Going Retro: Astonishingly Simple Yet Effective Rule-based Prosody Modelling for Speech Synthesis Simulating Emotion Dimensions

    Authors: Felix Burkhardt, Uwe Reichel, Florian Eyben, Björn Schuller

    Abstract: We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotations. Results indicate that with a very simple method…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Accepted at 34th ESSV 2023, Munich 2023

  48. arXiv:2306.16962  [pdf, other]

    cs.SD eess.AS

    Speech-based Age and Gender Prediction with Transformers

    Authors: Felix Burkhardt, Johannes Wagner, Hagen Wierstorf, Florian Eyben, Björn Schuller

    Abstract: We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). Compared to a modelling approach built on handcraft…

    Submitted 29 June, 2023; originally announced June 2023.

    Comments: 5 pages, submitted to 15th ITG Conference on Speech Communication

  49. arXiv:2305.14402  [pdf, other]

    cs.SD cs.LG eess.AS

    Enhancing Speech Emotion Recognition Through Differentiable Architecture Search

    Authors: Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Björn Schuller

    Abstract: Speech Emotion Recognition (SER) is a critical enabler of emotion-aware communication in human-computer interactions. Recent advancements in Deep Learning (DL) have substantially enhanced the performance of SER models through increased model complexity. However, designing optimal DL architectures requires prior experience and experimental evaluations. Encouragingly, Neural Architecture Search (NAS…

    Submitted 18 January, 2024; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, 4 figures

  50. arXiv:2305.14023  [pdf, other]

    cs.SD eess.AS

    Happy or Evil Laughter? Analysing a Database of Natural Audio Samples

    Authors: Aljoscha Düsterhöft, Felix Burkhardt, Björn W. Schuller

    Abstract: We conducted a data collection on the basis of the Google AudioSet database by selecting a subset of the samples annotated with "laughter". The selection criterion was the presence of a communicative act with a clear connotation of being either positive (laughing with) or negative (being laughed at). On the basis of this annotated data, we performed two experiments: on the one hand, we manually…

    Submitted 23 May, 2023; originally announced May 2023.