Showing 1–7 of 7 results for author: Emmanouilidou, D

  1. arXiv:2503.09205   

    cs.MM cs.CL cs.IR cs.SD eess.AS

    Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model

    Authors: Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper

    Abstract: Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundatio…

    Submitted 13 March, 2025; v1 submitted 12 March, 2025; originally announced March 2025.

    Comments: We are withdrawing this version due to the need for substantial updates in scope and organization, which affect the clarity and completeness of the manuscript. We plan to submit a revised version that incorporates these changes

    MSC Class: 68T; 68T45; 68T10

  2. arXiv:2409.15545  [pdf, other]

    eess.AS cs.CL cs.MM cs.SD

    Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance

    Authors: Yuanchao Li, Azalea Gui, Dimitra Emmanouilidou, Hannes Gamper

    Abstract: The complex nature of musical emotion introduces inherent bias in both recognition and generation, particularly when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside Frechet Audio Distance (FAD), a reference-free evaluation…

    Submitted 30 April, 2025; v1 submitted 23 September, 2024; originally announced September 2024.

    Comments: Accepted to ICME 2025

  3. arXiv:2311.01616  [pdf, ps, other]

    eess.AS

    Adapting Frechet Audio Distance for Generative Music Evaluation

    Authors: Azalea Gui, Hannes Gamper, Sebastian Braun, Dimitra Emmanouilidou

    Abstract: The growing popularity of generative music models underlines the need for perceptually relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly used for this purpose even though its correlation with perceptual quality is understudied. We show that FAD performance may be hampered by sample size bias, poor choice of audio embeddings, or the use of biased or low-quality…

    Submitted 5 March, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: Submitted to IEEE ICASSP 2024

  4. arXiv:2309.07372  [pdf, other]

    eess.AS cs.SD

    Training Audio Captioning Models without Audio

    Authors: Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

    Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an a…

    Submitted 13 September, 2023; originally announced September 2023.

  5. arXiv:2211.06547  [pdf, other]

    eess.AS cs.SD

    Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

    Authors: Sandeep Kothinti, Dimitra Emmanouilidou

    Abstract: The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal translation task that focuses on generating natural descriptions from sound events occurring in an audio stream. In this work, we identify and improve on three main c…

    Submitted 3 May, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

    Comments: Submitted to EUSIPCO 2023

  6. arXiv:2001.03896  [pdf, other]

    cs.SD cs.LG eess.AS

    CURE Dataset: Ladder Networks for Audio Event Classification

    Authors: Harishchandra Dubey, Dimitra Emmanouilidou, Ivan J. Tashev

    Abstract: Audio event classification is an important task for several applications such as surveillance, audio, video and multimedia retrieval etc. There are approximately 3M people with hearing loss who can't perceive events happening around them. This paper establishes the CURE dataset which contains a curated set of specific audio events most relevant for people with hearing loss. We propose a ladder netwo…

    Submitted 12 January, 2020; originally announced January 2020.

    Comments: 6 pages, 2 Figures

  7. arXiv:1911.00566  [pdf, other]

    eess.AS cs.SD

    Predicting word error rate for reverberant speech

    Authors: Hannes Gamper, Dimitra Emmanouilidou, Sebastian Braun, Ivan J. Tashev

    Abstract: Reverberation negatively impacts the performance of automatic speech recognition (ASR). Prior work on quantifying the effect of reverberation has shown that clarity (C50), a parameter that can be estimated from the acoustic impulse response, is correlated with ASR performance. In this paper we propose predicting ASR performance in terms of the word error rate (WER) directly from acoustic parameter…

    Submitted 14 February, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

    Comments: Presented at IEEE 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2020)