-
Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
Authors:
Ali Vosoughi,
Dimitra Emmanouilidou,
Hannes Gamper
Abstract:
Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundatio…
▽ More
Integrating audio and visual data for training multimodal foundational models remains challenging. We present Audio-Video Vector Alignment (AVVA), which aligns audiovisual (AV) scene content beyond mere temporal synchronization via a Large Language Model (LLM)-based data curation pipeline. Specifically, AVVA scores and selects high-quality training clips using Whisper (speech-based audio foundation model) for audio and DINOv2 for video within a dual-encoder contrastive learning framework. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate that this approach can achieve significant accuracy gains with substantially less curated data. For instance, AVVA yields a 7.6% improvement in top-1 accuracy for audio-to-video retrieval on VGGSound compared to ImageBind, despite training on only 192 hours of carefully filtered data (vs. 5800+ hours). Moreover, an ablation study highlights that trading data quantity for data quality improves performance, yielding respective top-3 accuracy increases of 47.8, 48.4, and 58.0 percentage points on AudioCaps, VALOR, and VGGSound over uncurated baselines. While these results underscore AVVA's data efficiency, we also discuss the overhead of LLM-driven curation and how it may be scaled or approximated in larger domains. Overall, AVVA provides a viable path toward more robust, text-free audiovisual learning with improved retrieval accuracy.
△ Less
Submitted 13 March, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Addressing Emotion Bias in Music Emotion Recognition and Generation with Frechet Audio Distance
Authors:
Yuanchao Li,
Azalea Gui,
Dimitra Emmanouilidou,
Hannes Gamper
Abstract:
The complex nature of musical emotion introduces inherent bias in both recognition and generation, particularly when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside Frechet Audio Distance (FAD), a reference-free evaluation…
▽ More
The complex nature of musical emotion introduces inherent bias in both recognition and generation, particularly when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside Frechet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations of using a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance using FAD derived from multiple encoders to provide a more objective measure of musical emotion. Furthermore, we introduce an enhanced EMG approach designed to improve both the variability and prominence of generated musical emotion, thereby enhancing its realism. Additionally, we investigate the differences in realism between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the issue of emotion bias in both MER and EMG and demonstrate the potential of using FAD and diverse audio encoders to evaluate musical emotion more objectively and effectively.
△ Less
Submitted 30 April, 2025; v1 submitted 23 September, 2024;
originally announced September 2024.
-
Adapting Frechet Audio Distance for Generative Music Evaluation
Authors:
Azalea Gui,
Hannes Gamper,
Sebastian Braun,
Dimitra Emmanouilidou
Abstract:
The growing popularity of generative music models underlines the need for perceptually relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly used for this purpose even though its correlation with perceptual quality is understudied. We show that FAD performance may be hampered by sample size bias, poor choice of audio embeddings, or the use of biased or low-quality…
▽ More
The growing popularity of generative music models underlines the need for perceptually relevant, objective music quality metrics. The Frechet Audio Distance (FAD) is commonly used for this purpose even though its correlation with perceptual quality is understudied. We show that FAD performance may be hampered by sample size bias, poor choice of audio embeddings, or the use of biased or low-quality reference sets. We propose reducing sample size bias by extrapolating scores towards an infinite sample size. Through comparisons with MusicCaps labels and a listening test we identify audio embeddings and music reference sets that yield FAD scores well-correlated with acoustic and musical quality. Our results suggest that per-song FAD can be useful to identify outlier samples and predict perceptual quality for a range of music sets and generative models. Finally, we release a toolkit that allows adapting FAD for generative music evaluation.
△ Less
Submitted 5 March, 2024; v1 submitted 2 November, 2023;
originally announced November 2023.
-
Training Audio Captioning Models without Audio
Authors:
Soham Deshmukh,
Benjamin Elizalde,
Dimitra Emmanouilidou,
Bhiksha Raj,
Rita Singh,
Huaming Wang
Abstract:
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an a…
▽ More
Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter, during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
△ Less
Submitted 13 September, 2023;
originally announced September 2023.
-
Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics
Authors:
Sandeep Kothinti,
Dimitra Emmanouilidou
Abstract:
The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal translation task that focuses on generating natural descriptions from sound events occurring in an audio stream. In this work, we identify and improve on three main c…
▽ More
The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal translation task that focuses on generating natural descriptions from sound events occurring in an audio stream. In this work, we identify and improve on three main challenges in automated audio captioning: i) data scarcity, ii) imbalance or limitations in the audio captions vocabulary, and iii) the proper performance evaluation metric that can best capture both auditory and semantic characteristics. We find that generally adopted loss functions can result in an unfair vocabulary imbalance during model training. We propose two audio captioning augmentation methods that enrich the training dataset and the vocabulary size. We further underline the need for in-domain pretraining by exploring the suitability of audio encoders that were previously trained on different audio tasks. Finally, we systematically explore five performance metrics borrowed from the image captioning domain and highlight their limitations for the audio domain.
△ Less
Submitted 3 May, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
CURE Dataset: Ladder Networks for Audio Event Classification
Authors:
Harishchandra Dubey,
Dimitra Emmanouilidou,
Ivan J. Tashev
Abstract:
Audio event classification is an important task for several applications such as surveillance, audio, video and multimedia retrieval etc. There are approximately 3M people with hearing loss who can't perceive events happening around them. This paper establishes the CURE dataset which contains curated set of specific audio events most relevant for people with hearing loss. We propose a ladder netwo…
▽ More
Audio event classification is an important task for several applications such as surveillance, audio, video and multimedia retrieval etc. There are approximately 3M people with hearing loss who can't perceive events happening around them. This paper establishes the CURE dataset which contains curated set of specific audio events most relevant for people with hearing loss. We propose a ladder network based audio event classifier that utilizes 5s sound recordings derived from the Freesound project. We adopted the state-of-the-art convolutional neural network (CNN) embeddings as audio features for this task. We also investigate extreme learning machine (ELM) for event classification. In this study, proposed classifiers are compared with support vector machine (SVM) baseline. We propose signal and feature normalization that aims to reduce the mismatch between different recordings scenarios. Firstly, CNN is trained on weakly labeled Audioset data. Next, the pre-trained model is adopted as feature extractor for proposed CURE corpus. We incorporate ESC-50 dataset as second evaluation set. Results and discussions validate the superiority of Ladder network over ELM and SVM classifier in terms of robustness and increased classification accuracy. While Ladder network is robust to data mismatches, simpler SVM and ELM classifiers are sensitive to such mismatches, where the proposed normalization techniques can play an important role. Experimental studies with ESC-50 and CURE corpora elucidate the differences in dataset complexity and robustness offered by proposed approaches.
△ Less
Submitted 12 January, 2020;
originally announced January 2020.
-
Predicting word error rate for reverberant speech
Authors:
Hannes Gamper,
Dimitra Emmanouilidou,
Sebastian Braun,
Ivan J. Tashev
Abstract:
Reverberation negatively impacts the performance of automatic speech recognition (ASR). Prior work on quantifying the effect of reverberation has shown that clarity (C50), a parameter that can be estimated from the acoustic impulse response, is correlated with ASR performance. In this paper we propose predicting ASR performance in terms of the word error rate (WER) directly from acoustic parameter…
▽ More
Reverberation negatively impacts the performance of automatic speech recognition (ASR). Prior work on quantifying the effect of reverberation has shown that clarity (C50), a parameter that can be estimated from the acoustic impulse response, is correlated with ASR performance. In this paper we propose predicting ASR performance in terms of the word error rate (WER) directly from acoustic parameters via a polynomial, sigmoidal, or neural network fit, as well as blindly from reverberant speech samples using a convolutional neural network (CNN). We carry out experiments on two state-of-the-art ASR models and a large set of acoustic impulse responses (AIRs). The results confirm C50 and C80 to be highly correlated with WER, allowing WER to be predicted with the proposed fitting approaches. The proposed non-intrusive CNN model outperforms C50-based WER prediction, indicating that WER can be estimated blindly, i.e., directly from the reverberant speech samples without knowledge of the acoustic parameters.
△ Less
Submitted 14 February, 2020; v1 submitted 1 November, 2019;
originally announced November 2019.