-
Fine-Grained control over Music Generation with Activation Steering
Authors:
Dipanshu Panda,
Jayden Koshy Joe,
Harshith M R,
Swathi Narashiman,
Pranay Mathur,
Anish Veerakumar,
Aniruddh Krishna,
Keerthiharan A
Abstract:
We present a method for fine-grained control over music generation through inference-time interventions on an autoregressive generative music transformer called MusicGen. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using weights of linear probes trained on it, or by steering the attention layer activations in a similar manner. We observe that modelling this as a regression task improves performance, and we hypothesize that the mean-squared-error loss better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by text prompts in MusicGen, our method provides both global and local control over music generation. Audio samples illustrating our method are available at our demo page.
Submitted 11 June, 2025;
originally announced June 2025.
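The steering idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the activations are random toy data rather than MusicGen residual-stream activations, and the names `fit_linear_probe`, `steer`, and `alpha` are hypothetical. The sketch fits a linear probe by least squares (a mean-squared-error regression) and then shifts activations along the normalized probe weight vector.

```python
import numpy as np


def fit_linear_probe(acts, targets):
    """Fit targets ~ acts @ w + b by ordinary least squares (MSE)."""
    # Append a column of ones so the bias is learned with the weights.
    design = np.hstack([acts, np.ones((acts.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef[:-1], coef[-1]


def steer(acts, w, alpha):
    """Shift every activation by alpha along the unit-norm probe direction."""
    direction = w / np.linalg.norm(w)
    return acts + alpha * direction


# Toy stand-in for activations; a real setup would read them from the model.
rng = np.random.default_rng(0)
dim = 16
true_dir = rng.normal(size=dim)
acts = rng.normal(size=(200, dim))
targets = acts @ true_dir + 0.01 * rng.normal(size=200)

w, b = fit_linear_probe(acts, targets)
steered = steer(acts, w, alpha=4.0)

before = acts @ w + b      # probe readout on the original activations
after = steered @ w + b    # probe readout after steering
```

Because the shift is along the probe's own weight direction, the probe readout rises by exactly `alpha * ||w||` for every sample; how such a shift changes generated audio is an empirical question that this toy example does not address.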
-
Zero-resource Speech Translation and Recognition with LLMs
Authors:
Karel Mundnich,
Xing Niu,
Prashant Mathur,
Srikanth Ronanki,
Brady Houston,
Veera Raghavendra Elluru,
Nilaksh Das,
Zejiang Hou,
Goeric Huybrechts,
Anshu Bhatia,
Daniel Garcia-Romero,
Kyu J. Han,
Katrin Kirchhoff
Abstract:
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and what data has the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. We finally show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
Submitted 30 December, 2024; v1 submitted 24 December, 2024;
originally announced December 2024.
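For readers unfamiliar with the WER figure quoted in the abstract above, the following is a minimal sketch of word error rate as word-level edit distance divided by reference length. The function name `word_error_rate` is hypothetical, and the whitespace tokenization is an assumption; the paper's own text normalization and scoring tools are not described in this listing.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Since insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.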
-
Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Authors:
Lucas Goncalves,
Prashant Mathur,
Xing Niu,
Brady Houston,
Chandrashekhar Lavania,
Srikanth Vishnubhotla,
Lijia Sun,
Anthony Ferritto
Abstract:
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony (ensuring that the movements of the lips match the spoken content), which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video, without any degradation in translation quality.
Submitted 21 December, 2024;
originally announced December 2024.
-
SpeechVerse: A Large-scale Generalizable Audio Language Model
Authors:
Nilaksh Das,
Saket Dingliwal,
Srikanth Ronanki,
Rohit Paturi,
Zhaocheng Huang,
Prashant Mathur,
Jie Yuan,
Dhanush Bekal,
Xing Niu,
Sai Muralidhar Jayanthi,
Xilai Li,
Karel Mundnich,
Monica Sunkara,
Sravan Bodapati,
Sundararajan Srinivasan,
Kyu J Han,
Katrin Kirchhoff
Abstract:
Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.
Submitted 24 March, 2025; v1 submitted 13 May, 2024;
originally announced May 2024.
-
PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores
Authors:
Lucas Goncalves,
Prashant Mathur,
Chandrashekhar Lavania,
Metehan Cekic,
Marcello Federico,
Kyu J. Han
Abstract:
Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, this growth is not attributable solely to models and benchmarks. Universally accepted evaluation metrics also play an important role in advancing the field. While there are many metrics available to evaluate audio and visual content separately, there is a lack of metrics that offer a quantitative and interpretable measure of audio-visual synchronization for videos "in the wild". To address this gap, we first created a large-scale human-annotated dataset (100+ hrs) representing nine types of synchronization errors in audio-visual content and how humans perceive them. We then developed the PEAVS (Perceptual Evaluation of Audio-Visual Synchrony) score, a novel automatic metric with a 5-point scale that evaluates the quality of audio-visual synchronization. We validate PEAVS using a newly generated dataset, achieving a Pearson correlation of 0.79 at the set level and 0.54 at the clip level when compared to human labels. In our experiments, we observe a relative gain of 50% over a natural extension of Fréchet-based metrics for audio-visual synchrony, confirming the efficacy of PEAVS in objectively modeling subjective perceptions of audio-visual synchronization for videos "in the wild".
Submitted 10 April, 2024;
originally announced April 2024.
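The abstract above reports agreement with human labels as a Pearson correlation. As a reminder of what that statistic measures, here is a minimal sketch with a hypothetical `pearson` helper. How metric scores and human ratings are paired or aggregated at the set and clip levels is not specified in this listing, so the sketch covers only the correlation itself.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centred products; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 means the automatic scores rise and fall linearly with the human ratings; a value near 0 means no linear relationship.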
-
End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation
Authors:
Juan Zuluaga-Gomez,
Zhaocheng Huang,
Xing Niu,
Rohit Paturi,
Sundararajan Srinivasan,
Prashant Mathur,
Brian Thompson,
Marcello Federico
Abstract:
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations by multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training.
Submitted 1 November, 2023;
originally announced November 2023.
-
Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters
Authors:
Proyag Pal,
Brian Thompson,
Yogesh Virkar,
Prashant Mathur,
Alexandra Chronopoulou,
Marcello Federico
Abstract:
To translate speech for automatic dubbing, machine translation needs to be isochronous, i.e. translated speech needs to be aligned with the source in terms of speech durations. We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences. We also introduce auxiliary counters to help the decoder to keep track of the timing information while generating target phonemes. We show that our model improves translation quality and isochrony compared to previous work where the translation model is instead trained to predict interleaved sequences of phonemes and durations.
Submitted 22 May, 2023;
originally announced May 2023.
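The contrast drawn in the abstract above, between interleaved phoneme/duration sequences and durations predicted as a target factor, can be pictured with a small data-layout sketch. The helper names, the phoneme symbols, and the duration values (imagined as frame counts) are all hypothetical and are not taken from the paper.

```python
def interleave(phonemes, durations):
    """One flat target sequence: phoneme, duration, phoneme, duration, ..."""
    if len(phonemes) != len(durations):
        raise ValueError("each phoneme needs exactly one duration")
    sequence = []
    for phoneme, duration in zip(phonemes, durations):
        sequence.extend([phoneme, str(duration)])
    return sequence


def as_factors(phonemes, durations):
    """Parallel streams: one target position carries (phoneme, duration)."""
    if len(phonemes) != len(durations):
        raise ValueError("each phoneme needs exactly one duration")
    return list(zip(phonemes, durations))


# Hypothetical example values.
phonemes = ["HH", "AH", "L", "OW"]
durations = [3, 5, 4, 7]

flat = interleave(phonemes, durations)      # length 2 * len(phonemes)
factored = as_factors(phonemes, durations)  # length len(phonemes)
```

In the interleaved layout the decoder emits twice as many tokens, whereas the factored layout keeps one position per phoneme with the duration attached to it.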
-
Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing
Authors:
Alexandra Chronopoulou,
Brian Thompson,
Prashant Mathur,
Yogesh Virkar,
Surafel M. Lakew,
Marcello Federico
Abstract:
Automatic dubbing (AD) is the task of translating the original speech in a video into target-language speech. The new target-language speech should satisfy isochrony; that is, the new speech should be time-aligned with the original video, including mouth movements, pauses, hand gestures, etc. In this paper, we propose training a model that directly optimizes both the translation and the speech duration of the generated translations. We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
Submitted 24 February, 2023;
originally announced February 2023.
-
Dynamic Graph Modeling of Simultaneous EEG and Eye-tracking Data for Reading Task Identification
Authors:
Puneet Mathur,
Trisha Mittal,
Dinesh Manocha
Abstract:
We present a new approach, which we call AdaGTCN, for identifying human reader intent from Electroencephalogram (EEG) and Eye movement (EM) data in order to help differentiate between normal reading and task-oriented reading. Understanding the physiological aspects of the reading process (the cognitive load and the reading intent) can help improve the quality of crowd-sourced annotated data. Our method, Adaptive Graph Temporal Convolution Network (AdaGTCN), uses an Adaptive Graph Learning Layer and a Deep Neighborhood Graph Convolution Layer for identifying the reading activities using time-locked EEG sequences recorded during word-level eye-movement fixations. The Adaptive Graph Learning Layer dynamically learns the spatial correlations between the EEG electrode signals, while the Deep Neighborhood Graph Convolution Layer exploits temporal features from a dense graph neighborhood to establish the state of the art in reading task identification over other contemporary approaches. We compare our approach with several baselines to report an improvement of 6.29% on the ZuCo 2.0 dataset, along with extensive ablation experiments.
Submitted 21 February, 2021;
originally announced February 2021.
-
Smart Laptop Bag with Machine Learning for Activity Recognition
Authors:
Dwij Sukeshkumar Sheth,
Shantanu Singh,
Prakhar S Mathur,
Vydeki D
Abstract:
In today's world of smart living, the smart laptop bag presented in this paper provides a better solution for keeping track of our precious possessions and monitoring them in real time. As the world moves in a more tech-savvy direction, the novel laptop bag discussed here lets the user perform location tracking, ambiance monitoring, user-state monitoring, etc. in one device. The innovative design uses cloud computing and machine learning algorithms to monitor the health of the user and many parameters of the bag. The emergency alert system in this bag could be trained to send appropriate notifications to emergency contacts of the user in case of abnormal health conditions or theft of the bag. The experimental smart laptop bag uses a deep neural network, trained and tested on the various parameters from the bag, that achieves above 95% accuracy.
Submitted 14 April, 2019;
originally announced April 2019.