RiTTA: Modeling Event Relations in Text-to-Audio Generation

Authors: Yuhang He, Yash Jain, Xubo Liu, Andrew Markham, Vibhav Vineet

Abstract: Despite significant advancements in Text-to-Audio (TTA) generation models, which achieve high-fidelity audio with fine-grained context understanding, these models struggle to model the relations between audio events described in the input text. Previous TTA methods have not systematically explored audio event relation modeling, nor have they proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard audios; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models' ability to model audio event relations. Code is available at: https://github.com/yuhanghe01/RiTTA

Submitted 4 January, 2025; v1 submitted 20 December, 2024; originally announced December 2024.

Comments: Project Site: https://yuhanghe01.github.io/RiTTA-Proj/. Code: https://github.com/yuhanghe01/RiTTA

arXiv:2207.01398

Large-scale Robustness Analysis of Video Action Recognition Models

Authors: Madeline Chantry Schiappa, Naman Biyani, Prudvi Kamtam, Shruti Vyas, Hamid Palangi, Vibhav Vineet, Yogesh Rawat

Abstract: We have seen great progress in video action recognition in recent years. Several models based on convolutional neural networks (CNNs) and some recent transformer-based approaches provide top performance on existing benchmarks. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We focus on robustness against real-world distribution shift perturbations instead of adversarial perturbations. We propose four different benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P, to perform this analysis. We study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) transformer-based models are consistently more robust than CNN-based models, 2) pretraining improves robustness for transformer-based models more than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations on all datasets but SSv2, suggesting that the importance of temporal information for action recognition varies with the dataset and activities. Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition.

Submitted 7 April, 2023; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: Accepted in 2023 Conference on Computer Vision and Pattern Recognition (CVPR)

arXiv:2010.10691

Prediction of Object Geometry from Acoustic Scattering Using Convolutional Neural Networks

Authors: Ziqi Fan, Vibhav Vineet, Chenshen Lu, T. W. Wu, Kyla McMullen

Abstract: Acoustic scattering is strongly influenced by the boundary geometry of the objects over which sound scatters. The present work proposes a method to infer object geometry from scattering features by training convolutional neural networks. The training data is generated from a fast numerical solver developed on CUDA. The complete set of simulations is sampled to generate multiple datasets containing different numbers of channels and diverse image resolutions. The robustness of our approach to data degradation is evaluated by comparing the performance of networks trained on datasets with varying levels of data degradation. The present work has found that the predictions made by our models match ground truth with high accuracy. In addition, accuracy does not degrade when fewer data channels or lower resolutions are used.

Submitted 10 February, 2021; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: Accepted by ICASSP 2021

arXiv:1911.01802

Fast acoustic scattering using convolutional neural networks

Authors: Ziqi Fan, Vibhav Vineet, Hannes Gamper, Nikunj Raghuvanshi

Abstract: Diffracted scattering and occlusion are important acoustic effects in interactive auralization and noise control applications, typically requiring expensive numerical simulation. We propose training a convolutional neural network to map from a convex scatterer's cross-section to a 2D slice of the resulting spatial loudness distribution. We show that employing a full-resolution residual network for the resulting image-to-image regression problem yields spatially detailed loudness fields with a root-mean-squared error of less than 1 dB, at over 100x speedup compared to full wave simulation.

Submitted 15 February, 2020; v1 submitted 30 October, 2019; originally announced November 2019.

Comments: Accepted by ICASSP 2020

Showing 1–4 of 4 results for author: Vineet, V