Showing 1–6 of 6 results for author: Arushi

Searching in archive eess.

  1. arXiv:2412.19351  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    ETTA: Elucidating the Design Space of Text-to-Audio Models

    Authors: Sang-gil Lee, Zhifeng Kong, Arushi Goel, Sungwon Kim, Rafael Valle, Bryan Catanzaro

    Abstract: Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic under…

    Submitted 26 December, 2024; originally announced December 2024.

  2. arXiv:2412.00760  [pdf, other]

    eess.AS cs.AI cs.CL cs.ET cs.LG

    Automating Feedback Analysis in Surgical Training: Detection, Categorization, and Assessment

    Authors: Firdavs Nasriddinov, Rafal Kocielnik, Arushi Gupta, Cherine Yang, Elyssa Wong, Anima Anandkumar, Andrew Hung

    Abstract: This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings, which is crucial for characterizing teaching tasks. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. However, a…

    Submitted 1 December, 2024; originally announced December 2024.

    Comments: Accepted as a proceedings paper at Machine Learning for Health 2024

    MSC Class: 68T50; 68U99; 68T99

    ACM Class: I.2; I.2.7; I.5.4; J.3; K.3.1

  3. arXiv:2404.07616  [pdf, other]

    cs.CL cs.SD eess.AS

    Audio Dialogues: Dialogues dataset for audio and music understanding

    Authors: Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

    Abstract: Existing datasets for audio understanding primarily focus on single-turn interactions (i.e. audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dial…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Demo website: https://audiodialogues.github.io/

  4. arXiv:2402.01831  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

    Authors: Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

    Abstract: Augmenting large language models (LLMs) to understand audio -- including non-speech sounds and non-verbal speech -- is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) stro…

    Submitted 28 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  5. arXiv:2210.06257  [pdf, other]

    cs.CV cs.LG eess.IV

    What can we learn about a generated image corrupting its latent representation?

    Authors: Agnieszka Tomczak, Aarushi Gupta, Slobodan Ilic, Nassir Navab, Shadi Albarqouni

    Abstract: Generative adversarial networks (GANs) offer an effective solution to the image-to-image translation problem, thereby allowing for new possibilities in medical imaging. They can translate images from one imaging modality to another at a low cost. For unpaired datasets, they rely mostly on cycle loss. Despite its effectiveness in learning the underlying data distribution, it can lead to a discrepan…

    Submitted 12 October, 2022; originally announced October 2022.

  6. arXiv:2208.01041  [pdf]

    eess.AS cs.HC cs.LG cs.MM cs.SD

    Voice Analysis for Stress Detection and Application in Virtual Reality to Improve Public Speaking in Real-time: A Review

    Authors: Arushi, Roberto Dillon, Ai Ni Teoh, Denise Dillon

    Abstract: Stress during public speaking is common and adversely affects performance and self-confidence. Extensive research has been carried out to develop various models to recognize emotional states. However, minimal research has been conducted to detect stress during public speaking in real time using voice analysis. In this context, the current review showed that the application of algorithms was not pr…

    Submitted 31 July, 2022; originally announced August 2022.

    Comments: 41 pages, 7 figures, 4 tables

    ACM Class: I.6; K.3; K.4; A.2