
Showing 1–18 of 18 results for author: Nieto, O

Searching in archive eess.
  1. arXiv:2505.07365  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.MM eess.AS

    Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

    Authors: Chao-Han Huck Yang, Sreyan Ghosh, Qing Wang, Jaeyeon Kim, Hengyi Hong, Sonal Kumar, Guirui Zhong, Zhifeng Kong, S Sakshi, Vaibhavi Lokegaonkar, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha, Gunhee Kim, Jun Du, Rafael Valle, Bryan Catanzaro

    Abstract: We present Task 5 of the DCASE 2025 Challenge: an Audio Question Answering (AQA) benchmark spanning multiple domains of sound understanding. This task defines three QA subsets (Bioacoustics, Temporal Soundscapes, and Complex QA) to test audio-language models on interactive question-answering over diverse acoustic scenes. We describe the dataset composition (from marine mammal calls to soundscapes…

    Submitted 12 May, 2025; originally announced May 2025.

    Comments: Preprint. DCASE 2025 Audio QA Challenge: https://dcase.community/challenge2025/task-audio-question-answering

  2. arXiv:2505.05335  [pdf, ps, other]

    cs.SD eess.AS

    FLAM: Frame-Wise Language-Audio Modeling

    Authors: Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon

    Abstract: Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack fine-grained labeling capability to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are…

    Submitted 8 June, 2025; v1 submitted 8 May, 2025; originally announced May 2025.

    Comments: Accepted at ICML 2025. V2: fixed small typo on eq. 15 and eq. 17

  3. arXiv:2504.09730  [pdf, ps, other]

    eess.SY math.OC

    Learning-based decentralized control with collision avoidance for multi-agent systems

    Authors: Omayra Yago Nieto, Alexandre Anahory Simoes, Juan I. Giribet, Leonardo J. Colombo

    Abstract: In this paper, we present a learning-based tracking controller based on Gaussian processes (GP) for collision avoidance of multi-agent systems where the agents evolve in the special Euclidean group in the space SE(3). In particular, we use GPs to estimate certain uncertainties that appear in the dynamics of the agents. The control algorithm is designed to learn and mitigate these uncertainties by…

    Submitted 13 April, 2025; originally announced April 2025.

    Comments: 9 pages

  4. arXiv:2412.09789  [pdf, other]

    cs.SD eess.AS

    SILA: Signal-to-Language Augmentation for Enhanced Control in Text-to-Audio Generation

    Authors: Sonal Kumar, Prem Seetharaman, Justin Salamon, Dinesh Manocha, Oriol Nieto

    Abstract: The field of text-to-audio generation has seen significant advancements, and yet the ability to finely control the acoustic characteristics of generated audio remains under-explored. In this paper, we introduce a novel yet simple approach to generate sound effects with control over key acoustic parameters such as loudness, pitch, reverb, fade, brightness, noise and duration, enabling creative appl…

    Submitted 12 December, 2024; originally announced December 2024.

    Comments: Website: https://sonalkum.github.io/SILA/

  5. arXiv:2412.08550  [pdf, other]

    cs.SD eess.AS

    Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

    Authors: Hugo Flores García, Oriol Nieto, Justin Salamon, Bryan Pardo, Prem Seetharaman

    Abstract: We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffus…

    Submitted 14 April, 2025; v1 submitted 11 December, 2024; originally announced December 2024.

  6. arXiv:2411.17698  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Video-Guided Foley Sound Generation with Multimodal Controls

    Authors: Ziyang Chen, Prem Seetharaman, Bryan Russell, Oriol Nieto, David Bourgin, Andrew Owens, Justin Salamon

    Abstract: Generating sound effects for videos often requires creating artistic sound effects that diverge significantly from real-life sources and flexible control in the sound design. To address this problem, we introduce MultiFoley, a model designed for video-guided sound generation that supports multimodal conditioning through text, audio, and video. Given a silent video and a text prompt, MultiFoley all…

    Submitted 17 March, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: Accepted at CVPR 2025. Project site: https://ificl.github.io/MultiFoley/

  7. arXiv:2410.19168  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    Authors: S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

    Abstract: The ability to comprehend audio--which includes speech, non-speech sounds, and music--is crucial for AI agents to interact effectively with the world. We present MMAU, a novel benchmark designed to evaluate multimodal audio understanding models on tasks requiring expert-level knowledge and complex reasoning. MMAU comprises 10k carefully curated audio clips paired with human-annotated natural langu…

    Submitted 24 October, 2024; originally announced October 2024.

    Comments: Project Website: https://sakshi113.github.io/mmau_homepage/

  8. arXiv:2409.11498  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

    Authors: Ilaria Manco, Justin Salamon, Oriol Nieto

    Abstract: Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a mor…

    Submitted 17 September, 2024; originally announced September 2024.

    Comments: To appear in the Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)

  9. arXiv:2409.09213  [pdf, other]

    eess.AS cs.CL cs.SD

    ReCLAP: Improving Zero Shot Audio Classification by Describing Sounds

    Authors: Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: Open-vocabulary audio-language models, like CLAP, offer a promising approach for zero-shot audio classification (ZSAC) by enabling classification with any arbitrary set of categories specified with natural language prompts. In this paper, we propose a simple but effective method to improve ZSAC with CLAP. Specifically, we shift from the conventional method of using prompts with abstract category l…

    Submitted 13 September, 2024; originally announced September 2024.

    Comments: Code and Checkpoints: https://github.com/Sreyan88/ReCLAP

  10. arXiv:2406.11768  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

    Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including feat…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Project Website: https://sreyan88.github.io/gamaaudio/

  11. arXiv:2310.08753  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

    Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification, audio retrieval, etc. However, the ability of these models to effectively perfo…

    Submitted 30 July, 2024; v1 submitted 12 October, 2023; originally announced October 2023.

    Comments: ICLR 2024. Project Page: https://sreyan88.github.io/compa_iclr/

  12. arXiv:2308.09089  [pdf, other]

    cs.SD cs.CV cs.IR cs.MM eess.AS

    Bridging High-Quality Audio and Video via Language for Sound Effects Retrieval from Visual Queries

    Authors: Julia Wilkins, Justin Salamon, Magdalena Fuentes, Juan Pablo Bello, Oriol Nieto

    Abstract: Finding the right sound effects (SFX) to match moments in a video is a difficult and time-consuming task, and relies heavily on the quality and completeness of text metadata. Retrieving high-quality (HQ) SFX using a video frame directly as the query is an attractive alternative, removing the reliance on text metadata and providing a low barrier to entry for non-experts. Due to the lack of HQ audio…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: WASPAA 2023. Project page: https://juliawilkins.github.io/sound-effects-retrieval-from-video/. 4 pages, 2 figures, 2 tables

  13. arXiv:2303.10667  [pdf, other]

    cs.SD eess.AS

    Audio-Text Models Do Not Yet Leverage Natural Language

    Authors: Ho-Hsiang Wu, Oriol Nieto, Juan Pablo Bello, Justin Salamon

    Abstract: Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In…

    Submitted 19 March, 2023; originally announced March 2023.

    Comments: Copyright 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  14. arXiv:2204.13289  [pdf, other]

    cs.SD cs.LG eess.AS

    Music Enhancement via Image Translation and Vocoding

    Authors: Nikhil Kandpal, Oriol Nieto, Zeyu Jin

    Abstract: Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality music recordings by combining (i) an image-to-image translation model for manipulating audio in its mel-spectrogram representation and (ii) a music vocoding mode…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: ICASSP 2022

  15. arXiv:2010.16030  [pdf, other]

    cs.IR cs.MM cs.SD eess.AS

    Multimodal Metric Learning for Tag-based Music Retrieval

    Authors: Minz Won, Sergio Oramas, Oriol Nieto, Fabien Gouyon, Xavier Serra

    Abstract: Tag-based music retrieval is crucial to browse large-scale music libraries efficiently. Hence, automatic music tagging has been actively explored, mostly as a classification task, which has an inherent limitation: a fixed vocabulary. On the other hand, metric learning enables flexible vocabularies by using pretrained word embeddings as side information. Also, metric learning has already proven its…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: 5 pages, 2 figures, submitted to ICASSP 2021

  16. arXiv:2010.11512  [pdf, other]

    cs.SD cs.IR eess.AS

    Mood Classification Using Listening Data

    Authors: Filip Korzeniowski, Oriol Nieto, Matthew McCallum, Minz Won, Sergio Oramas, Erik Schmidt

    Abstract: The mood of a song is a highly relevant feature for exploration and recommendation in large collections of music. These collections tend to require automatic methods for predicting such moods. In this work, we show that listening-based features outperform content-based ones when classifying moods: embeddings obtained through matrix factorization of listening data appear to be more informative of a…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Appears in Proc. of the International Society for Music Information Retrieval Conference 2020 (ISMIR 2020)

  17. arXiv:1802.03319  [pdf, other]

    stat.ML cs.SD eess.AS

    Predicting Audio Advertisement Quality

    Authors: Samaneh Ebrahimi, Hossein Vahabi, Matthew Prockup, Oriol Nieto

    Abstract: Online audio advertising is a particular form of advertising used abundantly in online music streaming services. In these platforms, which tend to host tens of thousands of unique audio advertisements (ads), providing high quality ads ensures a better user experience and results in longer user engagement. Therefore, the automatic assessment of these ads is an important step toward audio ads rankin…

    Submitted 9 February, 2018; originally announced February 2018.

    Comments: WSDM '18 Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 9 pages

    Journal ref: 2018. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM '18)

  18. arXiv:1711.02520  [pdf, other]

    cs.SD eess.AS

    End-to-end learning for music audio tagging at scale

    Authors: Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, Xavier Serra

    Abstract: The lack of data tends to limit the outcomes of deep learning research, particularly when dealing with end-to-end learning stacks processing raw data such as waveforms. In this study, 1.2M tracks annotated with musical labels are available to train our end-to-end models. This large amount of data allows us to unrestrictedly explore two different design paradigms for music auto-tagging: assumption-…

    Submitted 15 June, 2018; v1 submitted 7 November, 2017; originally announced November 2017.

    Comments: Presented at the Workshop on Machine Learning for Audio Signal Processing (ML4Audio) at NIPS 2017, and in proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR2018). Code: https://github.com/jordipons/music-audio-tagging-at-scale-models. Demo: http://www.jordipons.me/apps/music-audio-tagging-at-scale-demo/