-
BLAB: Brutally Long Audio Bench
Authors:
Orevaoghene Ahia,
Martijn Bartelds,
Kabir Ahuja,
Hila Gonen,
Valentin Hofmann,
Siddhant Arora,
Shuyue Stella Li,
Vishal Puttagunta,
Mofetoluwa Adeyemi,
Charishma Buchireddy,
Ben Walls,
Noah Bennett,
Shinji Watanabe,
Noah A. Smith,
Yulia Tsvetkov,
Sachin Kumar
Abstract:
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite…
▽ More
Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities.
△ Less
Submitted 12 May, 2025; v1 submitted 5 May, 2025;
originally announced May 2025.
-
Texture Generation Using A Graph Generative Adversarial Network And Differentiable Rendering
Authors:
Dharma KC,
Clayton T. Morrison,
Bradley Walls
Abstract:
Novel photo-realistic texture synthesis is an important task for generating novel scenes, including asset generation for 3D simulations. However, to date, these methods predominantly generate textured objects in 2D space. If we rely on 2D object generation, then we need to make a computationally expensive forward pass each time we change the camera viewpoint or lighting. Recent work that can gener…
▽ More
Novel photo-realistic texture synthesis is an important task for generating novel scenes, including asset generation for 3D simulations. However, to date, these methods predominantly generate textured objects in 2D space. If we rely on 2D object generation, then we need to make a computationally expensive forward pass each time we change the camera viewpoint or lighting. Recent work that can generate textures in 3D requires 3D component segmentation that is expensive to acquire. In this work, we present a novel conditional generative architecture that we call a graph generative adversarial network (GGAN) that can generate textures in 3D by learning object component information in an unsupervised way. In this framework, we do not need an expensive forward pass whenever the camera viewpoint or lighting changes, and we do not need expensive 3D part information for training, yet the model can generalize to unseen 3D meshes and generate appropriate novel 3D textures. We compare this approach against state-of-the-art texture generation methods and demonstrate that the GGAN obtains significantly better texture generation quality (according to Frechet inception distance). We release our model source code as open source.
△ Less
Submitted 7 February, 2023; v1 submitted 17 June, 2022;
originally announced June 2022.
-
Federated Reconnaissance: Efficient, Distributed, Class-Incremental Learning
Authors:
Sean M. Hendryx,
Dharma Raj KC,
Bradley Walls,
Clayton T. Morrison
Abstract:
We describe federated reconnaissance, a class of learning problems in which distributed clients learn new concepts independently and communicate that knowledge efficiently. In particular, we propose an evaluation framework and methodological baseline for a system in which each client is expected to learn a growing set of classes and communicate knowledge of those classes efficiently with other cli…
▽ More
We describe federated reconnaissance, a class of learning problems in which distributed clients learn new concepts independently and communicate that knowledge efficiently. In particular, we propose an evaluation framework and methodological baseline for a system in which each client is expected to learn a growing set of classes and communicate knowledge of those classes efficiently with other clients, such that, after knowledge merging, the clients should be able to accurately discriminate between classes in the superset of classes observed by the set of clients. We compare a range of learning algorithms for this problem and find that prototypical networks are a strong approach in that they are robust to catastrophic forgetting while incorporating new information efficiently. Furthermore, we show that the online averaging of prototype vectors is effective for client model merging and requires only a small amount of communication overhead, memory, and update time per class with no gradient-based learning or hyperparameter tuning. Additionally, to put our results in context, we find that a simple, prototypical network with four convolutional layers significantly outperforms complex, state of the art continual learning algorithms, increasing the accuracy by over 22% after learning 600 Omniglot classes and over 33% after learning 20 mini-ImageNet classes incrementally. These results have important implications for federated reconnaissance and continual learning more generally by demonstrating that communicating feature vectors is an efficient, robust, and effective means for distributed, continual learning.
△ Less
Submitted 31 August, 2021;
originally announced September 2021.