Search | arXiv e-print repository

BLAB: Brutally Long Audio Bench

Authors: Orevaoghene Ahia, Martijn Bartelds, Kabir Ahuja, Hila Gonen, Valentin Hofmann, Siddhant Arora, Shuyue Stella Li, Vishal Puttagunta, Mofetoluwa Adeyemi, Charishma Buchireddy, Ben Walls, Noah Bennett, Shinji Watanabe, Noah A. Smith, Yulia Tsvetkov, Sachin Kumar

Abstract: Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limite… ▽ More Developing large audio language models (LMs) capable of understanding diverse spoken interactions is essential for accommodating the multimodal nature of human communication and can increase the accessibility of language technologies across different user populations. Recent work on audio LMs has primarily evaluated their performance on short audio segments, typically under 30 seconds, with limited exploration of long-form conversational speech segments that more closely reflect natural user interactions with these models. We introduce Brutally Long Audio Bench (BLAB), a challenging long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks using audio segments averaging 51 minutes in length. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. Our audio data were collected from permissively licensed sources and underwent a human-assisted filtering process to ensure task compliance. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks in BLAB. Our comprehensive analysis reveals key insights into the trade-offs between task difficulty and audio duration. In general, we find that audio LMs struggle with long-form speech, with performance declining as duration increases. They perform poorly on localization, temporal reasoning, counting, and struggle to understand non-phonemic information, relying more on prompts than audio content. BLAB serves as a challenging evaluation framework to develop audio LMs with robust long-form audio understanding capabilities. △ Less

Submitted 12 May, 2025; v1 submitted 5 May, 2025; originally announced May 2025.

arXiv:2501.01464 [pdf, other]

doi 10.1007/978-3-031-43999-5_13

Estimation of 3T MR images from 1.5T images regularized with Physics based Constraint

Authors: Prabhjot Kaur, Atul Singh Minhas, Chirag Kamal Ahuja, Anil Kumar Sao

Abstract: Limited accessibility to high field MRI scanners (such as 7T, 11T) has motivated the development of post-processing methods to improve low field images. Several existing post-processing methods have shown the feasibility to improve 3T images to produce 7T-like images [3,18]. It has been observed that improving lower field (LF, <=1.5T) images comes with additional challenges due to poor image quali… ▽ More Limited accessibility to high field MRI scanners (such as 7T, 11T) has motivated the development of post-processing methods to improve low field images. Several existing post-processing methods have shown the feasibility to improve 3T images to produce 7T-like images [3,18]. It has been observed that improving lower field (LF, <=1.5T) images comes with additional challenges due to poor image quality such as the function mapping 1.5T and higher field (HF, 3T) images is more complex than the function relating 3T and 7T images [10]. Except for [10], no method has been addressed to improve <=1.5T MRI images. Further, most of the existing methods [3,18] including [10] require example images, and also often rely on pixel to pixel correspondences between LF and HF images which are usually inaccurate for <=1.5T images. The focus of this paper is to address the unsupervised framework for quality improvement of 1.5T images and avoid the expensive requirements of example images and associated image registration. The LF and HF images are assumed to be related by a linear transformation (LT). The unknown HF image and unknown LT are estimated in alternate minimization framework. Further, a physics based constraint is proposed that provides an additional non-linear function relating LF and HF images in order to achieve the desired high contrast in estimated HF image. The experimental results demonstrate that the proposed approach provides processed 1.5T images, i.e., estimated 3T-like images with improved image quality, and is comparably better than the existing methods addressing similar problems. The improvement in image quality is also shown to provide better tissue segmentation and volume quantification as compared to scanner acquired 1.5T images. △ Less

Submitted 31 December, 2024; originally announced January 2025.

Comments: conference paper

Journal ref: Medical Image Computing and Computer Assisted Intervention - MICCAI 2023. Lecture Notes in Computer Science, vol 14229. Springer, Cham

arXiv:2405.01600 [pdf, ps, other]

Improved and Explainable Cervical Cancer Classification using Ensemble Pooling of Block Fused Descriptors

Authors: Saurabh Saini, Kapil Ahuja, Akshat S. Chauhan

Abstract: Cervical cancer is the second most common cancer in women and causes high death rates. Earlier models for detecting cervical cancer had limited success. In this work, we propose new models that substantially outperform previous models. Previous studies show that pretrained ResNets extract features from cervical cancer images well. Hence, our first model involves working with three ResNets (50, 1… ▽ More Cervical cancer is the second most common cancer in women and causes high death rates. Earlier models for detecting cervical cancer had limited success. In this work, we propose new models that substantially outperform previous models. Previous studies show that pretrained ResNets extract features from cervical cancer images well. Hence, our first model involves working with three ResNets (50, 101, 152). All the existing works use only the last convolution block of their respective ResNet, which captures abstract features (e.g., shapes, objects). However, we believe that detailed features (e.g., color, edges, texture), coming from earlier convolution blocks, are equally important for cancer (specifically cervical cancer) classification. Since now the number of features become large, we use a novel feature selection technique of Global Max Pooling for detailed features and Global Average Pooling for abstract features. Hence, our second model consists of the resulting Cascaded Block Fused variants of the three ResNets. To improve the performance further, we combine and normalize the features of the three standard ResNets as well as our proposed three Cascaded Block Fused ResNets. This type of combination is also new in cancer classification domain (also in cervical cancer), and results in our third and fourth models, respectively. We use a linear SVM for classification. We exhaustively perform experiments on two public datasets, IARC and AnnoCerv, achieving an average performance of 97.92% and 92.97% surpassing standard ResNets performance of 90.89% and 87.97%, respectively. We outperform the competitive approach available on IARC dataset with an average gain of 13.20%, while no prior competitive work available on AnnoCerv. Additionally, we introduce a novel SHAP+LIME explainability method, accurately identifying the cancerous region in 97% of cases. △ Less

Submitted 24 June, 2025; v1 submitted 1 May, 2024; originally announced May 2024.

Comments: 26 Pages, 10 figures, and 8 tables

ACM Class: I.2.1; I.5.2

Showing 1–3 of 3 results for author: Ahuja, K