Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Kuan, Chun-Yi; Lee, Hung-yi

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2410.16130 (eess)

[Submitted on 21 Oct 2024 (v1), last revised 31 Dec 2024 (this version, v2)]

Title:Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Authors:Chun-Yi Kuan, Hung-yi Lee

View PDF HTML (experimental)

Abstract:Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a multi-turn chain-of-thought approach, which demonstrates significantly improved model performance across the proposed tasks.

Comments:	Accepted to ICASSP 2025. Project Website: this https URL
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2410.16130 [eess.AS]
	(or arXiv:2410.16130v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2410.16130

Submission history

From: Chun-Yi Kuan [view email]
[v1] Mon, 21 Oct 2024 15:55:27 UTC (3,434 KB)
[v2] Tue, 31 Dec 2024 09:35:31 UTC (3,434 KB)

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators