Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models

Gao, Kuofeng; Xia, Shu-Tao; Xu, Ke; Torr, Philip; Gu, Jindong

arXiv:2412.05167 (cs)

[Submitted on 6 Dec 2024 (v1), last revised 28 Jul 2025 (this version, v2)]

Abstract: Large Audio-Language Models (LALMs), such as GPT-4o, have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. This potential broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating LALMs on open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue ability of LALMs across 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where sentences with the same literal meaning express different intentions, e.g., "Really!?" spoken with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 16 LALMs, our analysis reveals that existing LALMs struggle with mathematical symbols and formulas, with understanding human behavior such as roleplay, with comprehending multiple languages, and with handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones. The benchmark is available at this https URL.

Comments:	Accepted by ACL 2025
Subjects:	Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2412.05167 [cs.AI]
	(or arXiv:2412.05167v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2412.05167

Submission history

From: Kuofeng Gao
[v1] Fri, 6 Dec 2024 16:34:15 UTC (802 KB)
[v2] Mon, 28 Jul 2025 15:07:08 UTC (923 KB)
