DAVE: Diagnostic benchmark for Audio Visual Evaluation

Radevski, Gorjan; Popordanoska, Teodora; Blaschko, Matthew B.; Tuytelaars, Tinne

arXiv:2503.09321 (cs)

[Submitted on 12 Mar 2025]

Abstract: Audio-visual understanding is a rapidly evolving field that seeks to integrate and interpret information from both auditory and visual modalities. Despite recent advances in multi-modal learning, existing benchmarks often suffer from strong visual bias -- where answers can be inferred from visual data alone -- and provide only aggregate scores that conflate multiple sources of error. This makes it difficult to determine whether models struggle with visual understanding, audio interpretation, or audio-visual alignment. In this work, we introduce DAVE (Diagnostic Audio Visual Evaluation), a novel benchmark dataset designed to systematically evaluate audio-visual models across controlled challenges. DAVE alleviates existing limitations by (i) ensuring both modalities are necessary to answer correctly and (ii) decoupling evaluation into atomic subcategories. Our detailed analysis of state-of-the-art models reveals specific failure modes and provides targeted insights for improvement. By offering this standardized diagnostic framework, we aim to facilitate more robust development of audio-visual models. The dataset is released: this https URL

Comments:	First two authors contributed equally
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2503.09321 [cs.CV]
	(or arXiv:2503.09321v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.09321

Submission history

From: Gorjan Radevski
[v1] Wed, 12 Mar 2025 12:12:46 UTC (6,046 KB)
