Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

Labbé, Etienne; Pellegrini, Thomas; Pinquier, Julien

arXiv:2308.15090 (cs)

[Submitted on 29 Aug 2023]

Title: Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

Authors: Etienne Labbé (IRIT-SAMoVA), Thomas Pellegrini (IRIT-SAMoVA), Julien Pinquier (IRIT-SAMoVA)

Abstract: Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording using a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given textual query (Text-to-Audio) or vice versa (Audio-to-Text). These tasks usually require different types of systems: AAC employs a sequence-to-sequence model, while ATR utilizes a ranking model that compares audio and text representations within a shared projection subspace. In this work, however, we investigate the relationship between AAC and ATR by exploring the ATR capabilities of an unmodified AAC system, without fine-tuning for the new task. Our AAC system consists of an audio encoder (ConvNeXt-Tiny) trained on AudioSet for audio tagging, and a transformer decoder responsible for generating sentences. For AAC, it achieves an average SPIDEr-FL score of 0.298 on Clotho and 0.472 on AudioCaps. For ATR, we propose using the standard Cross-Entropy loss values obtained for any audio/caption pair as matching scores. Experimental results on the Clotho and AudioCaps datasets demonstrate decent recall values using this simple approach. For instance, we obtained a Text-to-Audio R@1 value of 0.382 for AudioCaps, which is above the current state-of-the-art method without external data. Interestingly, we observe that normalizing the loss values was necessary for Audio-to-Text retrieval.
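The retrieval idea described in the abstract can be illustrated with a minimal sketch. It assumes that a matrix `loss[a][c]` of Cross-Entropy values has already been computed by scoring every caption `c` against every audio `a` with the captioning model; the toy values below are made up for illustration and are not results from the paper. The abstract does not say how the loss values are normalized, so the per-caption z-score used here is only an assumed scheme, and all function names are hypothetical.

```python
from statistics import mean, pstdev

def rank_text_to_audio(loss, caption_idx):
    """Rank audios for one caption: lower loss means a better match."""
    column = [row[caption_idx] for row in loss]
    return sorted(range(len(column)), key=lambda a: column[a])

def normalize_per_caption(loss):
    """Z-score each caption's losses across audios (assumed scheme, not from the paper)."""
    n_audio, n_caps = len(loss), len(loss[0])
    out = [[0.0] * n_caps for _ in range(n_audio)]
    for c in range(n_caps):
        col = [loss[a][c] for a in range(n_audio)]
        mu, sigma = mean(col), pstdev(col)
        for a in range(n_audio):
            out[a][c] = (loss[a][c] - mu) / sigma if sigma > 0 else 0.0
    return out

def rank_audio_to_text(loss, audio_idx, normalize=True):
    """Rank captions for one audio, optionally on normalized losses."""
    scores = normalize_per_caption(loss) if normalize else loss
    row = scores[audio_idx]
    return sorted(range(len(row)), key=lambda c: row[c])

def recall_at_k(rankings, targets, k):
    """Fraction of queries whose target appears in the top-k of its ranking."""
    hits = sum(1 for r, t in zip(rankings, targets) if t in r[:k])
    return hits / len(rankings)

# Toy loss matrix (made-up values): 3 audios x 4 captions.
# Captions 0..2 describe audios 0..2; caption 3 is a short generic
# sentence that gets a low loss for every audio.
loss = [
    [1.0, 3.0, 3.2, 0.90],
    [3.1, 1.2, 2.9, 0.80],
    [2.9, 3.0, 1.1, 0.85],
]
targets = [0, 1, 2]

t2a = [rank_text_to_audio(loss, c) for c in targets]
a2t_raw = [rank_audio_to_text(loss, a, normalize=False) for a in targets]
a2t_norm = [rank_audio_to_text(loss, a, normalize=True) for a in targets]

t2a_r1 = recall_at_k(t2a, targets, 1)
a2t_raw_r1 = recall_at_k(a2t_raw, targets, 1)
a2t_norm_r1 = recall_at_k(a2t_norm, targets, 1)
print(t2a_r1, a2t_raw_r1, a2t_norm_r1)
```

In this toy setup the generic caption wins every Audio-to-Text query when raw losses are compared, because its loss is low regardless of the audio. Comparing each caption's loss relative to its own distribution across audios removes that bias, which is one plausible reading of why normalization matters in that direction.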

Comments:	Camera-ready version (14/08/23)
Subjects:	Computation and Language (cs.CL); Information Retrieval (cs.IR); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2308.15090 [cs.CL]
	(or arXiv:2308.15090v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2308.15090
Journal reference:	DCASE2023, Sep 2023, Tampere, Finland

Submission history

From: Etienne Labbe [via CCSD proxy]
[v1] Tue, 29 Aug 2023 07:53:17 UTC (290 KB)
