RECAP: Retrieval-Augmented Audio Captioning

Ghosh, Sreyan; Kumar, Sonal; Evuru, Chandra Kiran Reddy; Duraiswami, Ramani; Manocha, Dinesh

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2309.09836 (eess)

[Submitted on 18 Sep 2023 (v1), last revised 6 Jun 2024 (this version, v2)]

Abstract: We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio clip and on captions similar to that clip retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for additional fine-tuning. To generate a caption for an audio sample, we leverage the audio-text model CLAP to retrieve captions similar to it from a replaceable datastore; these captions are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition caption generation on the audio. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, because it can exploit a large datastore of text captions only, in a training-free fashion, RECAP shows unique capabilities for captioning novel audio events never seen during training and compositional audio containing multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
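The retrieve-then-prompt step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for CLAP embeddings, the function names `retrieve_captions` and `build_prompt` are hypothetical, and the prompt wording is an assumed template. The GPT-2 decoder and its cross-attention layers are not modeled here.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_captions(audio_emb, datastore, k=2):
    """Return the k captions whose text embeddings are closest to the audio embedding.

    `datastore` is a list of (caption, embedding) pairs. In a real system the
    embeddings would come from an audio-text model such as CLAP; here they are
    hand-written toy vectors (an assumption of this sketch).
    """
    ranked = sorted(datastore,
                    key=lambda item: cosine(audio_emb, item[1]),
                    reverse=True)
    return [caption for caption, _ in ranked[:k]]

def build_prompt(captions):
    """Assemble a text prompt from retrieved captions (hypothetical template)."""
    lines = ["Audios similar to this audio sound like:"]
    lines += [f"- {c}" for c in captions]
    lines.append("This audio sounds like:")
    return "\n".join(lines)

# Toy caption datastore: swapping this list changes what is retrieved
# without touching any model weights.
datastore = [
    ("a dog barks in the distance", [0.9, 0.1, 0.0]),
    ("rain falls on a tin roof", [0.0, 0.2, 0.9]),
    ("a puppy whines and barks", [0.8, 0.3, 0.1]),
]
audio_emb = [1.0, 0.2, 0.0]  # stand-in for the embedding of the input audio

top = retrieve_captions(audio_emb, datastore, k=2)
prompt = build_prompt(top)
print(prompt)
```

Because the datastore is an ordinary list of caption and embedding pairs, replacing it with captions from another domain changes the retrieved context without retraining, which is the property the abstract attributes to the replaceable datastore.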

Comments:	ICASSP 2024. Code and data: this https URL
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2309.09836 [eess.AS]
	(or arXiv:2309.09836v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2309.09836

Submission history

From: Sreyan Ghosh
[v1] Mon, 18 Sep 2023 14:53:08 UTC (10,613 KB)
[v2] Thu, 6 Jun 2024 17:21:37 UTC (10,613 KB)
