Training Audio Captioning Models without Audio

Deshmukh, Soham; Elizalde, Benjamin; Emmanouilidou, Dimitra; Raj, Bhiksha; Singh, Rita; Wang, Huaming

arXiv:2309.07372 (eess)

[Submitted on 14 Sep 2023]

Abstract: Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The creation of these audio-caption pairs is costly, resulting in general data scarcity for the task. In this work, we address this major limitation and propose an approach to train AAC systems using only text. Our approach leverages the multimodal space of contrastively trained audio-text models, such as CLAP. During training, a decoder generates captions conditioned on the pretrained CLAP text encoder. During inference, the text encoder is replaced with the pretrained CLAP audio encoder. To bridge the modality gap between text and audio embeddings, we propose the use of noise injection or a learnable adapter during training. We find that the proposed text-only framework performs competitively with state-of-the-art models trained with paired audio, showing that efficient text-to-audio transfer is possible. Finally, we showcase both stylized audio captioning and caption enrichment while training without audio or human-created text captions.
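
The train/inference swap described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `noise_std` parameter, the zero-mean Gaussian form of the noise, and the re-normalization step are all assumptions made for illustration, and plain Python lists stand in for real CLAP embeddings.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (assumed here; contrastive embeddings are often compared by cosine similarity)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inject_noise(embedding, noise_std, rng):
    """Add zero-mean Gaussian noise to each dimension, then re-normalize.

    noise_std is a hypothetical hyperparameter, not a value from the paper.
    """
    noisy = [x + rng.gauss(0.0, noise_std) for x in embedding]
    return l2_normalize(noisy)


def decoder_prefix(embedding, training, noise_std=0.1, rng=None):
    """Return the vector a caption decoder would be conditioned on.

    training=True:  `embedding` stands in for a text-encoder output and is perturbed.
    training=False: `embedding` stands in for an audio-encoder output and is used as-is.
    """
    if training:
        if rng is None:
            rng = random.Random()
        return inject_noise(embedding, noise_std, rng)
    return l2_normalize(embedding)


# Training step: condition on a perturbed text embedding (toy values).
text_embedding = [0.6, 0.8, 0.0]
train_prefix = decoder_prefix(text_embedding, training=True, rng=random.Random(0))

# Inference: condition on the audio embedding of the clip to caption (toy values).
audio_embedding = [0.5, 0.7, 0.2]
infer_prefix = decoder_prefix(audio_embedding, training=False)
```

The idea behind the perturbation is that a decoder trained on slightly noisy text embeddings may tolerate the offset between text and audio embeddings at inference time; the learnable adapter mentioned in the abstract is an alternative that this sketch does not cover.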

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2309.07372 [eess.AS]
	(or arXiv:2309.07372v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2309.07372

Submission history

From: Soham Deshmukh
[v1] Thu, 14 Sep 2023 01:16:02 UTC (779 KB)
