SLICER: Learning universal audio representations using low-resource self-supervised pre-training

Seth, Ashish; Ghosh, Sreyan; Umesh, S.; Manocha, Dinesh


arXiv:2211.01519 (eess)

[Submitted on 2 Nov 2022 (v1), last revised 18 May 2023 (this version, v2)]

Abstract: We present a new Self-Supervised Learning (SSL) approach to pre-train encoders on unlabeled audio data that reduces the need for large amounts of labeled data for audio and speech classification. Our primary aim is to learn audio representations that generalize across a wide variety of speech and non-speech tasks in a low-resource, unlabeled audio pre-training setting. Inspired by the recent success of clustering and contrastive learning paradigms for SSL-based speech representation learning, we propose SLICER (Symmetrical Learning of Instance and Cluster-level Efficient Representations), which brings together the best of both paradigms. We use a symmetric loss between latent representations from student and teacher encoders and simultaneously solve instance-level and cluster-level contrastive learning tasks. We obtain cluster representations online by simply projecting the input spectrogram into an output subspace whose dimension equals the number of clusters. In addition, we propose a novel mel-spectrogram augmentation procedure, k-mix, based on mixup, which does not require labels and aids unsupervised representation learning for audio. Overall, SLICER achieves state-of-the-art results on the LAPE Benchmark \cite{9868132}, significantly outperforming DeLoRes-M and other prior approaches, which are pre-trained on $10\times$ more unsupervised data. We will make all our code available on GitHub.
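
The abstract describes solving instance-level and cluster-level contrastive tasks at the same time, with cluster representations obtained by projecting into a space whose dimension equals the number of clusters. The sketch below illustrates that general idea in NumPy and is not the authors' implementation. Everything named here is an assumption made for illustration: the function names (nt_xent, cluster_level_loss), the temperatures, the use of a standard NT-Xent (InfoNCE) objective, and the choice to treat each column of the soft cluster-assignment matrix as one cluster's representation. The student/teacher encoders and k-mix are not modeled.

```python
import numpy as np

def _l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def _softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nt_xent(a, b, temperature=0.5):
    """NT-Xent loss over rows: a[i] and b[i] form the positive pair.

    Both directions (a -> b and b -> a) are averaged, so swapping the
    arguments gives the same value.
    """
    n = a.shape[0]
    z = np.concatenate([_l2_normalize(a), _l2_normalize(b)], axis=0)  # (2n, d)
    sim = z @ z.T / temperature                                       # (2n, 2n)
    np.fill_diagonal(sim, -np.inf)           # a row is never its own candidate
    # Index of the positive for every row of z.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    return float(np.mean(log_denominator - sim[np.arange(2 * n), pos]))

def cluster_level_loss(logits_a, logits_b, temperature=1.0):
    """Contrast clusters instead of instances (assumed formulation).

    logits_* has shape (batch, num_clusters). After a softmax over clusters,
    column k describes how the whole batch is assigned to cluster k; column k
    of view a and column k of view b are treated as a positive pair.
    """
    p_a = _softmax(logits_a, axis=1)
    p_b = _softmax(logits_b, axis=1)
    return nt_xent(p_a.T, p_b.T, temperature)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical projections of two augmented views of the same 8 clips.
    view_a = rng.normal(size=(8, 16))
    view_b = view_a + 0.1 * rng.normal(size=(8, 16))
    # Hypothetical cluster logits for 4 clusters.
    logits_a = rng.normal(size=(8, 4))
    logits_b = logits_a + 0.1 * rng.normal(size=(8, 4))
    total = nt_xent(view_a, view_b) + cluster_level_loss(logits_a, logits_b)
    print("instance + cluster loss:", total)
```

In this sketch the two terms are simply added; how the terms are actually weighted and how the teacher encoder is updated are not stated in the abstract and are left out.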

Comments:	ICASSP 2023
Subjects:	Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Cite as:	arXiv:2211.01519 [eess.AS]
	(or arXiv:2211.01519v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2211.01519

Submission history

From: Sreyan Ghosh
[v1] Wed, 2 Nov 2022 23:45:33 UTC (39,231 KB)
[v2] Thu, 18 May 2023 01:31:48 UTC (36,185 KB)
