Learning Temporal Resolution in Spectrogram for Audio Classification

Liu, Haohe; Liu, Xubo; Kong, Qiuqiang; Wang, Wenwu; Plumbley, Mark D.

arXiv:2210.01719 (cs)

[Submitted on 4 Oct 2022 (v1), last revised 12 Jan 2024 (this version, v3)]

Abstract: The audio spectrogram is a time-frequency representation that has been widely used for audio classification. One of its key attributes is the temporal resolution, which depends on the hop size used in the Short-Time Fourier Transform (STFT). Previous work generally assumes the hop size should be a constant value (e.g., 10 ms). However, a fixed temporal resolution is not always optimal for different types of sound. The temporal resolution affects not only classification accuracy but also computational cost. This paper proposes a novel method, DiffRes, that enables differentiable temporal resolution modeling for audio classification. Given a spectrogram calculated with a fixed hop size, DiffRes merges non-essential time frames while preserving important frames. DiffRes acts as a "drop-in" module between an audio spectrogram and a classifier and can be jointly optimized with the classification task. We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods that use a fixed temporal resolution, the DiffRes-based method achieves equivalent or better classification accuracy with at least a 25% reduction in computational cost. We further show that DiffRes can improve classification accuracy by increasing the temporal resolution of the input acoustic features without adding to the computational cost.
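To make the link between hop size and temporal resolution concrete, the following minimal sketch counts how many STFT frames a clip produces at different hop sizes. It is not the DiffRes method; it only illustrates why the hop size drives the length of the classifier input. The sample rate (16 kHz), clip length (10 s), window length (25 ms), and the helper names `num_frames`, `ms_to_samples`, and `frame_counts` are assumptions chosen for illustration, and the frame count assumes no padding.

```python
# Illustrative sketch only: frame count of an unpadded STFT as a function of hop size.
# All constants below are assumed values, not settings taken from the paper.

SAMPLE_RATE = 16_000   # samples per second (assumed)
CLIP_SECONDS = 10      # clip duration in seconds (assumed)
WIN_MS = 25            # analysis window length in milliseconds (assumed)


def ms_to_samples(ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000))


def num_frames(num_samples: int, win_length: int, hop_length: int) -> int:
    """Number of full frames when the signal is not padded."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


def frame_counts(hops_ms=(10, 20, 40)) -> dict:
    """Map each hop size (ms) to the number of frames for the assumed clip."""
    n = SAMPLE_RATE * CLIP_SECONDS
    win = ms_to_samples(WIN_MS)
    return {h: num_frames(n, win, ms_to_samples(h)) for h in hops_ms}


if __name__ == "__main__":
    for hop_ms, frames in frame_counts().items():
        print(f"hop = {hop_ms} ms -> {frames} frames")
```

Under these assumptions, doubling the hop size roughly halves the number of frames passed to the classifier, which is the trade-off between temporal resolution and computational cost that the abstract describes.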

Comments: Accepted by the 38th Annual AAAI Conference on Artificial Intelligence
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Cite as: arXiv:2210.01719 [cs.SD] (or arXiv:2210.01719v3 [cs.SD] for this version)
DOI: https://doi.org/10.48550/arXiv.2210.01719

Submission history

From: Haohe Liu
[v1] Tue, 4 Oct 2022 16:18:50 UTC (11,433 KB)
[v2] Wed, 5 Oct 2022 11:37:04 UTC (11,433 KB)
[v3] Fri, 12 Jan 2024 18:35:17 UTC (12,862 KB)