Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Hu, Hu; Yang, Chao-Han Huck; Xia, Xianjun; Bai, Xue; Tang, Xin; Wang, Yajian; Niu, Shutong; Chai, Li; Li, Juanjuan; Zhu, Hongning; Bao, Feng; Zhao, Yuanjun; Siniscalchi, Sabato Marco; Wang, Yannan; Du, Jun; Lee, Chin-Hui

arXiv:2007.08389 (eess)

[Submitted on 16 Jul 2020 (v1), last revised 27 Aug 2020 (this version, v2)]

Abstract: In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten fine-grained classes, and (ii) Task 1b concerns classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system that relies on an ad-hoc score combination of two convolutional neural networks (CNNs), which classify the acoustic input into three classes and ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we use a quantization method to reduce the complexity of two of our top-accuracy three-class CNN-based architectures. On the Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On the Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500 KB. Code is available: this https URL.
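
The two-stage idea in the abstract can be illustrated with a minimal sketch. It assumes each CNN outputs a probability vector (three coarse classes, ten fine classes) and that the scores are combined by multiplying each fine-class probability by the probability of its parent coarse class. The abstract only says "ad-hoc score combination", so this product rule is an assumption, not necessarily the authors' exact method. The class names and their grouping are likewise assumed from the DCASE 2020 Task 1 setup, and the function names are hypothetical.

```python
import numpy as np

# Assumed class lists (illustrative; not stated in the abstract).
FINE = ["airport", "shopping_mall", "metro_station",
        "street_pedestrian", "public_square", "street_traffic", "park",
        "bus", "tram", "metro"]
COARSE = ["indoor", "outdoor", "transportation"]

# PARENT[i] is the index in COARSE of the coarse class that contains FINE[i].
PARENT = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])


def two_stage_scores(p_coarse, p_fine):
    """Combine 3-class and 10-class probabilities (assumed product rule).

    p_coarse has shape (..., 3) and p_fine has shape (..., 10).
    Returns combined scores of shape (..., 10).
    """
    p_coarse = np.asarray(p_coarse, dtype=float)
    p_fine = np.asarray(p_fine, dtype=float)
    # Expand the coarse probabilities to one value per fine class, then multiply.
    return p_coarse[..., PARENT] * p_fine


def two_stage_predict(p_coarse, p_fine):
    """Return the index of the fine class with the highest combined score."""
    return np.argmax(two_stage_scores(p_coarse, p_fine), axis=-1)


# Made-up example: the 10-class model alone slightly prefers "street_traffic",
# but the 3-class model is confident the clip is "transportation".
p3 = np.array([0.1, 0.2, 0.7])
p10 = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.30, 0.05, 0.10, 0.25, 0.05])

fine_only = FINE[int(np.argmax(p10))]
combined = FINE[int(two_stage_predict(p3, p10))]
print(fine_only, "->", combined)
```

In this made-up example the coarse classifier overrides a narrow margin in the fine classifier, which is the kind of correction a two-stage scheme is meant to provide. Other combination rules (for example, a weighted sum of log-probabilities) would fit the same interface.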

Comments:	Revised Technical Report. The proposed systems placed second in both Task 1a and Task 1b of the official DCASE 2020 Challenge
Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2007.08389 [eess.AS]
	(or arXiv:2007.08389v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2007.08389

Submission history

From: C.-H. Huck Yang
[v1] Thu, 16 Jul 2020 15:07:14 UTC (52 KB)
[v2] Thu, 27 Aug 2020 00:33:27 UTC (52 KB)
