Interactive Dual-Conformer with Scene-Inspired Mask for Soft Sound Event Detection

Yin, Han; Bai, Jisheng; Wang, Mou; Shi, Dongyuan; Gan, Woon-Seng; Chen, Jianfeng

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2311.14068 (eess)

[Submitted on 23 Nov 2023 (v1), last revised 7 Dec 2023 (this version, v2)]

Abstract: Traditional binary hard labels for sound event detection (SED) lack details about the complexity and variability of sound event distributions. Recently, a novel annotation workflow was proposed to generate fine-grained non-binary soft labels, resulting in a new real-life SED dataset named MAESTRO Real. In this paper, we first propose an interactive dual-conformer (IDC) module, in which a cross-interaction mechanism is applied to effectively exploit the information from soft labels. In addition, a novel scene-inspired mask (SIM) based on soft labels is incorporated for more precise SED predictions. The SIM is initially generated through a statistical approach, referred to as SIM-V1. However, the fixed artificial mask may mismatch the SED model, resulting in limited effectiveness. Therefore, we further propose SIM-V2, which employs a word embedding model for adaptive SIM estimation. Experimental results show that the proposed IDC module can effectively utilize the information from soft labels, and that the integration of SIM-V1 can further improve the accuracy. In addition, the impact of different word embedding dimensions on SIM-V2 is explored, and the results show that an appropriate dimension enables SIM-V2 to outperform SIM-V1. In DCASE 2023 Challenge Task 4B, the proposed system achieved the top-ranking performance on the evaluation dataset of MAESTRO Real.
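The abstract does not say how the statistical mask is computed, so the following is only an illustrative sketch of one way a fixed scene-based mask could be derived from soft labels and applied to predictions. It assumes that an event class is allowed in a scene if its soft label ever exceeds a threshold in that scene's training clips. The function names `build_scene_mask` and `apply_mask`, the `threshold` parameter, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_scene_mask(clips, num_events, threshold=0.0):
    """Derive a fixed binary mask per scene from soft labels (assumed rule).

    clips: iterable of (scene_name, frames), where frames is a list of
           per-frame soft-label vectors with values in [0, 1].
    Returns a dict mapping scene_name to a list of 0.0/1.0 entries,
    one per event class.
    """
    # Track the highest soft-label value seen for each event in each scene.
    peak = defaultdict(lambda: [0.0] * num_events)
    for scene, frames in clips:
        for frame in frames:
            for k, value in enumerate(frame):
                if value > peak[scene][k]:
                    peak[scene][k] = value
    # An event is kept for a scene only if it was ever active above threshold.
    return {
        scene: [1.0 if v > threshold else 0.0 for v in values]
        for scene, values in peak.items()
    }


def apply_mask(predictions, mask):
    """Multiply each frame of model output element-wise by the scene mask."""
    return [[p * m for p, m in zip(frame, mask)] for frame in predictions]


# Toy data: two scenes, three event classes (hypothetical values).
train_clips = [
    ("cafe", [[0.8, 0.0, 0.2], [0.5, 0.0, 0.0]]),
    ("street", [[0.0, 0.9, 0.0]]),
]
masks = build_scene_mask(train_clips, num_events=3)
masked = apply_mask([[0.5, 0.5, 0.5]], masks["cafe"])
```

In this toy setup, the second event class never occurs in the "cafe" scene, so its prediction is suppressed for clips from that scene. A fixed rule of this kind cannot adapt to the model during training, which is the limitation the abstract cites as motivation for the learned SIM-V2.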

Comments:	to be improved (unfinished)
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2311.14068 [eess.AS]
	(or arXiv:2311.14068v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2311.14068

Submission history

From: Han Yin
[v1] Thu, 23 Nov 2023 15:51:53 UTC (1,883 KB)
[v2] Thu, 7 Dec 2023 14:37:14 UTC (2,957 KB)
