Phoneme-Level Contrastive Learning for User-Defined Keyword Spotting with Flexible Enrollment

Kewei, Li; Hengshun, Zhou; Kai, Shen; Yusheng, Dai; Jun, Du

arXiv:2412.20805 (eess)

[Submitted on 30 Dec 2024]

Abstract: User-defined keyword spotting (KWS) enhances the user experience by allowing individuals to customize keywords. However, in open-vocabulary scenarios, most existing methods suffer from high false alarm rates on confusable words and are limited to either audio-only or text-only enrollment. In this paper, we therefore first explore the model's robustness against confusable words. Specifically, we propose Phoneme-Level Contrastive Learning (PLCL), which refines and aligns query and source feature representations at the phoneme level. This method enhances the model's disambiguation capability through fine-grained positive and negative comparisons for more accurate alignment, and it generalizes to jointly optimizing audio-text and audio-audio matching, adapting to various enrollment modes. Furthermore, we maintain a context-agnostic phoneme memory bank to construct confusable negatives for data augmentation. Building on this, a third-category discriminator is specifically designed to distinguish hard negatives. Overall, we develop a robust and flexible KWS system that supports enrollment in different modalities within a unified framework. Evaluated on the LibriPhrase dataset, the proposed approach achieves state-of-the-art performance.
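
The abstract does not give the PLCL objective itself, so the following is only a minimal sketch of what a phoneme-level contrastive comparison could look like, written as a generic InfoNCE-style loss rather than the authors' formulation. It assumes that the query and source have already been reduced to one embedding per phoneme and that row i of each describes the same phoneme; the function name `phoneme_info_nce`, the `temperature` value, and the toy data are all hypothetical.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def phoneme_info_nce(query, source, temperature=0.1):
    """Generic InfoNCE loss over phoneme-level embeddings (illustrative only).

    query, source: lists of equal length, one embedding per phoneme.
    Row i of `query` and row i of `source` are assumed to be a positive
    pair; every other source row serves as a negative for query row i.
    """
    q = [_normalize(v) for v in query]
    s = [_normalize(v) for v in source]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity to every source phoneme.
        logits = [sum(a * b for a, b in zip(qi, sj)) / temperature for sj in s]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching phoneme.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy data (assumption): 8 phonemes with 16-dimensional embeddings.
random.seed(0)
source = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
# A query whose phonemes closely match the source, position by position.
aligned = [[x + random.gauss(0, 0.01) for x in vec] for vec in source]
# The same query shifted by one phoneme, so every pair is mismatched.
misaligned = aligned[1:] + aligned[:1]

print("aligned loss:   ", phoneme_info_nce(aligned, source))
print("misaligned loss:", phoneme_info_nce(misaligned, source))
```

Because the function only compares two sets of embeddings, the same sketch applies whether the query comes from a text encoder (audio-text matching) or an audio encoder (audio-audio matching), which is the sense in which a single phoneme-level objective can serve both enrollment modes. The memory bank and third-category discriminator described in the abstract are not modeled here.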

Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2412.20805 [eess.AS]
	(or arXiv:2412.20805v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2412.20805

Submission history

From: Kewei Li
[v1] Mon, 30 Dec 2024 08:59:20 UTC (1,365 KB)
