Few-Shot Speaker Identification Using Lightweight Prototypical Network with Feature Grouping and Interaction

Li, Yanxiong; Chen, Hao; Cao, Wenchang; Huang, Qisheng; He, Qianhua

doi:10.1109/TMM.2023.3253301

arXiv:2305.19541 (eess)

[Submitted on 31 May 2023]

Abstract: Existing methods for few-shot speaker identification (FSSI) achieve high accuracy, but their computational complexity and model size must be reduced for lightweight applications. In this work, we propose an FSSI method using a lightweight prototypical network, with the ultimate goal of implementing FSSI on resource-limited intelligent terminals such as smart watches and smart speakers. In the proposed prototypical network, an embedding module is designed to perform feature grouping, which reduces the memory requirement and computational complexity, and feature interaction, which enhances the representational ability of the learned speaker embedding. In the proposed embedding module, the audio feature of each speech sample is split into several low-dimensional feature subsets that are transformed in parallel by a recurrent convolutional block. Then, the operations of averaging, addition, concatenation, element-wise summation, and statistics pooling are executed sequentially to learn a speaker embedding for each speech sample. The recurrent convolutional block consists of a bidirectional long short-term memory block and a de-redundancy convolution block, in which feature grouping and interaction are also conducted. Our method is compared with baseline methods on three datasets drawn from three public speech corpora (VoxCeleb1, VoxCeleb2, and LibriSpeech). The results show that our method obtains higher accuracy under several conditions and has advantages over all baseline methods in computational complexity and model size.
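
The abstract names three generic building blocks that can be illustrated independently of the paper's architecture: splitting a feature into low-dimensional subsets, statistics pooling over time, and prototype-based classification. The following is a minimal sketch using only the Python standard library, not the authors' embedding module (it omits the recurrent convolutional block and the feature interaction steps entirely). All function names are hypothetical, and the Euclidean distance and the mean-plus-standard-deviation form of statistics pooling are assumptions based on common practice, since the abstract does not specify them.

```python
import math
from statistics import fmean, pstdev


def split_features(frames, num_groups):
    """Split each frame's feature vector into num_groups equal-width subsets.

    frames: list of per-frame feature vectors (lists of floats).
    Returns one list of frames per group.
    """
    dim = len(frames[0])
    if dim % num_groups != 0:
        raise ValueError("feature dimension must be divisible by num_groups")
    width = dim // num_groups
    return [
        [frame[g * width:(g + 1) * width] for frame in frames]
        for g in range(num_groups)
    ]


def statistics_pooling(frames):
    """Pool over time: concatenate per-dimension mean and standard deviation.

    Assumption: population standard deviation; the paper may differ.
    """
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means + stds


def compute_prototypes(support):
    """Average the support embeddings of each speaker into one prototype.

    support: dict mapping speaker label -> list of embedding vectors.
    """
    return {
        speaker: [fmean(col) for col in zip(*embeddings)]
        for speaker, embeddings in support.items()
    }


def classify(query, prototypes):
    """Return the speaker whose prototype is nearest to the query embedding.

    Assumption: Euclidean distance as the similarity measure.
    """
    return min(prototypes, key=lambda spk: math.dist(query, prototypes[spk]))


# Toy usage with made-up 2-D embeddings (two support samples per speaker).
support = {
    "alice": [[0.0, 0.0], [0.2, 0.0]],
    "bob": [[1.0, 1.0], [1.0, 0.8]],
}
prototypes = compute_prototypes(support)
predicted = classify([0.1, 0.1], prototypes)  # nearest prototype is "alice"
```

In a few-shot setting, the prototypes are recomputed from the handful of enrollment utterances available for each new speaker, so no retraining is needed when the speaker set changes.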

Comments:	12 pages, 4 figures, 12 tables. Accepted for publication in IEEE TMM
Subjects:	Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2305.19541 [eess.AS]
	(or arXiv:2305.19541v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2305.19541
Related DOI:	https://doi.org/10.1109/TMM.2023.3253301

Submission history

From: Yanxiong Li
[v1] Wed, 31 May 2023 04:09:50 UTC (611 KB)
