Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Liu, Tianchi; Lee, Kong Aik; Wang, Qiongqiong; Li, Haizhou

arXiv:2312.03620 (eess)

[Submitted on 6 Dec 2023 (v1), last revised 24 Apr 2024 (this version, v3)]

Abstract: Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. These ResNet models treat the time and frequency dimensions equally: they follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram and conduct a systematic study of the impact of temporal and frequency resolutions on performance. From this study we identify two optimal points, namely Golden Gemini, which serve as a guiding principle for designing 2D ResNet-based speaker verification models. By following this principle, a state-of-the-art ResNet baseline model gains a significant performance improvement on the VoxCeleb, SITW, and CNCeleb datasets, with average EER/minDCF reductions of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, namely the Gemini DF-ResNet, using a cutting-edge model.
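To make the link between stride configuration and resolution concrete, the following Python sketch computes the frequency and temporal resolution left after a stack of strided stages. It is only an illustration under stated assumptions: four stages, each downsampling with a 3x3 convolution with padding 1, an input of 80 frequency bins by 200 frames, and two hypothetical stride lists. The function name `output_resolution` and both stride lists are invented for this example; the asymmetric list is not claimed to be one of the Golden Gemini operating points, which the abstract does not specify.

```python
from math import ceil


def output_resolution(n_freq, n_frames, strides):
    """Return (freq_bins, frames) after applying per-stage strides.

    `strides` is a list of (freq_stride, time_stride) pairs, one per stage.
    Assumption: each stage downsamples with a 3x3 convolution and padding 1,
    so a length n becomes floor((n - 1) / s) + 1, which equals ceil(n / s).
    """
    f, t = n_freq, n_frames
    for freq_stride, time_stride in strides:
        f = ceil(f / freq_stride)
        t = ceil(t / time_stride)
    return f, t


# Image-style configuration: identical strides on both axes (assumed default).
symmetric = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric configuration: one stage keeps the temporal
# resolution (time stride 1) while still halving the frequency axis.
asymmetric = [(1, 1), (2, 2), (2, 1), (2, 2)]

print(output_resolution(80, 200, symmetric))   # (10, 25)
print(output_resolution(80, 200, asymmetric))  # (10, 50)
```

Both configurations end with the same frequency resolution, but the asymmetric one retains twice as many frames. This is the kind of trade-off a search over the stride space would compare.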

Comments:	Accepted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. Open Access: this https URL
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2312.03620 [eess.AS] (or arXiv:2312.03620v3 [eess.AS] for this version)
DOI:	https://doi.org/10.48550/arXiv.2312.03620

Submission history

From: Tianchi Liu
[v1] Wed, 6 Dec 2023 17:08:49 UTC (8,209 KB)
[v2] Wed, 27 Mar 2024 15:37:26 UTC (11,256 KB)
[v3] Wed, 24 Apr 2024 17:29:11 UTC (11,256 KB)
