Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Gusev, Aleksei; Volokhov, Vladimir; Andzhukaev, Tseren; Novoselov, Sergey; Lavrentyeva, Galina; Volkova, Marina; Gazizullina, Alice; Shulipa, Andrey; Gorlanov, Artem; Avdeeva, Anastasia; Ivanov, Artem; Kozlov, Alexander; Pekhovsky, Timur; Matveev, Yuri

arXiv:2002.06033v1 (cs)

[Submitted on 14 Feb 2020]

Abstract: Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions, according to results obtained on early NIST SRE (Speaker Recognition Evaluation) datasets. From a practical point of view, given the growing interest in virtual assistants (such as Amazon Alexa, Google Home, Apple Siri, etc.), speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and most in-demand tasks. This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the degradation in system quality for short utterances. For these purposes, we considered deep neural network architectures based on TDNN (Time Delay Neural Network) and ResNet (Residual Neural Network) blocks. We experimented with state-of-the-art embedding extractors and their training procedures. The results confirm that ResNet architectures outperform the standard x-vector approach in speaker verification quality for both long-duration and short-duration utterances. We also investigate the impact of the speech activity detector, different scoring models, adaptation, and score normalization techniques. The experimental results are presented for publicly available data and verification protocols for the VoxCeleb1, VoxCeleb2, and VOiCES datasets.

Comments:	Submitted to Odyssey 2020
Subjects:	Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); Machine Learning (stat.ML)
Cite as:	arXiv:2002.06033 [cs.SD]
	(or arXiv:2002.06033v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2002.06033

Submission history

From: Sergey Novoselov
[v1] Fri, 14 Feb 2020 13:34:33 UTC (160 KB)
