-
Reverb: Open-Source ASR and Diarization from Rev
Authors:
Nishchal Bhandari,
Danny Chen,
Miguel Ángel del Río Fernández,
Natalie Delworth,
Jennifer Drexler Fox,
Migüel Jetté,
Quinten McNamara,
Corey Miller,
Ondřej Novotný,
Ján Profant,
Nan Qin,
Martin Ratajczak,
Jean-Philippe Robichaud
Abstract:
Today, we are open-sourcing our core speech recognition and diarization models for non-commercial use. We are releasing both a full production pipeline for developers and pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open-source speech recognition models across a variety of long-form speech recognition domains.
Submitted 24 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
BUT VOiCES 2019 System Description
Authors:
Hossein Zeinali,
Pavel Matějka,
Ladislav Mošner,
Oldřich Plchot,
Anna Silnova,
Ondřej Novotný,
Ján Profant,
Ondřej Glembek,
Lukáš Burget
Abstract:
This is a description of our effort in the VOiCES 2019 Speaker Recognition Challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is a 15% relative improvement. The open condition allowed us to use external data, which we used for PLDA adaptation, achieving less than ~10% relative improvement. In the submission to the open condition, we used 3 x-vector systems and one i-vector based system.
Submitted 13 July, 2019;
originally announced July 2019.
-
Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition
Authors:
Ondrej Novotny,
Oldrich Plchot,
Ondrej Glembek,
Lukas Burget
Abstract:
In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model and, at the same time, focus the model towards extraction of speaker-related information. We show that it is possible to represent a standard generative i-vector extractor by a model with significantly fewer parameters and obtain similar performance on SV tasks. We can further refine this compact model by discriminative training and obtain i-vectors that lead to better performance on various SV benchmarks representing different acoustic domains.
Submitted 5 April, 2019;
originally announced April 2019.
-
Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition
Authors:
Ondrej Novotny,
Oldrich Plchot,
Ondrej Glembek,
Jan "Honza" Cernocky,
Lukas Burget
Abstract:
In this work, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker verification (SV) system. We start by carefully designing a data augmentation process to cover a wide range of acoustic conditions and obtain rich training data for various components of our SV system. We augment several well-known databases used in SV with artificially noised and reverberated data, and we use them to train a denoising autoencoder (mapping noisy and reverberated speech to its clean version) as well as an x-vector extractor, which is currently considered state of the art in SV. Later, we use the autoencoder as a preprocessing step for a text-independent SV system. We compare results achieved with autoencoder enhancement, multi-condition PLDA training and their simultaneous use. We present a detailed analysis with various conditions of NIST SRE 2010, 2016, PRISM and with re-transmitted data. We conclude that the proposed preprocessing can significantly improve both i-vector and x-vector baselines and that this technique can be used to build a robust SV system for various target domains.
Submitted 19 November, 2018;
originally announced November 2018.
-
On the use of DNN Autoencoder for Robust Speaker Recognition
Authors:
Ondrej Novotny,
Oldrich Plchot,
Pavel Matejka,
Ondrej Glembek
Abstract:
In this paper, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker recognition system. We started by augmenting the Fisher database with artificially noised and reverberated data, and we trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a preprocessing step for a state-of-the-art text-independent speaker recognition system. We compare results achieved with pure autoencoder enhancement, multi-condition PLDA training and their simultaneous use. We present a detailed analysis with various conditions of NIST SRE 2010, PRISM and an artificially corrupted NIST SRE 2010 telephone condition. We conclude that the proposed preprocessing significantly outperforms the baseline and that this technique can be used to build a robust speaker recognition system for reverberated and noisy data.
Submitted 7 November, 2018;
originally announced November 2018.
-
Discriminatively Re-trained i-vector Extractor for Speaker Recognition
Authors:
Ondrej Novotny,
Oldrich Plchot,
Ondrej Glembek,
Lukas Burget,
Pavel Matejka
Abstract:
In this work we revisit discriminative training of the i-vector extractor component in the standard speaker verification (SV) system. The motivation of our research lies in the robustness and stability of this large generative model, which we want to preserve, and focus its power towards any intended SV task. We show that after generative initialization of the i-vector extractor, we can further refine it with discriminative training and obtain i-vectors that lead to better performance on various benchmarks representing different acoustic domains.
Submitted 31 October, 2018;
originally announced October 2018.