-
STCON System for the CHiME-8 Challenge
Authors:
Anton Mitrofanov,
Tatiana Prisyach,
Tatiana Timofeeva,
Sergei Novoselov,
Maxim Korenevsky,
Yuri Khokhlov,
Artem Akulov,
Alexander Anikin,
Roman Khalili,
Iurii Lezhenin,
Aleksandr Melnikov,
Dmitriy Miroshnichenko,
Nikita Mamaev,
Ilya Odegov,
Olga Rudnitskaya,
Aleksei Romanenko
Abstract:
This paper describes the STCON system for the CHiME-8 Challenge Task 1 (DASR), which targets distant automatic speech transcription and diarization with multiple recording devices. We focused on a carefully trained and tuned diarization pipeline and on speaker counting. This allowed us to significantly reduce the diarization error rate (DER) and obtain more reliable segments for speech separation and recognition. To improve source separation, we designed a Guided Target speaker Extraction (G-TSE) model and used it in conjunction with the traditional Guided Source Separation (GSS) method. To train various parts of our pipeline, we investigated several data augmentation and generation techniques, which helped us improve the overall system quality.
Submitted 17 October, 2024;
originally announced October 2024.
-
LT-LM: a novel non-autoregressive language model for single-shot lattice rescoring
Authors:
Anton Mitrofanov,
Mariya Korenevskaya,
Ivan Podluzhny,
Yuri Khokhlov,
Aleksandr Laptev,
Andrei Andrusenko,
Aleksei Ilin,
Maxim Korenevsky,
Ivan Medennikov,
Aleksei Romanenko
Abstract:
Neural network-based language models are commonly used in rescoring approaches to improve the quality of modern automatic speech recognition (ASR) systems. Most existing methods are computationally expensive because they use autoregressive language models. We propose a novel rescoring approach that processes the entire lattice in a single call to the model. The key feature of our rescoring policy is a novel non-autoregressive Lattice Transformer Language Model (LT-LM). This model takes the whole lattice as input and predicts a new language score for each arc. Additionally, we propose an artificial lattice generation approach to incorporate a large amount of text data into LT-LM training. In our experiments, single-shot rescoring is orders of magnitude faster than other rescoring methods: it is more than 300 times faster than pruned RNNLM lattice rescoring and N-best rescoring, while being slightly inferior in terms of word error rate (WER).
Submitted 6 April, 2021;
originally announced April 2021.
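The single-shot idea in the abstract above (one new language score per arc, produced for the whole lattice at once) can be illustrated with a minimal Python sketch. It does not implement LT-LM: the list `new_lm_costs` is a hand-written stand-in for the model's output, and the arc tuple layout, the function names `rescore_lattice` and `best_path`, and all score values are assumptions made for illustration. Scores are treated as costs (lower is better), and states are assumed to be numbered in topological order.

```python
# Hypothetical arc layout: (src_state, dst_state, word, acoustic_cost, lm_cost).
# States are assumed to be integers numbered topologically (src < dst).

def rescore_lattice(arcs, new_lm_costs):
    """Replace the LM cost of every arc in one pass (one cost per arc)."""
    if len(arcs) != len(new_lm_costs):
        raise ValueError("need exactly one new LM cost per arc")
    return [(s, d, w, ac, new)
            for (s, d, w, ac, _old), new in zip(arcs, new_lm_costs)]


def best_path(arcs, start, final, lm_weight=1.0):
    """Lowest-cost path through an acyclic lattice via dynamic programming."""
    best = {start: (0.0, [])}  # state -> (total cost, words so far)
    # Visiting arcs by source state is valid because src < dst for every arc.
    for s, d, w, ac, lm in sorted(arcs, key=lambda a: a[0]):
        if s not in best:
            continue  # source state is unreachable from the start state
        cost = best[s][0] + ac + lm_weight * lm
        if d not in best or cost < best[d][0]:
            best[d] = (cost, best[s][1] + [w])
    return best[final]


# Toy lattice with two competing words at each of two positions.
arcs = [
    (0, 1, "recognize", 1.0, 2.0),
    (0, 1, "wreck", 0.8, 2.0),
    (1, 2, "speech", 1.0, 1.0),
    (1, 2, "beach", 0.9, 1.0),
]
# Stand-in for the per-arc scores a lattice-level model would predict.
new_lm_costs = [0.5, 3.0, 0.5, 3.0]

before = best_path(arcs, start=0, final=2)
after = best_path(rescore_lattice(arcs, new_lm_costs), start=0, final=2)
print(before[1], after[1])
```

Because every arc receives its score in one pass, the best path is found with a single dynamic-programming sweep, in contrast to autoregressive rescoring, which has to query the language model repeatedly as hypotheses are extended.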
-
Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario
Authors:
Ivan Medennikov,
Maxim Korenevsky,
Tatiana Prisyach,
Yuri Khokhlov,
Mariya Korenevskaya,
Ivan Sorokin,
Tatiana Timofeeva,
Anton Mitrofanov,
Andrei Andrusenko,
Ivan Podluzhny,
Aleksandr Laptev,
Aleksei Romanenko
Abstract:
Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to their limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts the activity of each speaker in each time frame. The TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces the activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization. We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute in Diarization Error Rate (DER).
Submitted 27 July, 2020; v1 submitted 14 May, 2020;
originally announced May 2020.
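To make the per-speaker, per-frame output described in the abstract above concrete, here is a minimal Python sketch that turns frame-level activity probabilities into speaker segments. The abstract does not say which post-processing strategies were used, so the median smoothing and fixed threshold below are assumptions chosen only for illustration, as are the function names `smooth` and `to_segments` and the toy probabilities.

```python
from statistics import median


def smooth(probs, width=3):
    """Median-filter one speaker's frame probabilities (width must be odd)."""
    if not probs:
        return []
    half = width // 2
    # Repeat the edge values so every window has exactly `width` frames.
    padded = [probs[0]] * half + list(probs) + [probs[-1]] * half
    return [median(padded[i:i + width]) for i in range(len(probs))]


def to_segments(probs, threshold=0.5):
    """Return (start_frame, end_frame) pairs, end exclusive, where p >= threshold."""
    segments, start = [], None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(probs)))
    return segments


# Toy output of a per-speaker binary classifier: one probability per frame.
speaker_probs = {
    "spk1": [0.1, 0.9, 0.2, 0.9, 0.9, 0.1],
    "spk2": [0.8, 0.8, 0.7, 0.9, 0.1, 0.1],
}
# Each speaker is decoded independently, so segments may overlap in time.
diarization = {spk: to_segments(smooth(p)) for spk, p in speaker_probs.items()}
print(diarization)
```

Since each speaker has an independent binary decision per frame, two speakers can be active in the same frames, which is how this style of output represents overlapping speech.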
-
Exploring End-to-End Techniques for Low-Resource Speech Recognition
Authors:
Vladimir Bataev,
Maxim Korenevsky,
Ivan Medennikov,
Alexander Zatvornitskiy
Abstract:
In this work we present a simple grapheme-based system for low-resource speech recognition using Babel data for Turkish spontaneous speech (80 hours). We investigate the performance of different neural network architectures, including fully convolutional, recurrent, and ResNet with GRU. Different features and normalization techniques are compared as well. We also propose a modification of the CTC loss that uses segmentation during training, which improves decoding with a small beam size. Our best model achieves a word error rate of 45.8%, which, to the best of our knowledge, is the best reported result for end-to-end systems using in-domain data for this task.
Submitted 2 July, 2018;
originally announced July 2018.
-
Investigation of Using VAE for i-Vector Speaker Verification
Authors:
Timur Pekhovsky,
Maxim Korenevsky
Abstract:
A new system for i-vector speaker recognition based on a variational autoencoder (VAE) is investigated. The VAE is a promising approach for developing accurate deep nonlinear generative models of complex data. Experiments show that the VAE provides speaker embeddings and can be effectively trained in an unsupervised manner. A log-likelihood ratio (LLR) estimate for the VAE is developed, and experiments on NIST SRE 2010 data demonstrate its correctness. Additionally, we show that the performance of the VAE-based system in the i-vector space is close to that of diagonal PLDA. Several interesting results are also observed in the experiments with $β$-VAE. In particular, we found that for $β\ll 1$, the VAE can be trained to capture the features of complex input data distributions effectively, which is hard to achieve with the standard VAE ($β=1$).
Submitted 25 May, 2017;
originally announced May 2017.
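For reference, the role of $β$ mentioned in the abstract above can be read from the $β$-VAE training objective as it is commonly written in the literature. The formulation used in the paper itself may differ in detail, and the symbols here are assumptions: $x$ is the input (an i-vector in this setting), $z$ the latent variable, $q_φ(z|x)$ the encoder, $p_θ(x|z)$ the decoder, and $p(z)$ the prior.

$$
\mathcal{L}_β(θ, φ; x) = \mathbb{E}_{q_φ(z|x)}\left[\log p_θ(x|z)\right] - β\, \mathrm{KL}\left(q_φ(z|x)\,\|\,p(z)\right)
$$

Setting $β=1$ recovers the standard VAE evidence lower bound. For $β\ll 1$ the KL term is down-weighted, so the reconstruction term dominates training.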