-
Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Authors:
Sara Barahona,
Anna Silnova,
Ladislav Mošner,
Junyi Peng,
Oldřich Plchot,
Johan Rohdin,
Lin Zhang,
Jiangyu Han,
Petr Palka,
Federico Landini,
Lukáš Burget,
Themos Stafylakis,
Sandro Cumani,
Dominik Boboš,
Miroslav Hlavaček,
Martin Kodovsky,
Tomáš Pavlíček
Abstract:
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, as well as a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our experiments show practical recipes for developing state-of-the-art frontends for speaker recognition.
Submitted 21 May, 2025;
originally announced May 2025.
-
Leveraging Self-Supervised Learning for Speaker Diarization
Authors:
Jiangyu Han,
Federico Landini,
Johan Rohdin,
Anna Silnova,
Mireia Diez,
Lukas Burget
Abstract:
End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application to speaker diarization remains somewhat limited. In this work, we explore using WavLM to alleviate the problem of data scarcity for neural diarization training. We use the same pipeline as Pyannote and improve the local end-to-end neural diarization with WavLM and Conformer. Experiments on far-field AMI, AISHELL-4, and AliMeeting datasets show that our method substantially outperforms the Pyannote baseline and achieves new state-of-the-art results on AMI and AISHELL-4. In addition, by analyzing the system performance under different data quantity scenarios, we show that WavLM representations are much more robust against data scarcity than filterbank features, enabling less data-hungry training strategies. Furthermore, we found that simulated data, usually used to train end-to-end diarization models, does not help when using WavLM in our experiments. We also evaluate our model on the recent CHiME8 NOTSOFAR-1 task, where it achieves better performance than the Pyannote baseline. Our source code is publicly available at https://github.com/BUTSpeechFIT/DiariZen.
Submitted 21 October, 2024; v1 submitted 14 September, 2024;
originally announced September 2024.
-
BUT Systems and Analyses for the ASVspoof 5 Challenge
Authors:
Johan Rohdin,
Lin Zhang,
Oldřich Plchot,
Vojtěch Staněk,
David Mihola,
Junyi Peng,
Themos Stafylakis,
Dmitriy Beveraki,
Anna Silnova,
Jan Brukner,
Lukáš Burget
Abstract:
This paper describes the BUT submitted systems for the ASVspoof 5 challenge, along with analyses. For the conventional deepfake detection task, we use ResNet18 and self-supervised models for the closed and open conditions, respectively. In addition, we analyze and visualize different combinations of speaker information and spoofing information as label schemes for training. For spoofing-robust automatic speaker verification (SASV), we introduce effective priors and propose using logistic regression to jointly train affine transformations of the countermeasure scores and the automatic speaker verification scores in such a way that the SASV LLR is optimized.
Submitted 20 August, 2024;
originally announced August 2024.
-
DiaCorrect: Error Correction Back-end For Speaker Diarization
Authors:
Jiangyu Han,
Federico Landini,
Johan Rohdin,
Mireia Diez,
Lukas Burget,
Yuhang Cao,
Heng Lu,
Jan Cernocky
Abstract:
In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transformer-based decoder. By exploiting the interactions between the input recording and the initial system's outputs, DiaCorrect can automatically correct the initial speaker activities to minimize the diarization errors. Experiments on 2-speaker telephony data show that the proposed DiaCorrect can effectively improve the initial model's results. Our source code is publicly available at https://github.com/BUTSpeechFIT/diacorrect.
Submitted 15 September, 2023;
originally announced September 2023.
-
Speaker embeddings by modeling channel-wise correlations
Authors:
Themos Stafylakis,
Johan Rohdin,
Lukas Burget
Abstract:
Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first- and second-order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper, we examine an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statistics. The method is inspired by style-transfer methods in computer vision, where the style of an image, modeled by the matrix of channel-wise correlations, is transferred to another image in order to produce a new image having the style of the first and the content of the second. By drawing analogies between image style and speaker characteristics, and between image content and phonetic sequence, we explore the use of such channel-wise correlation features to train a ResNet architecture in an end-to-end fashion. Our experiments on VoxCeleb demonstrate the effectiveness of the proposed pooling method in speaker recognition.
Submitted 7 July, 2021; v1 submitted 6 April, 2021;
originally announced April 2021.
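The pooling idea in the abstract above (pairwise correlations between channels at each frequency, computed along the time axis) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `channel_correlation_pooling`, the nested-list layout `feats[channel][frequency][time]`, the use of Pearson correlation, and the choice to keep only the upper triangle of each correlation matrix are all assumptions made for illustration.

```python
import math


def channel_correlation_pooling(feats, eps=1e-8):
    """Pool a feature map over time using channel-wise correlations.

    feats[c][f][t] is the activation of channel c at frequency bin f and
    frame t (hypothetical layout). For every frequency bin, the Pearson
    correlation between each pair of channels is computed over time, and
    the upper triangle (c1 < c2) is appended to one flat output vector.
    """
    n_ch = len(feats)
    n_freq = len(feats[0])
    pooled = []
    for f in range(n_freq):
        # Centre each channel's time series and scale it to unit norm.
        normed = []
        for c in range(n_ch):
            x = feats[c][f]
            mean = sum(x) / len(x)
            centered = [v - mean for v in x]
            norm = math.sqrt(sum(v * v for v in centered)) + eps
            normed.append([v / norm for v in centered])
        # The dot product of two unit-norm, centred series is their correlation.
        for c1 in range(n_ch):
            for c2 in range(c1 + 1, n_ch):
                pooled.append(sum(a * b for a, b in zip(normed[c1], normed[c2])))
    return pooled


# Two channels, two frequency bins, three frames (toy values).
toy = [
    [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],   # channel 0
    [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]],   # channel 1
]
stats = channel_correlation_pooling(toy)
```

The output length is `n_freq * n_ch * (n_ch - 1) / 2`, independent of the number of frames, which is what makes the statistic usable as a fixed-size input to a following linear layer.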
-
Analysis of the BUT Diarization System for VoxConverse Challenge
Authors:
Federico Landini,
Ondřej Glembek,
Pavel Matějka,
Johan Rohdin,
Lukáš Burget,
Mireia Diez,
Anna Silnova
Abstract:
This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclustering step based on per-speaker global embeddings, and overlapped speech detection and handling. We provide comparisons for each of the steps and share the implementation of the most relevant modules of our system. Our system scored second in the challenge in terms of the primary metric (diarization error rate) and first according to the secondary metric (Jaccard error rate).
Submitted 9 February, 2021; v1 submitted 22 October, 2020;
originally announced October 2020.
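The initial agglomerative hierarchical clustering step named in the abstract above can be sketched as follows. This is a generic illustration, not the BUT system: the functions `cosine` and `ahc`, the use of cosine similarity with average linkage, and the stopping threshold are assumptions, and the embeddings are toy two-dimensional vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def ahc(embeddings, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per embedding and repeatedly merges the two
    most similar clusters until the best similarity drops below
    `threshold`. Returns one integer cluster label per embedding.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise similarity between members of the two clusters.
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = linkage(clusters[a], clusters[b])
                if s > best_sim:
                    best_sim, best_pair = s, (a, b)
        if best_sim < threshold:
            break  # No remaining pair is similar enough to merge.
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Four toy segment embeddings from two well-separated "speakers".
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ahc(segments, threshold=0.5)
```

In a pipeline like the one described, such labels would serve only as a starting point that a later stage refines; the sketch stops at the clustering itself.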
-
Probabilistic embeddings for speaker diarization
Authors:
Anna Silnova,
Niko Brümmer,
Johan Rohdin,
Themos Stafylakis,
Lukáš Burget
Abstract:
Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix, thus providing a path for the propagation of information about the quality of the speech segment into a PLDA scoring backend. These precisions quantify the uncertainty about what the values of the embeddings might have been if they had been extracted from high quality speech segments. The proposed probabilistic embeddings (x-vectors with precisions) are interfaced with the PLDA model by treating the x-vectors as hidden variables and marginalizing them out. We apply the proposed probabilistic embeddings as input to an agglomerative hierarchical clustering (AHC) algorithm to do diarization on the DIHARD'19 evaluation set. We compute the full PLDA likelihood 'by the book' for each clustering hypothesis that is considered by AHC. We do joint discriminative training of the PLDA parameters and of the probabilistic x-vector extractor. We demonstrate accuracy gains relative to a baseline AHC algorithm that is applied to traditional x-vectors (without uncertainty) and uses averaging of binary log-likelihood-ratios rather than by-the-book scoring.
Submitted 6 November, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Detecting Spoofing Attacks Using VGG and SincNet: BUT-Omilia Submission to ASVspoof 2019 Challenge
Authors:
Hossein Zeinali,
Themos Stafylakis,
Georgia Athanasopoulou,
Johan Rohdin,
Ioannis Gkinis,
Lukáš Burget,
Jan "Honza" Černocký
Abstract:
In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia -- Conversational Intelligence for the ASVspoof 2019 Spoofing and Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), our primary system is a fusion of VGG and the recently introduced SincNet architecture. The results on PA show that the proposed networks yield very competitive performance in all conditions and achieve an 86% relative improvement compared to the official baseline. On the other hand, the results on LA show that although the proposed architecture and training strategy perform very well on certain spoofing attacks, they fail to generalize to certain attacks that are unseen during training.
Submitted 13 July, 2019;
originally announced July 2019.
-
Self-supervised speaker embeddings
Authors:
Themos Stafylakis,
Johan Rohdin,
Oldrich Plchot,
Petr Mizera,
Lukas Burget
Abstract:
Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances in training. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the inferred embedding of another speech segment of the same utterance. We do this by attaching to the standard speaker embedding extractor a decoder network, which we feed not merely with the speaker embedding, but also with the estimated phone sequence of the target frame sequence. The reconstruction loss can be used either as a single objective, or be combined with the standard speaker classification loss. In the latter case, it acts as a regularizer, encouraging generalizability to speakers unseen during training. In all cases, the proposed architectures are trained from scratch and in an end-to-end fashion. We demonstrate the benefits of the proposed approach on VoxCeleb and Speakers in the Wild, and we report notable improvements over the baseline.
Submitted 23 April, 2019; v1 submitted 6 April, 2019;
originally announced April 2019.
-
Speaker verification using end-to-end adversarial language adaptation
Authors:
Johan Rohdin,
Themos Stafylakis,
Anna Silnova,
Hossein Zeinali,
Lukas Burget,
Oldrich Plchot
Abstract:
In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target domains (i.e. languages), while preserving their capacity in recognizing speakers. Neural architectures for extracting utterance-level representations enable us to apply adversarial adaptation methods in an end-to-end fashion and train the network jointly with the standard cross-entropy loss. We examine several configurations, such as the use of (pseudo-)labels on the target domain as well as domain labels in the feature extractor, and we demonstrate the effectiveness of our method on the challenging NIST SRE16 and SRE18 benchmarks.
Submitted 6 November, 2018;
originally announced November 2018.
-
How to Improve Your Speaker Embeddings Extractor in Generic Toolkits
Authors:
Hossein Zeinali,
Lukas Burget,
Johan Rohdin,
Themos Stafylakis,
Jan Cernocky
Abstract:
Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper, we aim to facilitate its implementation in a more generic toolkit than Kaldi, which we anticipate will enable further improvements to the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, different methods for preventing overfitting, as well as alternative non-linearities that can be used instead of Rectified Linear Units. In addition, we investigate the difference in performance between TDNN and CNN, and between two types of attention mechanism. Experimental results on Speakers in the Wild, SRE 2016 and SRE 2018 datasets demonstrate the effectiveness of the proposed implementation.
Submitted 5 November, 2018;
originally announced November 2018.
-
End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA
Authors:
Johan Rohdin,
Anna Silnova,
Mireia Diez,
Oldrich Plchot,
Pavel Matejka,
Lukas Burget
Abstract:
Recently, several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector + PLDA baseline. The system is then further trained in an end-to-end manner but regularized so that it does not deviate too far from the initial system. In this way, we mitigate overfitting, which normally limits the performance of end-to-end systems. The proposed system outperforms the i-vector + PLDA baseline on both long and short duration utterances.
Submitted 8 January, 2018; v1 submitted 6 October, 2017;
originally announced October 2017.
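The abstract above says the end-to-end system is regularized so that it does not deviate too far from its initialization, without stating the form of the regularizer. One common way to express such a constraint is a squared-distance penalty between the current and initial parameters; the sketch below assumes that form. The function name `regularized_loss`, the flat parameter lists, and the `weight` hyperparameter are hypothetical.

```python
def regularized_loss(task_loss, params, init_params, weight):
    """Add a penalty that grows as parameters move away from their initial values.

    task_loss   -- the verification loss on the current batch (a float)
    params      -- current parameter values, flattened into a list
    init_params -- values the parameters had at initialization
    weight      -- strength of the pull towards the initial system (assumed)
    """
    penalty = sum((p - p0) ** 2 for p, p0 in zip(params, init_params))
    return task_loss + weight * penalty


# At initialization the penalty vanishes, so only the task loss remains.
loss_at_init = regularized_loss(1.0, [1.0, 2.0], [1.0, 2.0], weight=0.5)
# After the parameters drift, the squared distance is added to the loss.
loss_after_drift = regularized_loss(1.0, [2.0, 2.0], [1.0, 0.0], weight=0.5)
```

With a large `weight` the trained system stays close to the baseline it was initialized from; with `weight` set to zero the penalty disappears and training is unconstrained.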