Showing 1–50 of 58 results for author: Burget, L

Searching in archive eess.
  1. arXiv:2505.24111  [pdf, other]

    eess.AS

    Fine-tune Before Structured Pruning: Towards Compact and Accurate Self-Supervised Models for Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Jan Cernocky, Lukas Burget

    Abstract: Self-supervised learning (SSL) models like WavLM can be effectively utilized when building speaker diarization systems but are often large and slow, limiting their use in resource-constrained scenarios. Previous studies have explored compression techniques, but usually at the price of degraded performance at high pruning ratios. In this work, we propose to compress SSL models through structured p…

    Submitted 29 May, 2025; originally announced May 2025.

    Comments: Accepted by INTERSPEECH 2025

  2. arXiv:2505.15320  [pdf, ps, other]

    eess.AS cs.SD

    Analysis of ABC Frontend Audio Systems for the NIST-SRE24

    Authors: Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky, Tomáš Pavlíček

    Abstract: We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the p…

    Submitted 21 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  3. arXiv:2501.00114  [pdf, other]

    eess.AS cs.SD

    DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

    Authors: Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

    Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW e…

    Submitted 30 December, 2024; originally announced January 2025.

  4. arXiv:2411.02165  [pdf, other]

    eess.AS cs.SD

    Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

    Authors: Petr Pálka, Federico Landini, Dominik Klement, Mireia Diez, Anna Silnova, Marc Delcroix, Lukáš Burget

    Abstract: In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independen…

    Submitted 4 November, 2024; originally announced November 2024.

  5. arXiv:2410.17437  [pdf, other]

    eess.AS

    Improving Automatic Speech Recognition with Decoder-Centric Regularisation in Encoder-Decoder Models

    Authors: Alexander Polok, Santosh Kesiraju, Karel Beneš, Lukáš Burget, Jan Černocký

    Abstract: This paper proposes a simple yet effective way of regularising encoder-decoder-based automatic speech recognition (ASR) models that enhances the robustness of the model and improves generalisation to out-of-domain scenarios. The proposed approach is dubbed Decoder-Centric Regularisation in Encoder-Decoder (DeCRED) architecture for ASR…

    Submitted 22 October, 2024; originally announced October 2024.

  6. arXiv:2410.02364  [pdf, ps, other]

    eess.AS

    State-of-the-art Embeddings with Video-free Segmentation of the Source VoxCeleb Data

    Authors: Sara Barahona, Ladislav Mošner, Themos Stafylakis, Oldřich Plchot, Junyi Peng, Lukáš Burget, Jan Černocký

    Abstract: In this paper, we refine and validate our method for training speaker embedding extractors using weak annotations. More specifically, we use only the audio stream of the source VoxCeleb videos and the names of the celebrities without knowing the time intervals in which they appear in the recording. We experiment with hyperparameters and embedding extractors based on ResNet and WavLM. We show that…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  7. arXiv:2409.15234  [pdf, other]

    eess.AS cs.SD

    CA-MHFA: A Context-Aware Multi-Head Factorized Attentive Pooling for SSL-Based Speaker Verification

    Authors: Junyi Peng, Ladislav Mošner, Lin Zhang, Oldřich Plchot, Themos Stafylakis, Lukáš Burget, Jan Černocký

    Abstract: Self-supervised learning (SSL) models for speaker verification (SV) have gained significant attention in recent years. However, existing SSL-based SV systems often struggle to capture local temporal dependencies and generalize across different tasks. In this paper, we propose context-aware multi-head factorized attentive pooling (CA-MHFA), a lightweight framework that incorporates contextual infor…

    Submitted 23 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  8. arXiv:2409.09543  [pdf, other]

    eess.AS cs.SD

    Target Speaker ASR with Whisper

    Authors: Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

    Abstract: We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization outpu…

    Submitted 16 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted to ICASSP 2025

  9. arXiv:2409.09408  [pdf, other]

    eess.AS cs.SD

    Leveraging Self-Supervised Learning for Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Anna Silnova, Mireia Diez, Lukas Burget

    Abstract: End-to-end neural diarization has evolved considerably over the past few years, but data scarcity is still a major obstacle for further improvements. Self-supervised learning methods such as WavLM have shown promising performance on several downstream tasks, but their application to speaker diarization is somewhat limited. In this work, we explore using WavLM to alleviate the problem of data scarci…

    Submitted 21 October, 2024; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025; New results are updated but conclusions are exactly the same as the original one

  10. arXiv:2408.11152  [pdf, other]

    cs.SD eess.AS

    BUT Systems and Analyses for the ASVspoof 5 Challenge

    Authors: Johan Rohdin, Lin Zhang, Oldřich Plchot, Vojtěch Staněk, David Mihola, Junyi Peng, Themos Stafylakis, Dmitriy Beveraki, Anna Silnova, Jan Brukner, Lukáš Burget

    Abstract: This paper describes the BUT submitted systems for the ASVspoof 5 challenge, along with analyses. For the conventional deepfake detection task, we use ResNet18 and self-supervised models for the closed and open conditions, respectively. In addition, we analyze and visualize different combinations of speaker information and spoofing information as label schemes for training. For spoofing-robust aut…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 8 pages, ASVspoof 5 Workshop (Interspeech2024 Satellite)

  11. arXiv:2406.12622  [pdf, ps, other]

    eess.AS

    Challenging margin-based speaker embedding extractors by using the variational information bottleneck

    Authors: Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldrich Plchot, Lukas Burget

    Abstract: Speaker embedding extractors are typically trained using a classification loss over the training speakers. During the last few years, the standard softmax/cross-entropy loss has been replaced by the margin-based losses, yielding significant improvements in speaker recognition accuracy. Motivated by the fact that the margin merely reduces the logit of the target speaker during training, we consider…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  12. arXiv:2403.07767  [pdf, ps, other]

    eess.AS cs.LG eess.SP

    Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets

    Authors: Jan Pešán, Santosh Kesiraju, Lukáš Burget, Jan "Honza" Černocký

    Abstract: Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets is seldom scrutinized for text-dependency. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuine…

    Submitted 18 October, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

  13. arXiv:2402.19325  [pdf, other]

    cs.SD eess.AS

    Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

    Authors: Lin Zhang, Themos Stafylakis, Federico Landini, Mireia Diez, Anna Silnova, Lukáš Burget

    Abstract: In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, vector representations of speakers in a conversation. Our analysis shows that attractors do not necessarily have to contain speaker characteristi…

    Submitted 20 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted to Odyssey 2024. This arXiv version includes an appendix for more visualizations. Code: https://github.com/BUTSpeechFIT/EENDEDA_VIB

  14. arXiv:2312.04324  [pdf, other]

    eess.AS cs.SD

    DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

    Authors: Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

    Abstract: Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-…

    Submitted 1 June, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  15. arXiv:2310.02732  [pdf, ps, other]

    eess.AS cs.SD

    Discriminative Training of VBx Diarization

    Authors: Dominik Klement, Mireia Diez, Federico Landini, Lukáš Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara

    Abstract: Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework fo…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  16. arXiv:2309.08377  [pdf, other]

    eess.AS cs.CL cs.SD

    DiaCorrect: Error Correction Back-end For Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

    Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  17. arXiv:2305.13580  [pdf, other]

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC thus generates multiple streams of embeddi…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  18. arXiv:2305.12579  [pdf, other]

    cs.CL cs.SD eess.AS

    Hystoc: Obtaining word confidences for fusion of end-to-end ASR systems

    Authors: Karel Beneš, Martin Kocour, Lukáš Burget

    Abstract: End-to-end (e2e) systems have recently gained wide popularity in automatic speech recognition. However, these systems generally do not provide well-calibrated word-level confidences. In this paper, we propose Hystoc, a simple method for obtaining word-level confidences from hypothesis-level scores. Hystoc is an iterative alignment procedure which turns hypotheses from an n-best output of the ASR s…

    Submitted 21 May, 2023; originally announced May 2023.

  19. arXiv:2305.10517  [pdf, other]

    eess.AS

    Improving Speaker Verification with Self-Pretrained Transformer Models

    Authors: Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models a…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  20. arXiv:2211.06750  [pdf, other]

    eess.AS cs.SD

    Multi-Speaker and Wide-Band Simulated Conversations as Training Data for End-to-End Neural Diarization

    Authors: Federico Landini, Mireia Diez, Alicia Lozano-Diez, Lukáš Burget

    Abstract: End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed but all of them require (so far non-existing) large amounts of annotated data for training. The compromise solution consists in generating synthetic data and the recently proposed…

    Submitted 24 February, 2023; v1 submitted 12 November, 2022; originally announced November 2022.

    Comments: Accepted by ICASSP 2023

  21. arXiv:2211.01756  [pdf, other]

    eess.AS cs.SD

    Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

    Authors: Sofoklis Kakouros, Themos Stafylakis, Ladislav Mosner, Lukas Burget

    Abstract: When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech enabling state-of-the-art results in many downstream tasks including emotion recognit…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to IEEE-ICASSP 2023

  22. arXiv:2210.16032  [pdf, other]

    eess.AS cs.SD eess.SP

    Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

    Authors: Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, pre-trained Transformer models have received rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a compreh…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: submitted to ICASSP2023

  23. arXiv:2210.15441  [pdf, ps, other]

    cs.SD eess.AS stat.ML

    Toroidal Probabilistic Spherical Discriminant Analysis

    Authors: Anna Silnova, Niko Brümmer, Albert Swart, Lukáš Burget

    Abstract: In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring back-ends are commonly used, namely cosine scoring and PLDA. We have recently proposed PSDA, an analog to PLDA that uses Von Mises-Fisher distributions instead of Gaussians. In this paper, we present toroidal PSDA (T-PSDA). It extends PSDA with the ability to model within and between-speaker…

    Submitted 27 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  24. arXiv:2210.09513  [pdf, other]

    eess.AS cs.SD

    Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

    Authors: Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

    Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alte…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Accepted at IEEE-SLT 2022

  25. arXiv:2210.01273  [pdf, other]

    eess.AS

    An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

    Authors: Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

    Abstract: In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various down-stream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT2022

  26. arXiv:2204.00890  [pdf, other]

    eess.AS cs.SD

    From Simulated Mixtures to Simulated Conversations as Training Data for End-to-End Neural Diarization

    Authors: Federico Landini, Alicia Lozano-Diez, Mireia Diez, Lukáš Burget

    Abstract: End-to-end neural diarization (EEND) is nowadays one of the most prominent research topics in speaker diarization. EEND presents an attractive alternative to standard cascaded diarization systems since a single system is trained at once to deal with the whole diarization problem. Several EEND variants and approaches are being proposed; however, all these models require large amounts of annotated d…

    Submitted 25 June, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

    Comments: Accepted at Interspeech 2022

  27. arXiv:2204.00770  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Speaker adaptation for Wav2vec2 based dysarthric ASR

    Authors: Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Diez, Tim Polzehl, Lukáš Burget, Jan "Honza" Černocký

    Abstract: Dysarthric speech recognition has posed major challenges due to lack of training data and heavy mismatch in speaker characteristics. Recent ASR systems have benefited from readily available pretrained models such as wav2vec2 to improve the recognition performance. Speaker adaptation using fMLLR and xvectors has provided major gains for dysarthric speech with very little adaptation data. However,…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  28. arXiv:2203.15436  [pdf, other]

    eess.AS

    Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

    Authors: Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we are using the full VoxCeleb recordings and the name of the celebrities appearing on each video without knowledge of the time intervals the celebrities appear in the video. We show that by combining a baseline speaker diarization algorithm that requires no training or parame…

    Submitted 9 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at Interspeech 2022

  29. arXiv:2203.10300  [pdf, other]

    eess.AS

    Analyzing speaker verification embedding extractors and back-ends under language and channel mismatch

    Authors: Anna Silnova, Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Pavel Matejka, Lukas Burget, Ondrej Glembek, Niko Brummer

    Abstract: In this paper, we analyze the behavior and performance of speaker embeddings and the back-end scoring model under domain and language mismatch. We present our findings regarding ResNet-based speaker embedding architectures and show that reduced temporal stride yields improved performance. We then consider a PLDA back-end and show how a combination of small speaker subspace, language-dependent PLDA…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Submitted to Odyssey 2022, under review

  30. arXiv:2112.13520  [pdf, other]

    eess.AS

    DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction

    Authors: Jiangyu Han, Yanhua Long, Lukas Burget, Jan Cernocky

    Abstract: In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and wide domain coverage tasks. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions.…

    Submitted 29 January, 2022; v1 submitted 27 December, 2021; originally announced December 2021.

    Comments: accepted by ICASSP 2022

  31. arXiv:2111.06458  [pdf, other]

    eess.AS cs.LG cs.SD

    MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

    Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký

    Abstract: Motivated by the unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can also be readily used for experiments with dereverberation, denoising, and speech enhancement. We tackled the ever-present problem o…

    Submitted 11 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022

  32. arXiv:2111.00009  [pdf, other]

    eess.AS cs.LG cs.SD

    Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

    Authors: Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký

    Abstract: In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied on each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the aco…

    Submitted 15 April, 2022; v1 submitted 31 October, 2021; originally announced November 2021.

    Comments: submitted to Interspeech 2022

  33. arXiv:2107.06155  [pdf, other]

    cs.CL cs.SD eess.AS

    The IWSLT 2021 BUT Speech Translation Systems

    Authors: Hari Krishna Vydana, Martin Karafiát, Lukáš Burget, "Honza" Černocký

    Abstract: The paper describes BUT's English to German offline speech translation (ST) systems developed for IWSLT2021. They are based on jointly trained Automatic Speech Recognition-Machine Translation models. Their performance is evaluated on the MustC-Common test set. In this work, we study their efficiency from the perspective of having a large amount of separate ASR training data and MT training data, and a…

    Submitted 13 July, 2021; originally announced July 2021.

  34. arXiv:2104.07474  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

    Authors: Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon Fernandez Astudillo, Jan "Honza" Černocký

    Abstract: Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) The ASR$\rightarrow$TTS direction is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS. 2) In the TTS$\rightarrow$ASR direction, a hyper-parameter is introduced to scale the attention context f…

    Submitted 13 April, 2021; originally announced April 2021.

  35. arXiv:2104.02571  [pdf, ps, other]

    eess.AS cs.CV

    Speaker embeddings by modeling channel-wise correlations

    Authors: Themos Stafylakis, Johan Rohdin, Lukas Burget

    Abstract: Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first and second order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper we examine an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statisti…

    Submitted 7 July, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted at Interspeech 2021

  36. arXiv:2012.14952  [pdf, other]

    eess.AS cs.SD

    Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks

    Authors: Federico Landini, Ján Profant, Mireia Diez, Lukáš Burget

    Abstract: The recently proposed VBx diarization method uses a Bayesian hidden Markov model to find speaker clusters in a sequence of x-vectors. In this work we perform an extensive comparison of performance of the VBx diarization with other approaches in the literature and we show that VBx achieves superior performance on three of the most popular datasets for evaluating diarization: CALLHOME, AMI and DIHAR…

    Submitted 29 December, 2020; originally announced December 2020.

    Comments: Submitted to Computer Speech and Language, Special Issue on Separation, Recognition, and Diarization of Conversational Speech

  37. arXiv:2011.11984  [pdf, other]

    eess.AS

    Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation

    Authors: Katerina Zmolikova, Marc Delcroix, Lukáš Burget, Tomohiro Nakatani, Jan "Honza" Černocký

    Abstract: In this paper, we propose a method combining a variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work,…

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: 8 pages, 3 figures, to be published in SLT2021

  38. arXiv:2011.03115  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery

    Authors: Bolaji Yusuf, Lucas Ondel, Lukas Burget, Jan Cernocky, Murat Saraclar

    Abstract: In this work, we propose a hierarchical subspace model for acoustic unit discovery. In this approach, we frame the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specify the subspace itself as an embedding on a hyper-subspace. We train the hyper-subspace on a set of transcribed languages and transfer it to the target language. In the target language,…

    Submitted 9 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021

  39. arXiv:2010.11718  [pdf, ps, other]

    eess.AS cs.SD

    Analysis of the BUT Diarization System for VoxConverse Challenge

    Authors: Federico Landini, Ondřej Glembek, Pavel Matějka, Johan Rohdin, Lukáš Burget, Mireia Diez, Anna Silnova

    Abstract: This paper describes the system developed by the BUT team for the fourth track of the VoxCeleb Speaker Recognition Challenge, focusing on diarization on the VoxConverse dataset. The system consists of signal pre-processing, voice activity detection, speaker embedding extraction, an initial agglomerative hierarchical clustering followed by diarization using a Bayesian hidden Markov model, a reclust…

    Submitted 9 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP 2021

  40. arXiv:2004.12111  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Jointly Trained Transformers models for Spoken Language Translation

    Authors: Hari Krishna Vydana, Martin Karafiát, Katerina Zmolikova, Lukáš Burget, Honza Cernocky

    Abstract: Conventional spoken language translation (SLT) systems are pipeline-based systems, where we have an Automatic Speech Recognition (ASR) system to convert the modality of source from speech to text and a Machine Translation (MT) system to translate source text to text in target language. Recent progress in the sequence-sequence architectures has reduced the performance gap between the pipeline bas…

    Submitted 25 April, 2020; originally announced April 2020.

    Comments: 7 pages, 3 figures

    ACM Class: I.2.7

  41. arXiv:2004.04096  [pdf, ps, other

    eess.AS cs.LG cs.SD stat.ML

    Probabilistic embeddings for speaker diarization

    Authors: Anna Silnova, Niko Brümmer, Johan Rohdin, Themos Stafylakis, Lukáš Burget

    Abstract: Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix, thus providing a path for the propagation of information about the quality of the speech segment into a PLDA sco…

    Submitted 6 November, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

    Comments: Awarded: Jack Godfrey Best Student Paper Award, at Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo

  42. arXiv:2002.11356  [pdf, ps, other

    eess.AS

    BUT System for the Second DIHARD Speech Diarization Challenge

    Authors: Federico Landini, Shuai Wang, Mireia Diez, Lukáš Burget, Pavel Matějka, Kateřina Žmolíková, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Ondřej Novotný, Hossein Zeinali, Johan Rohdin

    Abstract: This paper describes the winning systems developed by the BUT team for the four tracks of the Second DIHARD Speech Diarization Challenge. For tracks 1 and 2 the systems were mainly based on performing agglomerative hierarchical clustering (AHC) of x-vectors, followed by another x-vector clustering based on Bayes hidden Markov model and variational Bayes inference. We provide a comparison of the im…

    Submitted 26 February, 2020; originally announced February 2020.

  43. arXiv:1912.06311  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Short-duration Speaker Verification (SdSV) Challenge 2021: the Challenge Evaluation Plan

    Authors: Hossein Zeinali, Kong Aik Lee, Jahangir Alam, Lukas Burget

    Abstract: This document describes the Short-duration Speaker Verification (SdSV) Challenge 2021. The main goal of the challenge is to evaluate new technologies for text-dependent (TD) and text-independent (TI) speaker verification (SV) in a short duration scenario. The proposed challenge evaluates SdSV with varying degrees of phonetic overlap between the enrollment and test utterances (cross-lingual). It is…

    Submitted 24 March, 2021; v1 submitted 12 December, 2019; originally announced December 2019.

  44. arXiv:1912.03627  [pdf, ps, other

    eess.AS cs.CL cs.SD

    A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: the DeepMine Database

    Authors: Hossein Zeinali, Lukáš Burget, Jan "Honza" Černocký

    Abstract: DeepMine is a speech database in Persian and English designed to build and evaluate text-dependent, text-prompted, and text-independent speaker verification, as well as Persian speech recognition systems. It contains more than 1850 speakers and 540 thousand recordings overall; more than 480 hours of speech are transcribed. It is the first public large-scale speaker verification database in Persian…

    Submitted 8 December, 2019; originally announced December 2019.

  45. arXiv:1910.08847  [pdf, ps, other

    eess.AS

    BUT System Description for DIHARD Speech Diarization Challenge 2019

    Authors: Federico Landini, Shuai Wang, Mireia Diez, Lukáš Burget, Pavel Matějka, Kateřina Žmolíková, Ladislav Mošner, Oldřich Plchot, Ondřej Novotný, Hossein Zeinali, Johan Rohdin

    Abstract: This paper describes the systems developed by the BUT team for the four tracks of the second DIHARD speech diarization challenge. For tracks 1 and 2 the systems were based on performing agglomerative hierarchical clustering (AHC) over x-vectors, followed by the Bayesian Hidden Markov Model (HMM) with eigenvoice priors applied at x-vector level followed by the same approach applied at frame level.…

    Submitted 19 October, 2019; originally announced October 2019.

  46. arXiv:1907.07127  [pdf, ps, other

    eess.AS cs.SD

    Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE2019 Challenge

    Authors: Hossein Zeinali, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described. An analysis of different methods is also provided. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first one is a VGG-like two-dimensional CNN. The second one is again a two-dim…

    Submitted 13 July, 2019; originally announced July 2019.

    Comments: arXiv admin note: text overlap with arXiv:1810.04273

  47. arXiv:1907.06112  [pdf, ps, other

    eess.AS cs.CL cs.SD

    BUT VOiCES 2019 System Description

    Authors: Hossein Zeinali, Pavel Matějka, Ladislav Mošner, Oldřich Plchot, Anna Silnova, Ondřej Novotný, Ján Profant, Ondřej Glembek, Lukáš Burget

    Abstract: This is a description of our effort in the VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is a 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptatio…

    Submitted 13 July, 2019; originally announced July 2019.

  48. arXiv:1905.01152  [pdf, ps, other

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

    Authors: Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

    Abstract: Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniqu…

    Submitted 20 August, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

    Comments: INTERSPEECH 2019

  49. arXiv:1904.04235  [pdf, other

    eess.AS cs.SD

    Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Ondrej Glembek, Lukas Burget

    Abstract: In this work, we continue our research on the i-vector extractor for speaker verification (SV) and optimize its architecture for fast and effective discriminative training. We were motivated by the computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model, and at the same…

    Submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.13183

  50. arXiv:1904.03876  [pdf, other

    cs.LG cs.SD eess.AS stat.ML

    Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery

    Authors: Lucas Ondel, Hari Krishna Vydana, Lukáš Burget, Jan Černocký

    Abstract: This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings given a set of labeled recordings from other languages. Our approach may be described by the following two-step procedure: first the model learns the notion of acoustic units from the labeled data and then the model uses its knowledge to find new acoustic units on the target langu…

    Submitted 2 July, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019 * corrected typos * Recalculated the segmentation using +-2 frames tolerance to comply with other publications