Skip to main content

Showing 1–6 of 6 results for author: Klement, D

Searching in archive eess. Search in all archives.
.
  1. arXiv:2506.15456  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Factorized RVQ-GAN For Disentangled Speech Tokenization

    Authors: Sameer Khurana, Dominik Klement, Antoine Laurent, Dominik Bobos, Juraj Novosad, Peter Gazdik, Ellen Zhang, Zili Huang, Amir Hussein, Ricard Marxer, Yoshiki Masuyama, Ryo Aihara, Chiori Hori, Francois G. Germain, Gordon Wichern, Jonathan Le Roux

    Abstract: We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels-acoustic, phonetic, and lexical-within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on Engl… ▽ More

    Submitted 18 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  2. arXiv:2506.13414  [pdf, ps, other

    eess.AS

    BUT System for the MLC-SLM Challenge

    Authors: Alexander Polok, Jiangyu Han, Dominik Klement, Samuele Cornell, Jan Černocký, Lukáš Burget

    Abstract: We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, dem… ▽ More

    Submitted 16 June, 2025; originally announced June 2025.

  3. arXiv:2501.00114  [pdf, other

    eess.AS cs.SD

    DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition

    Authors: Alexander Polok, Dominik Klement, Martin Kocour, Jiangyu Han, Federico Landini, Bolaji Yusuf, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

    Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information. DiCoW e… ▽ More

    Submitted 30 December, 2024; originally announced January 2025.

  4. arXiv:2411.02165  [pdf, other

    eess.AS cs.SD

    Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization

    Authors: Petr Pálka, Federico Landini, Dominik Klement, Mireia Diez, Anna Silnova, Marc Delcroix, Lukáš Burget

    Abstract: In spite of the popularity of end-to-end diarization systems nowadays, modular systems comprised of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) different modules independen… ▽ More

    Submitted 4 November, 2024; originally announced November 2024.

  5. arXiv:2409.09543  [pdf, other

    eess.AS cs.SD

    Target Speaker ASR with Whisper

    Authors: Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget

    Abstract: We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target speaker ASR. The key claim of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization outpu… ▽ More

    Submitted 16 January, 2025; v1 submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted to ICASSP 2025

  6. arXiv:2310.02732  [pdf, ps, other

    eess.AS cs.SD

    Discriminative Training of VBx Diarization

    Authors: Dominik Klement, Mireia Diez, Federico Landini, Lukáš Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara

    Abstract: Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework fo… ▽ More

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024