Showing 1–20 of 20 results for author: Tawara, N

  1. arXiv:2506.12500  [pdf, ps, other]

    eess.AS cs.SD

    Mitigating Non-Target Speaker Bias in Guided Speaker Embedding

    Authors: Shota Horiguchi, Takanori Ashihara, Marc Delcroix, Atsushi Ando, Naohiro Tawara

    Abstract: Obtaining high-quality speaker embeddings in multi-speaker conditions is crucial for many applications. A recently proposed guided speaker embedding framework, which utilizes speech activities of target and non-target speakers as clues, drastically improved embeddings under severe overlap with small degradation in low-overlap cases. However, since extreme overlaps are rare in natural conversations…

    Submitted 14 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025

  2. arXiv:2506.11605  [pdf, ps, other]

    cs.SD eess.AS

    Dissecting the Segmentation Model of End-to-End Diarization with Vector Clustering

    Authors: Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki, Hervé Bredin

    Abstract: End-to-End Neural Diarization with Vector Clustering is a powerful and practical approach to perform Speaker Diarization. Multiple enhancements have been proposed for the segmentation model of these pipelines, but their synergy had not been thoroughly evaluated. In this work, we provide an in-depth analysis on the impact of major architecture choices on the performance of the pipeline. We investig…

    Submitted 13 June, 2025; originally announced June 2025.

    Comments: 37 pages, 18 figures. Submitted to Computer Speech & Language

  3. arXiv:2505.24545  [pdf, ps, other]

    eess.AS cs.SD

    Pretraining Multi-Speaker Identification for Neural Speaker Diarization

    Authors: Shota Horiguchi, Atsushi Ando, Marc Delcroix, Naohiro Tawara

    Abstract: End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers' speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which cannot be fully obtained from real datasets alone. To address this issue, large-scale simulated data is often used for pretraining, but it requires enormous stor…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  4. arXiv:2502.09859  [pdf, ps, other]

    eess.AS eess.SP

    Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

    Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

    Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings and from two to eight speakers. We perform diarization first, followed by speech enhancement, and then…

    Submitted 18 June, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

    Comments: 55 pages, 12 figures

  5. arXiv:2410.12182  [pdf, other]

    eess.AS cs.SD

    Guided Speaker Embedding

    Authors: Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the…

    Submitted 1 January, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: Accepted to ICASSP 2025

  6. arXiv:2410.06459  [pdf, other]

    cs.SD eess.AS

    Mamba-based Segmentation Model for Speaker Diarization

    Authors: Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki

    Abstract: Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparin…

    Submitted 9 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: 5 pages, 4 figures. Submitted to ICASSP 2025. Code at https://github.com/nttcslab-sp/mamba-diarization

  7. arXiv:2409.12528  [pdf, other]

    cs.SD eess.AS

    SoundBeam meets M2D: Target Sound Extraction with Audio Foundation Model

    Authors: Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Daisuke Niizumi, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system requires solving two problems at once, identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound. The duality of the problem and the wide variety o…

    Submitted 19 September, 2024; originally announced September 2024.

  8. arXiv:2409.05554  [pdf, other]

    eess.AS

    NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

    Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

    Abstract: We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: 5 pages, 4 figures, CHiME8 challenge

  9. arXiv:2408.17142  [pdf, other]

    eess.AS cs.SD

    Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings

    Authors: Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker's speech without pre…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Accepted to IEEE SLT 2024

  10. arXiv:2408.00344  [pdf, other]

    cs.SD eess.AS

    Interaural time difference loss for binaural target sound extraction

    Authors: Carlos Hernandez-Olivan, Marc Delcroix, Tsubasa Ochiai, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. Indeed, for many applications, the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network trained to output only the desired sound given a b…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: Accepted in the International Workshop on Acoustic Signal Enhancement (IWAENC 2024)

  11. arXiv:2406.18972  [pdf, ps, other]

    eess.AS cs.CL

    Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

    Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

    Abstract: Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LL…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 5 pages

  12. arXiv:2312.14609  [pdf, ps, other]

    eess.AS cs.CL

    BLSTM-Based Confidence Estimation for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Naohiro Tawara, Takatomo Kano, Marc Delcroix

    Abstract: Confidence estimation, in which we estimate the reliability of each recognized token (e.g., word, sub-word, and character) in automatic speech recognition (ASR) hypotheses and detect incorrectly recognized tokens, is an important function for developing ASR applications. In this study, we perform confidence estimation for end-to-end (E2E) ASR hypotheses. Recent E2E ASR systems show high performanc…

    Submitted 22 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2021

  13. arXiv:2312.12764  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Lattice Rescoring Based on Large Ensemble of Complementary Neural Language Models

    Authors: Atsunori Ogawa, Naohiro Tawara, Marc Delcroix, Shoko Araki

    Abstract: We investigate the effectiveness of using a large ensemble of advanced neural language models (NLMs) for lattice rescoring on automatic speech recognition (ASR) hypotheses. Previous studies have reported the effectiveness of combining a small number of NLMs. In contrast, in this study, we combine up to eight NLMs, i.e., forward/backward long short-term memory/Transformer-LMs that are trained with…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2022

  14. arXiv:2310.11010  [pdf, ps, other]

    eess.AS cs.CL

    Iterative Shallow Fusion of Backward Language Model for End-to-End Speech Recognition

    Authors: Atsunori Ogawa, Takafumi Moriya, Naoyuki Kamo, Naohiro Tawara, Marc Delcroix

    Abstract: We propose a new shallow fusion (SF) method to exploit an external backward language model (BLM) for end-to-end automatic speech recognition (ASR). The BLM has complementary characteristics with a forward language model (FLM), and the effectiveness of their combination has been confirmed by rescoring ASR hypotheses as post-processing. In the proposed SF, we iteratively apply the BLM to partial ASR…

    Submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted to ICASSP 2023

  15. arXiv:2310.02732  [pdf, ps, other]

    eess.AS cs.SD

    Discriminative Training of VBx Diarization

    Authors: Dominik Klement, Mireia Diez, Federico Landini, Lukáš Burget, Anna Silnova, Marc Delcroix, Naohiro Tawara

    Abstract: Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework fo…

    Submitted 4 October, 2023; originally announced October 2023.

    Comments: Submitted to ICASSP 2024

  16. arXiv:2309.12656  [pdf, other]

    eess.AS cs.SD

    NTT speaker diarization system for CHiME-7: multi-domain, multi-microphone End-to-end and vector clustering diarization

    Authors: Naohiro Tawara, Marc Delcroix, Atsushi Ando, Atsunori Ogawa

    Abstract: This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using d…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: 5 pages, 5 figures, Submitted to ICASSP 2024

  17. arXiv:2305.13580  [pdf, other]

    eess.AS cs.SD

    Multi-Stream Extension of Variational Bayesian HMM Clustering (MS-VBx) for Combined End-to-End and Vector Clustering-based Diarization

    Authors: Marc Delcroix, Naohiro Tawara, Mireia Diez, Federico Landini, Anna Silnova, Atsunori Ogawa, Tomohiro Nakatani, Lukas Burget, Shoko Araki

    Abstract: Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks. EEND-VC thus generates multiple streams of embeddi…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  18. arXiv:2105.09040  [pdf, other]

    eess.AS cs.SD

    Advances in integration of end-to-end neural and clustering-based diarization for real conversational speech

    Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

    Abstract: Recently, we proposed a novel speaker diarization method called End-to-End-Neural-Diarization-vector clustering (EEND-vector clustering) that integrates clustering-based and end-to-end neural network-based diarization approaches into one framework. The proposed method combines advantages of both frameworks, i.e. high diarization performance and handling of overlapped speech based on EEND, and robu…

    Submitted 31 August, 2021; v1 submitted 19 May, 2021; originally announced May 2021.

    Comments: 5 pages, 1 figure, Interspeech2021. (Update to include a reference to the code)

  19. arXiv:2010.13366  [pdf, other]

    eess.AS cs.SD stat.ML

    Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds

    Authors: Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara

    Abstract: Recent diarization technologies can be categorized into two approaches, i.e., clustering and end-to-end neural approaches, which have different pros and cons. The clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While it can be seen as a current state-of-the-art approach that works for various challenging data with reasonable r…

    Submitted 4 February, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

    Comments: To appear in ICASSP 2021

  20. arXiv:2001.08378  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam

    Authors: Marc Delcroix, Tsubasa Ochiai, Katerina Zmolikova, Keisuke Kinoshita, Naohiro Tawara, Tomohiro Nakatani, Shoko Araki

    Abstract: Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents…

    Submitted 23 January, 2020; originally announced January 2020.

    Comments: 5 pages, 3 figures. Submitted to ICASSP 2020