Showing 1–50 of 52 results for author: Horiguchi, S

  1. arXiv:2506.05688

    cs.SD cs.CL cs.LG eess.AS

    Voice Impression Control in Zero-Shot TTS

    Authors: Keinichi Fujita, Shota Horiguchi, Yusuke Ijima

    Abstract: Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a lo…

    Submitted 9 June, 2025; v1 submitted 5 June, 2025; originally announced June 2025.

    Comments: 5 pages, 5 figures, Accepted to INTERSPEECH 2025

  2. arXiv:2505.24545

    eess.AS cs.SD

    Pretraining Multi-Speaker Identification for Neural Speaker Diarization

    Authors: Shota Horiguchi, Atsushi Ando, Marc Delcroix, Naohiro Tawara

    Abstract: End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers' speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which cannot be fully obtained from real datasets alone. To address this issue, large-scale simulated data is often used for pretraining, but it requires enormous stor…

    Submitted 30 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech 2025

  3. arXiv:2505.06834

    physics.bio-ph q-bio.SC

    Local stabilizability implies global controllability in catalytic reaction systems

    Authors: Yusuke Himeoka, Shuhei A. Horiguchi, Naoto Shiraishi, Fangzhou Xiao, Tetsuya J. Kobayashi

    Abstract: Controlling complex reaction networks is a fundamental challenge in the fields of physics, biology, and systems engineering. Here, we prove a general principle for catalytic reaction systems with kinetics where the reaction order and the stoichiometric coefficient match: the local stabilizability of a given state implies global controllability within its stoichiometric compatibility class. In othe…

    Submitted 11 May, 2025; originally announced May 2025.

  4. arXiv:2502.14872

    math.GM

    Newton-Mandelbrot set and Murase-Mandelbrot set

    Authors: Shunji Horiguchi

    Abstract: We obtain four extended Newton's methods and three extended Mandelbrot's recurrence formulas from the Wasan (Japanese mathematics in the Edo period (1603-1868)). Furthermore, two extended Newton's methods relate to one of the extended Mandelbrot's recurrence formulas. We lead four types of extended Mandelbrot recurrence formulas. Next, we show that these become the same extended Mandelbrot set, an…

    Submitted 1 February, 2025; originally announced February 2025.

    Comments: 15 pages, 12 figures

    MSC Class: 37F16; 49M15

  5. arXiv:2410.12182

    eess.AS cs.SD

    Guided Speaker Embedding

    Authors: Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the…

    Submitted 1 January, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: Accepted to ICASSP 2025

  6. arXiv:2410.11243

    cs.SD cs.CL eess.AS

    Investigation of Speaker Representation for Target-Speaker Speech Processing

    Authors: Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato

    Abstract: Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker's speech even when it is corrupted by interfering speakers. While most studies have focused on training schemes or system architectures for each spec…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Accepted at IEEE SLT 2024

  7. arXiv:2410.06459

    cs.SD eess.AS

    Mamba-based Segmentation Model for Speaker Diarization

    Authors: Alexis Plaquet, Naohiro Tawara, Marc Delcroix, Shota Horiguchi, Atsushi Ando, Shoko Araki

    Abstract: Mamba is a newly proposed architecture which behaves like a recurrent neural network (RNN) with attention-like capabilities. These properties are promising for speaker diarization, as attention-based models have unsuitable memory requirements for long-form audio, and traditional RNN capabilities are too limited. In this paper, we propose to assess the potential of Mamba for diarization by comparin…

    Submitted 9 October, 2024; v1 submitted 8 October, 2024; originally announced October 2024.

    Comments: 5 pages, 4 figures. Submitted to ICASSP 2025. Code at https://github.com/nttcslab-sp/mamba-diarization

  8. arXiv:2409.20301

    eess.AS cs.CL cs.SD

    Alignment-Free Training for Transducer-based Multi-Talker ASR

    Authors: Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura

    Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  9. arXiv:2409.17488

    q-bio.PE eess.SY math.OC physics.bio-ph q-bio.MN

    Optimal control of stochastic reaction networks with entropic control cost and emergence of mode-switching strategies

    Authors: Shuhei A. Horiguchi, Tetsuya J. Kobayashi

    Abstract: Controlling the stochastic dynamics of biological populations is a challenge that arises across various biological contexts. However, these dynamics are inherently nonlinear and involve a discrete state space, i.e., the number of molecules, cells, or organisms. Additionally, the possibility of extinction has a significant impact on both the dynamics and control strategies, particularly when the po…

    Submitted 25 September, 2024; originally announced September 2024.

    Comments: 12 pages, 4 figures

  10. arXiv:2409.05554

    eess.AS

    NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

    Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

    Abstract: We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: 5 pages, 4 figures, CHiME8 challenge

  11. arXiv:2408.17142

    eess.AS cs.SD

    Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings

    Authors: Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker's speech without pre…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Accepted to IEEE SLT 2024

  12. arXiv:2407.01857

    eess.AS cs.SD eess.SP

    SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

    Authors: Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

    Abstract: Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  13. arXiv:2406.18910

    cs.CL cs.SD eess.AS

    Factor-Conditioned Speaking-Style Captioning

    Authors: Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura

    Abstract: This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned capti…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  14. arXiv:2403.02169

    physics.bio-ph

    A theoretical basis for cell deaths

    Authors: Yusuke Himeoka, Shuhei A. Horiguchi, Tetsuya J. Kobayashi

    Abstract: Understanding deaths and life-death boundaries of cells is a fundamental challenge in biological sciences. In this study, we present a theoretical framework for investigating cell death. We conceptualize cell death as a controllability problem within dynamical systems, and compute the life-death boundary through the development of "stoichiometric rays". This method utilizes enzyme activity as cont…

    Submitted 11 October, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  15. arXiv:2402.08209

    cs.LG cs.AI

    Thresholding Data Shapley for Data Cleansing Using Multi-Armed Bandits

    Authors: Hiroyuki Namba, Shota Horiguchi, Masaki Hamamoto, Masashi Egi

    Abstract: Data cleansing aims to improve model performance by removing a set of harmful instances from the training dataset. Data Shapley is a common theoretically guaranteed method to evaluate the contribution of each instance to model performance; however, it requires training on all subsets of the training data, which is computationally expensive. In this paper, we propose an iterative method to fast iden…

    Submitted 12 February, 2024; originally announced February 2024.

  16. arXiv:2309.01013

    cs.LG

    Streaming Active Learning for Regression Problems Using Regression via Classification

    Authors: Shota Horiguchi, Kota Dohi, Yohei Kawaguchi

    Abstract: One of the challenges in deploying a machine learning model is that the model's performance degrades as the operating environment changes. To maintain the performance, streaming active learning is used, in which the model is retrained by adding a newly annotated sample to the training dataset if the prediction of the sample is not certain enough. Although many streaming active learning methods hav…

    Submitted 15 December, 2023; v1 submitted 2 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  17. arXiv:2305.17758

    cs.SD eess.AS

    CAPTDURE: Captioned Sound Dataset of Single Sources

    Authors: Yuki Okamoto, Kanta Shimonishi, Keisuke Imoto, Kota Dohi, Shota Horiguchi, Yohei Kawaguchi

    Abstract: In conventional studies on environmental sound separation and synthesis using captions, datasets consisting of multiple-source sounds with their captions were used for model training. However, when we collect the captions for multiple-source sound, it is not easy to collect detailed captions for each sound source, such as the number of sound occurrences and timbre. Therefore, it is difficult to ex…

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  18. arXiv:2305.15518

    eess.AS cs.SD

    Spoofing Attacker Also Benefits from Self-Supervised Pretrained Model

    Authors: Aoi Ito, Shota Horiguchi

    Abstract: Large-scale pretrained models using self-supervised learning have reportedly improved the performance of speech anti-spoofing. However, the attacker side may also make use of such models. Also, since it is very expensive to train such models from scratch, pretrained models on the Internet are often used, but the attacker and defender may possibly use the same pretrained model. This paper investiga…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  19. arXiv:2211.14455

    cs.IT math.DG math.ST physics.chem-ph physics.data-an

    Information Geometry of Dynamics on Graphs and Hypergraphs

    Authors: Tetsuya J. Kobayashi, Dimitri Loutchko, Atsushi Kamimura, Shuhei A. Horiguchi, Yuki Sughiyama

    Abstract: We introduce a new information-geometric structure associated with the dynamics on discrete objects such as graphs and hypergraphs. The presented setup consists of two dually flat structures built on the vertex and edge spaces, respectively. The former is the conventional duality between density and potential, e.g., the probability density and its logarithmic form induced by a convex thermodynamic…

    Submitted 5 August, 2023; v1 submitted 25 November, 2022; originally announced November 2022.

    Comments: 92 pages, 9 figures

  20. arXiv:2211.00947

    stat.ML cs.LG

    Linear Embedding-based High-dimensional Batch Bayesian Optimization without Reconstruction Mappings

    Authors: Shuhei A. Horiguchi, Tomoharu Iwata, Taku Tsuzuki, Yosuke Ozawa

    Abstract: The optimization of high-dimensional black-box functions is a challenging problem. When a low-dimensional linear embedding structure can be assumed, existing Bayesian optimization (BO) methods often transform the original problem into optimization in a low-dimensional space. They exploit the low-dimensional structure and reduce the computational burden. However, we reveal that this approach could…

    Submitted 2 November, 2022; originally announced November 2022.

  21. arXiv:2210.03459

    eess.AS cs.CL cs.SD

    Mutual Learning of Single- and Multi-Channel End-to-End Neural Diarization

    Authors: Shota Horiguchi, Yuki Takashima, Shinji Watanabe, Paola Garcia

    Abstract: Due to the high performance of multi-channel speech processing, we can use the outputs from a multi-channel model as teacher labels when training a single-channel model with knowledge distillation. To the contrary, it is also known that single-channel speech data can benefit multi-channel models by mixing it with multi-channel speech data during training or by using it for model pretraining. This…

    Submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted to IEEE SLT 2022

  22. arXiv:2207.00216

    eess.AS

    Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models

    Authors: Yuki Takashima, Shota Horiguchi, Shinji Watanabe, Paola García, Yohei Kawaguchi

    Abstract: In this paper, we present an incremental domain adaptation technique to prevent catastrophic forgetting for an end-to-end automatic speech recognition (ASR) model. Conventional approaches require extra parameters of the same size as the model for optimization, and it is difficult to apply these approaches to end-to-end ASR models because they have a huge amount of parameters. To solve this problem…

    Submitted 1 July, 2022; originally announced July 2022.

    Comments: Accepted for Interspeech 2022

  23. arXiv:2206.02432

    eess.AS cs.CL cs.SD

    Online Neural Diarization of Unlimited Numbers of Speakers Using Global and Local Attractors

    Authors: Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yuki Takashima, Yohei Kawaguchi

    Abstract: A method to perform offline and online speaker diarization for an unlimited number of speakers is described in this paper. End-to-end neural diarization (EEND) has achieved overlap-aware speaker diarization by formulating it as a multi-label classification problem. It has also been extended for a flexible number of speakers by introducing speaker-wise attractors. However, the output number of spea…

    Submitted 22 December, 2022; v1 submitted 6 June, 2022; originally announced June 2022.

    Comments: Accepted to IEEE/ACM TASLP

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 706-720, 2023

  24. Cellular gradient flow structure connects single-cell-level rules and population-level dynamics

    Authors: Shuhei A. Horiguchi, Tetsuya J. Kobayashi

    Abstract: In multicellular systems, the single-cell behaviors should be coordinated consistently with the overall population dynamics and functions. However, the interrelation between single-cell rules and the population-level goal is still elusive. In this work, we reveal that these two levels are naturally connected via a gradient flow structure of the heterogeneous cellular population and that biological…

    Submitted 26 May, 2022; originally announced May 2022.

  25. arXiv:2205.12683

    cs.LG cs.AI stat.ML

    Rethinking Fano's Inequality in Ensemble Learning

    Authors: Terufumi Morishita, Gaku Morio, Shota Horiguchi, Hiroaki Ozaki, Nobuo Nukaga

    Abstract: We propose a fundamental theory on ensemble learning that answers the central question: what factors make an ensemble system good or bad? Previous studies used a variant of Fano's inequality of information theory and derived a lower bound of the classification error rate on the basis of the $\textit{accuracy}$ and $\textit{diversity}$ of models. We revisit the original Fano's inequality and argue…

    Submitted 16 November, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: ICML 2022

  26. arXiv:2204.11232

    eess.AS cs.CL cs.SD

    Improving the Naturalness of Simulated Conversations for End-to-End Neural Diarization

    Authors: Natsuo Yamashita, Shota Horiguchi, Takeshi Homma

    Abstract: This paper investigates a method for simulating natural conversation in the model training of end-to-end neural diarization (EEND). Due to the lack of any annotated real conversational dataset, EEND is usually pretrained on a large-scale simulated conversational dataset first and then adapted to the target real dataset. Simulated datasets play an essential role in the training of EEND, but as yet…

    Submitted 24 April, 2022; originally announced April 2022.

    Comments: Accepted to Speaker Odyssey 2022

  27. arXiv:2112.00209

    cs.SD cs.LG eess.AS

    Environmental Sound Extraction Using Onomatopoeic Words

    Authors: Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto, Keisuke Imoto, Yohei Kawaguchi

    Abstract: An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeic words to specify the target sound to be extracted. By this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoe…

    Submitted 16 February, 2022; v1 submitted 30 November, 2021; originally announced December 2021.

    Comments: Accepted to ICASSP 2022

  28. arXiv:2110.04694

    eess.AS cs.CL cs.SD

    Multi-Channel End-to-End Neural Diarization with Distributed Microphones

    Authors: Shota Horiguchi, Yuki Takashima, Paola Garcia, Shinji Watanabe, Yohei Kawaguchi

    Abstract: Recent progress on end-to-end neural diarization (EEND) has enabled overlap-aware speaker diarization with a single neural network. This paper proposes to enhance EEND by using multi-channel signals from distributed microphones. We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input: spatio-temporal and co-attention encoders. Both are independent of t…

    Submitted 28 March, 2022; v1 submitted 9 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP 2022

  29. arXiv:2109.12362

    math.GM

    Binomial expansion of Newton's method

    Authors: Shunji Horiguchi

    Abstract: We extend the Newton's method and show the extended Newton's method leads to the binomial expansion of Newton's method that the convergences become the quadratic and linearly. In case of the quadratic convergence, we give the convergence comparison of the binomial expansion of Newton's method and Newton's method. And we give convergence comparisons of the binomial expansion of Newton's method and…

    Submitted 25 September, 2021; originally announced September 2021.

    Comments: 9 pages, 4 tables

  30. arXiv:2107.01545

    eess.AS cs.CL cs.SD

    Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors

    Authors: Shota Horiguchi, Shinji Watanabe, Paola Garcia, Yawen Xue, Yuki Takashima, Yohei Kawaguchi

    Abstract: Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an un…

    Submitted 23 September, 2021; v1 submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted to ASRU 2021

  31. Encoder-Decoder Based Attractors for End-to-End Neural Diarization

    Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Paola Garcia

    Abstract: This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional cascaded approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce encoder-decoder-based a…

    Submitted 28 March, 2022; v1 submitted 20 June, 2021; originally announced June 2021.

    Comments: Accepted to IEEE/ACM TASLP. This article is based on our previous conference paper arXiv:2005.09921

  32. arXiv:2106.04764

    eess.AS cs.SD

    Semi-Supervised Training with Pseudo-Labeling for End-to-End Neural Diarization

    Authors: Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Paola García, Kenji Nagamatsu

    Abstract: In this paper, we present a semi-supervised training technique using pseudo-labeling for end-to-end neural diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. However, to get a well-tuned model, EEND requires labeled data for all the joint speech activities of every speaker at each tim…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: Accepted for Interspeech 2021

  33. arXiv:2106.04078

    eess.AS cs.SD

    End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection

    Authors: Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Paola García, Kenji Nagamatsu

    Abstract: In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted for SLT 2021

    Journal ref: IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 849-856

  34. arXiv:2102.01363

    eess.AS cs.CL cs.SD

    The Hitachi-JHU DIHARD III System: Competitive End-to-End Neural Diarization and X-Vector Clustering Systems Combined by DOVER-Lap

    Authors: Shota Horiguchi, Nelson Yalta, Paola Garcia, Yuki Takashima, Yawen Xue, Desh Raj, Zili Huang, Yusuke Fujita, Shinji Watanabe, Sanjeev Khudanpur

    Abstract: This paper provides a detailed description of the Hitachi-JHU system that was submitted to the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results of the five subsystems: two x-vector-based subsystems, two end-to-end neural diarization-based subsystems, and one hybrid subsystem. We refine each system and all five subsystems become competitive and complementary. After…

    Submitted 2 February, 2021; originally announced February 2021.

  35. arXiv:2101.08473

    cs.SD eess.AS

    Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers

    Authors: Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Paola Garcia, Kenji Nagamatsu

    Abstract: We propose a streaming diarization method based on an end-to-end neural diarization (EEND) model, which handles flexible numbers of speakers and overlapping speech. In our previous study, the speaker-tracing buffer (STB) mechanism was proposed to achieve a chunk-wise streaming diarization using a pre-trained EEND model. STB traces the speaker information in previous chunks to map the speakers in a…

    Submitted 6 April, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

  36. arXiv:2012.10055

    eess.AS cs.CL cs.SD

    End-to-End Speaker Diarization as Post-Processing

    Authors: Shota Horiguchi, Paola Garcia, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

    Abstract: This paper investigates the utilization of an end-to-end diarization model as post-processing of conventional clustering-based diarization. Clustering-based diarization methods partition frames into clusters of the number of speakers; thus, they typically cannot handle overlapping speech because each frame is assigned to one speaker. On the other hand, some end-to-end diarization methods can handl…

    Submitted 23 December, 2020; v1 submitted 18 December, 2020; originally announced December 2020.

  37. arXiv:2011.07791

    eess.AS cs.SD eess.SP

    Block-Online Guided Source Separation

    Authors: Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

    Abstract: We propose a block-online algorithm of guided source separation (GSS). GSS is a speech separation method that uses diarization information to update parameters of the generative model of observation signals. Previous studies have shown that GSS performs well in multi-talker scenarios. However, it requires a large amount of calculation time, which is an obstacle to the deployment of online applicat…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: Accepted to SLT 2021

  38. arXiv:2007.15868

    eess.AS cs.CL cs.SD

    Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones

    Authors: Shota Horiguchi, Yusuke Fujita, Kenji Nagamatsu

    Abstract: A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without consid…

    Submitted 31 July, 2020; originally announced July 2020.

    Comments: Accepted to INTERSPEECH 2020

  39. arXiv:2006.02616

    eess.AS cs.SD

    Online End-to-End Neural Diarization with Speaker-Tracing Buffer

    Authors: Yawen Xue, Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Kenji Nagamatsu

    Abstract: This paper proposes a novel online speaker diarization algorithm based on a fully supervised self-attention mechanism (SA-EEND). Online diarization inherently presents a speaker's permutation problem due to the possibility to assign speaker regions incorrectly across the recording. To circumvent this inconsistency, we proposed a speaker-tracing buffer mechanism that selects several input frames re…

    Submitted 6 March, 2021; v1 submitted 3 June, 2020; originally announced June 2020.

    Comments: Accepted to SLT 2021

  40. arXiv:2006.01796

    eess.AS cs.CL cs.SD

    Neural Speaker Diarization with Speaker-Wise Chain Rule

    Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Jing Shi, Kenji Nagamatsu

    Abstract: Speaker diarization is an essential step for processing multi-speaker audio. Although an end-to-end neural diarization (EEND) method achieved state-of-the-art performance, it is limited to a fixed number of speakers. In this paper, we solve this fixed number of speaker issue by a novel speaker-wise conditional inference method based on the probabilistic chain rule. In the proposed method, each spe…

    Submitted 2 June, 2020; originally announced June 2020.

    Comments: Submitted to Interspeech 2020

  41. arXiv:2005.09921

    eess.AS cs.CL cs.SD

    End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

    Authors: Shota Horiguchi, Yusuke Fujita, Shinji Watanabe, Yawen Xue, Kenji Nagamatsu

    Abstract: End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexi…

    Submitted 5 October, 2020; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Accepted to INTERSPEECH 2020

  42. arXiv:2004.09249

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C…

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  43. arXiv:2003.02966  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Diarization: Reformulating Speaker Diarization as Simple Multi-label Classification

    Authors: Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu

    Abstract: The most common approach to speaker diarization is clustering of speaker embeddings. However, the clustering-based approach has a number of problems; i.e., (i) it is not optimized to minimize diarization errors directly, (ii) it cannot handle speaker overlaps correctly, and (iii) it has trouble adapting their speaker embedding models to real audio recordings with speaker overlaps. To solve these p…

    Submitted 24 February, 2020; originally announced March 2020.

    Comments: Submission to IEEE TASLP. This article draws from our previous conference papers: arXiv:1909.06247 and arXiv:1909.05952

  44. arXiv:1909.08103  [pdf, other]

    cs.CL cs.SD eess.AS

    Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

    Authors: Naoyuki Kanda, Shota Horiguchi, Yusuke Fujita, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

    Abstract: This paper investigates the use of target-speaker automatic speech recognition (TS-ASR) for simultaneous speech recognition and speaker diarization of single-channel dialogue recordings. TS-ASR is a technique to automatically extract and recognize only the speech of a target speaker given a short sample utterance of that speaker. One obvious drawback of TS-ASR is that it cannot be used when the sp…

    Submitted 17 September, 2019; originally announced September 2019.

    Comments: Accepted to ASRU 2019

  45. arXiv:1909.06247  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Speaker Diarization with Self-attention

    Authors: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Yawen Xue, Kenji Nagamatsu, Shinji Watanabe

    Abstract: Speaker diarization has been mainly developed based on the clustering of speaker embeddings. However, the clustering-based approach has two major problems; i.e., (i) it is not optimized to minimize diarization errors directly, and (ii) it cannot handle speaker overlaps correctly. To solve these problems, the End-to-End Neural Diarization (EEND), in which a bidirectional long short-term memory (BLS…

    Submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted for ASRU 2019

  46. arXiv:1909.05952  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Neural Speaker Diarization with Permutation-Free Objectives

    Authors: Yusuke Fujita, Naoyuki Kanda, Shota Horiguchi, Kenji Nagamatsu, Shinji Watanabe

    Abstract: In this paper, we propose a novel end-to-end neural-network-based speaker diarization method. Unlike most existing methods, our proposed method does not have separate modules for extraction and clustering of speaker representations. Instead, our model has a single neural network that directly outputs speaker diarization results. To realize such a model, we formulate the speaker diarization problem…

    Submitted 12 September, 2019; originally announced September 2019.

    Comments: Accepted to INTERSPEECH 2019

  47. arXiv:1906.10876  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition

    Authors: Naoyuki Kanda, Shota Horiguchi, Ryoichi Takashima, Yusuke Fujita, Kenji Nagamatsu, Shinji Watanabe

    Abstract: In this paper, we propose a novel auxiliary loss function for target-speaker automatic speech recognition (ASR). Our method automatically extracts and transcribes target speaker's utterances from a monaural mixture of multiple speakers' speech given a short sample of the target speaker. The proposed auxiliary loss function attempts to additionally maximize interference speaker ASR accuracy during t…

    Submitted 26 June, 2019; originally announced June 2019.

    Comments: Accepted to INTERSPEECH 2019

  48. arXiv:1905.12230  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR

    Authors: Naoyuki Kanda, Christoph Boeddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Haeb-Umbach

    Abstract: In this paper, we present Hitachi and Paderborn University's joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As a…

    Submitted 26 June, 2019; v1 submitted 29 May, 2019; originally announced May 2019.

    Comments: Accepted to INTERSPEECH 2019

  49. Personalized Classifier for Food Image Recognition

    Authors: Shota Horiguchi, Sosuke Amano, Makoto Ogawa, Kiyoharu Aizawa

    Abstract: Currently, food image recognition tasks are evaluated against fixed datasets. However, in real-world conditions, there are cases in which the number of samples in each class continues to increase and samples from novel classes appear. In particular, dynamic datasets in which each individual user creates samples and continues the updating process often have content that varies considerably between…

    Submitted 8 April, 2018; originally announced April 2018.

    Comments: Accepted to IEEE Transactions on Multimedia. http://ieeexplore.ieee.org/document/8316919/

    Journal ref: IEEE Transactions on Multimedia 20.10 (2018): 2836-2848

  50. Significance of Softmax-based Features in Comparison to Distance Metric Learning-based Features

    Authors: Shota Horiguchi, Daiki Ikami, Kiyoharu Aizawa

    Abstract: The extraction of useful deep features is important for many computer vision tasks. Deep features extracted from classification networks have proved to perform well in those tasks. To obtain features of greater usefulness, end-to-end distance metric learning (DML) has been applied to train the feature extractor directly. However, in these DML studies, there were no equitable comparisons between fe…

    Submitted 13 April, 2019; v1 submitted 29 December, 2017; originally announced December 2017.

    Comments: 6 pages

    Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019