Showing 1–17 of 17 results for author: Mimura, M

  1. arXiv:2502.09859

    eess.AS eess.SP

    Microphone Array Geometry Independent Multi-Talker Distant ASR: NTT System for the DASR Task of the CHiME-8 Challenge

    Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

    Abstract: In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from dinner parties to professional meetings and from two to eight speakers. We perform diarization first, followed by speech enhancement, and then…

    Submitted 13 February, 2025; originally announced February 2025.

    Comments: 55 pages, 12 figures

  2. arXiv:2409.20313

    eess.AS cs.CL cs.SD

    Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

    Authors: Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami

    Abstract: A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only enhances…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: Accepted to Interspeech 2024

  3. arXiv:2409.20301

    eess.AS cs.CL cs.SD

    Alignment-Free Training for Transducer-based Multi-Talker ASR

    Authors: Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura

    Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  4. arXiv:2409.05554

    eess.AS

    NTT Multi-Speaker ASR System for the DASR Task of CHiME-8 Challenge

    Authors: Naoyuki Kamo, Naohiro Tawara, Atsushi Ando, Takatomo Kano, Hiroshi Sato, Rintaro Ikeshita, Takafumi Moriya, Shota Horiguchi, Kohei Matsuura, Atsunori Ogawa, Alexis Plaquet, Takanori Ashihara, Tsubasa Ochiai, Masato Mimura, Marc Delcroix, Tomohiro Nakatani, Taichi Asami, Shoko Araki

    Abstract: We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization-first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we developed a new multi-channel speaker counting approach…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: 5 pages, 4 figures, CHiME8 challenge

  5. arXiv:2408.00205

    cs.CL eess.AS

    Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Usin…

    Submitted 31 July, 2024; originally announced August 2024.

    Comments: Accepted to Interspeech2024. Dataset: https://huggingface.co/datasets/komats/mega-ssum

  6. arXiv:2407.01857

    eess.AS cs.SD eess.SP

    SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

    Authors: Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

    Abstract: Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  7. arXiv:2303.14593

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  8. arXiv:2209.04062

    cs.CL cs.SD eess.AS

    Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC) -based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slo…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  9. arXiv:2209.02030

    cs.CL cs.SD eess.AS

    Distilling the Knowledge of BERT for CTC-based ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC) -based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora. However, they significantly slow down the inference of CTC. In this…

    Submitted 5 September, 2022; originally announced September 2022.

  10. arXiv:2110.01857

    cs.CL eess.AS

    ASR Rescoring and Confidence Estimation with ELECTRA

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors should be selected from the n-best list using a language model (LM). However, LMs are usually trained to maximize the likelihood of correct word sequences, not to detect ASR errors. We propose an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP ta…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted in ASRU2021

  11. arXiv:2008.12048

    eess.AS

    End-to-end Music-mixed Speech Recognition

    Authors: Jeongwoo Woo, Masato Mimura, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: Automatic speech recognition (ASR) in multimedia content is one of the promising applications, but speech data in this kind of content are frequently mixed with background music, which is harmful for the performance of ASR. In this study, we propose a method for improving ASR with background music based on time-domain source separation. We utilize Conv-TasNet as a separation network, which has ach…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: Submitted to APSIPA 2020

  12. arXiv:2008.03822

    cs.CL eess.AS

    Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generat…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted in INTERSPEECH2020

  13. arXiv:2005.09394

    eess.AS cs.CL cs.LG cs.SD

    Enhancing Monotonic Multihead Attention for Streaming ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: We investigate a monotonic multihead attention (MMA) by extending hard monotonic attention to Transformer-based automatic speech recognition (ASR) for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries. However, we found not all MA…

    Submitted 30 September, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  14. arXiv:2005.09256

    eess.AS cs.CL

    Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

    Authors: Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: It is important to transcribe and archive speech data of endangered languages for preserving heritages of verbal culture, and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them is generally poor. Nevertheless, we are…

    Submitted 31 July, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted for Interspeech 2020

  15. arXiv:2002.06675

    cs.CL cs.SD eess.AS

    Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

    Authors: Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Ainu is an unwritten language that has been spoken by the Ainu people, who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO, and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of…

    Submitted 16 May, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted in LREC 2020

  16. arXiv:1903.09341

    cs.SD cs.LG eess.AS stat.ML

    Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

    Authors: Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, minimum variance distortionless response (MVDR) beamforming has been widely used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimate such spatial information, conventional studies take…

    Submitted 31 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

  17. arXiv:1710.11439

    cs.SD cs.LG eess.AS stat.ML

    Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization

    Authors: Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. Although this supervised approach requires a very large amount of pair data for training, it is not ro…

    Submitted 19 March, 2018; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: 5 pages, 3 figures, version that Eqs. (9), (19), and (20) in v2 (submitted to ICASSP 2018) are corrected. Samples available here: http://sap.ist.i.kyoto-u.ac.jp/members/yoshiaki/demo/vae-nmf/