Showing 1–12 of 12 results for author: Kumatani, K

  1. arXiv:2506.04518   

    eess.AS cs.CL

    Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model

    Authors: Haibin Wu, Yuxuan Hu, Ruchao Fan, Xiaofei Wang, Kenichi Kumatani, Bo Ren, Jianwei Yu, Heng Lu, Lijuan Wang, Yao Qian, Jinyu Li

    Abstract: Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model, offering a promising direction for spoken dialogue systems. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality. In this work, we systematically compare representative joint speech-text decoding strategies-including the interleav…

    Submitted 12 June, 2025; v1 submitted 4 June, 2025; originally announced June 2025.

    Comments: Our company need to do internal review

  2. arXiv:2205.08598  [pdf, other]

    cs.SD cs.CL eess.AS eess.SP

    Deploying self-supervised learning in the wild for hybrid automatic speech recognition

    Authors: Mostafa Karimi, Changliang Liu, Kenichi Kumatani, Yao Qian, Tianyu Wu, Jian Wu

    Abstract: Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR). These great improvements have been reported mostly based on highly curated datasets such as LibriSpeech for non-streaming End-to-End ASR models. However, the pivotal characteristics of SSL is to be utilized for any untranscribed audio data. In this paper, we provide a full exploration on…

    Submitted 17 May, 2022; originally announced May 2022.

  3. arXiv:2112.05826  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Sequence-level self-learning with multiple hypotheses

    Authors: Kenichi Kumatani, Dimitrios Dimitriadis, Yashesh Gaur, Robert Gmyr, Sefik Emre Eskimez, Jinyu Li, Michael Zeng

    Abstract: In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, the imperfect ASR result makes unsupervised learning difficult to consistently improve recognition performance especially in the case that multipl…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: Published in Interspeech 2020: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Report number: https://www.isca-speech.org/archive_v0/Interspeech_2020/pdfs/2020.pdf

    Journal ref: Proc. Interspeech 2020, pages 3775-3779

  4. arXiv:2112.05820  [pdf, other]

    cs.CL cs.AI cs.LG eess.AS

    Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition

    Authors: Kenichi Kumatani, Robert Gmyr, Felipe Cruz Salinas, Linquan Liu, Wei Zuo, Devang Patel, Eric Sun, Yu Shi

    Abstract: The sparsely-gated Mixture of Experts (MoE) can magnify a network capacity with a little computational complexity. In this work, we investigate how multi-lingual Automatic Speech Recognition (ASR) networks can be scaled up with a simple routing algorithm in order to achieve better accuracy. More specifically, we apply the sparsely-gated MoE technique to two types of networks: Sequence-to-Sequence…

    Submitted 4 January, 2022; v1 submitted 10 December, 2021; originally announced December 2021.

  5. arXiv:2110.07909  [pdf, other]

    cs.CL eess.AS

    Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

    Authors: Rimita Lahiri, Kenichi Kumatani, Eric Sun, Yao Qian

    Abstract: Multilingual end-to-end (E2E) models have shown a great potential in the expansion of the language coverage in the realm of automatic speech recognition (ASR). In this paper, we aim to enhance the multilingual ASR performance in two ways, 1) studying the impact of feeding a one-hot vector identifying the language, 2) formulating the task with a meta-learning objective combined with self-supervised lea…

    Submitted 15 October, 2021; originally announced October 2021.

    Comments: 5 pages

  6. arXiv:2107.05233  [pdf, other]

    eess.AS

    UniSpeech at scale: An Empirical Study of Pre-training Method on Large-Scale Speech Recognition Dataset

    Authors: Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Yao Qian, Kenichi Kumatani, Furu Wei

    Abstract: Recently, there has been a vast interest in self-supervised learning (SSL) where the model is pre-trained on large scale unlabeled data and then fine-tuned on a small labeled dataset. The common wisdom is that SSL helps resource-limited tasks in which only a limited amount of labeled data is available. The benefit of SSL keeps diminishing when the labeled training data amount increases. To our bes…

    Submitted 12 July, 2021; originally announced July 2021.

  7. arXiv:2101.07597  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

    Authors: Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

    Abstract: In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve…

    Submitted 10 June, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

    Comments: accepted by ICML2021

  8. arXiv:2002.02520  [pdf, other]

    cs.SD cs.CL eess.AS

    Robust Multi-channel Speech Recognition using Frequency Aligned Network

    Authors: Taejin Park, Kenichi Kumatani, Minhua Wu, Shiva Sundaram

    Abstract: Conventional speech enhancement technique such as beamforming has known benefits for far-field speech recognition. Our own work in frequency-domain multi-channel acoustic modeling has shown additional improvements by training a spatial filtering layer jointly within an acoustic model. In this paper, we further develop this idea and use frequency aligned network for robust multi-channel automatic s…

    Submitted 6 February, 2020; originally announced February 2020.

  9. Fully Learnable Front-End for Multi-Channel Acoustic Modeling using Semi-Supervised Learning

    Authors: Sanna Wager, Aparna Khare, Minhua Wu, Kenichi Kumatani, Shiva Sundaram

    Abstract: In this work, we investigated the teacher-student training paradigm to train a fully learnable multi-channel acoustic model for far-field automatic speech recognition (ASR). Using a large offline teacher model trained on beamformed audio, we trained a simpler multi-channel student acoustic model used in the speech recognition system. For the student, both multi-channel feature extraction layers an…

    Submitted 31 January, 2020; originally announced February 2020.

    Comments: To appear in ICASSP 2020

  10. Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition

    Authors: Kenichi Kumatani, Minhua Wu, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

    Abstract: The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the differenc…

    Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

    Comments: ICASSP2019, 5 pages. arXiv admin note: substantial text overlap with arXiv:1903.05299

    Report number: https://doi.org/10.1109/ICASSP.2019.8682294

    Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pages 6635-6639

  11. Frequency Domain Multi-channel Acoustic Modeling for Distant Speech Recognition

    Authors: Minhua Wu, Kenichi Kumatani, Shiva Sundaram, Nikko Strom, Bjorn Hoffmeister

    Abstract: Conventional far-field automatic speech recognition (ASR) systems typically employ microphone array techniques for speech enhancement in order to improve robustness against noise or reverberation. However, such speech enhancement techniques do not always yield ASR accuracy improvement because the optimization criterion for speech enhancement is not directly relevant to the ASR objective. In this w…

    Submitted 28 April, 2019; v1 submitted 12 March, 2019; originally announced March 2019.

    Comments: ICASSP 2019, 5 pages

    Report number: https://doi.org/10.1109/ICASSP.2019.8682977

    Journal ref: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2019, pages 6640-6644

  12. arXiv:1901.02348  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning

    Authors: Ladislav Mošner, Minhua Wu, Anirudh Raju, Sree Hari Krishnan Parthasarathi, Kenichi Kumatani, Shiva Sundaram, Roland Maas, Björn Hoffmeister

    Abstract: For real-world speech recognition applications, noise robustness is still a challenge. In this work, we adopt the teacher-student (T/S) learning technique using a parallel clean and noisy corpus for improving automatic speech recognition (ASR) performance under multimedia noise. On top of that, we apply a logits selection method which only preserves the k highest values to prevent wrong emphasis o…

    Submitted 15 March, 2019; v1 submitted 5 January, 2019; originally announced January 2019.

    Comments: To Appear in ICASSP 2019