Showing 1–7 of 7 results for author: Someki, M

  1. arXiv:2506.01845  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    On-device Streaming Discrete Speech Units

    Authors: Kwanghee Choi, Masao Someki, Emma Strubell, Shinji Watanabe

    Abstract: Discrete speech units (DSUs) are derived from clustering the features of self-supervised speech models (S3Ms). DSUs offer significant advantages for on-device streaming speech applications due to their rich phonetic information, high transmission efficiency, and seamless integration with large language models. However, conventional DSU-based approaches are impractical as they require full-length s…

    Submitted 2 June, 2025; originally announced June 2025.

    Comments: Accepted to Interspeech 2025, source code at https://github.com/Masao-Someki/StreamingDSU

  2. arXiv:2505.18860  [pdf, ps, other]

    eess.AS

    Context-Driven Dynamic Pruning for Large Speech Foundation Models

    Authors: Masao Someki, Shikhar Bharadwaj, Atharva Anand Joshi, Chyi-Jiunn Lin, Jinchuan Tian, Jee-weon Jung, Markus Müller, Nathan Susanj, Jing Liu, Shinji Watanabe

    Abstract: Speech foundation models achieve strong generalization across languages and acoustic conditions, but require significant computational resources for inference. In the context of speech foundation models, pruning techniques have been studied that dynamically optimize model structures based on the target audio leveraging external context. In this work, we extend this line of research and propose con…

    Submitted 24 May, 2025; originally announced May 2025.

    Comments: Accepted at Interspeech 2025

  3. arXiv:2505.14874  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages

    Authors: Chin-Jou Li, Eunjung Yeo, Kwanghee Choi, Paula Andrea Pérez-Toro, Masao Someki, Rohan Kumar Das, Zhengjun Yue, Juan Rafael Orozco-Arroyave, Elmar Nöth, David R. Mortensen

    Abstract: Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech.…

    Submitted 30 May, 2025; v1 submitted 20 May, 2025; originally announced May 2025.

    Comments: 5 pages, 1 figure, Accepted to Interspeech 2025

  4. arXiv:2409.09506  [pdf, other]

    cs.SD cs.AI eess.AS

    ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

    Authors: Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe

    Abstract: We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, a…

    Submitted 14 September, 2024; originally announced September 2024.

    Comments: Accepted to SLT 2024

  5. Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

    Authors: Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

    Abstract: Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam sea…

    Submitted 30 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2023

  6. arXiv:2209.09756  [pdf, other]

    eess.AS

    ESPnet-ONNX: Bridging a Gap Between Research and Production

    Authors: Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

    Abstract: In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting a model to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish…

    Submitted 14 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted to APSIPA ASC 2022

  7. A Comparative Study on Transformer vs RNN in Speech Applications

    Authors: Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

    Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We underto…

    Submitted 28 September, 2019; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: Accepted at ASRU 2019

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2019