Showing 1–27 of 27 results for author: Kashiwagi, Y

  1. arXiv:2506.01439  [pdf, ps, other]

    cs.CL eess.AS

    Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data

    Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa

    Abstract: This paper reports on the development of a large-scale speech recognition model, Whale. Similar to models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Whale's architecture integrates the w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises va…

    Submitted 2 June, 2025; originally announced June 2025.

  2. arXiv:2506.00722  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Chain-of-Thought Training for Open E2E Spoken Dialogue Systems

    Authors: Siddhant Arora, Jinchuan Tian, Hayato Futami, Jee-weon Jung, Jiatong Shi, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Unlike traditional cascaded pipelines, end-to-end (E2E) spoken dialogue systems preserve full differentiability and capture non-phonemic information, making them well-suited for modeling spoken interactions. However, existing E2E approaches often require large-scale training data and generate responses lacking semantic coherence. We propose a simple yet effective strategy leveraging a chain-of-th…

    Submitted 31 May, 2025; originally announced June 2025.

    Comments: Accepted at INTERSPEECH 2025

  3. arXiv:2505.16207  [pdf, ps, other]

    cs.SD eess.AS

    Differentiable K-means for Fully-optimized Discrete Token-based ASR

    Authors: Kentaro Onda, Yosuke Kashiwagi, Emiru Tsunoo, Hayato Futami, Shinji Watanabe

    Abstract: Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL feat…

    Submitted 22 May, 2025; originally announced May 2025.

    Comments: Accepted by Interspeech2025

  4. arXiv:2503.08533  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems

    Authors: Siddhant Arora, Yifan Peng, Jiatong Shi, Jinchuan Tian, William Chen, Shikhar Bharadwaj, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Shuichiro Shimizu, Vaibhav Srivastav, Shinji Watanabe

    Abstract: Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo furthe…

    Submitted 11 March, 2025; originally announced March 2025.

    Comments: Accepted at NAACL 2025 Demo Track

  5. arXiv:2409.15732  [pdf, other]

    cs.CL cs.SD eess.AS

    Hypothesis Clustering and Merging: Novel MultiTalker Speech Recognition with Speaker Tokens

    Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

    Abstract: In many real-world scenarios, such as meetings, multiple speakers are present with an unknown number of participants, and their utterances often overlap. We address these multi-speaker challenges by a novel attention-based encoder-decoder method augmented with special speaker class tokens obtained by speaker clustering. During inference, we select multiple recognition hypotheses conditioned on pre…

    Submitted 24 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  6. arXiv:2406.16107  [pdf, ps, other]

    eess.AS cs.CL

    Decoder-only Architecture for Streaming End-to-end Speech Recognition

    Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

    Abstract: Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable characteristic for streaming applications of ASR. In this work, we propose to use a decoder-only architecture for blockwise streaming ASR. In our approach, speech features…

    Submitted 1 August, 2024; v1 submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted for Interspeech 2024

  7. arXiv:2406.12611  [pdf, other]

    cs.SD cs.CL eess.AS

    Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting

    Authors: Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Siddhant Arora, Shinji Watanabe

    Abstract: End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attentio…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH 2024

  8. arXiv:2406.12317  [pdf, other]

    cs.CL eess.AS

    Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model

    Authors: Hayato Futami, Siddhant Arora, Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting to new data for a specific task without experiencing catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specif…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech2024

  9. arXiv:2312.09582  [pdf, other]

    cs.CL cs.SD eess.AS

    Phoneme-aware Encoding for Prefix-tree-based Contextual ASR

    Authors: Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Hiroaki Ogawa, Siddhant Arora, Shinji Watanabe

    Abstract: In speech recognition applications, it is important to recognize context-specific rare words, such as proper nouns. Tree-constrained Pointer Generator (TCPGen) has shown promise for this purpose, which efficiently biases such words with a prefix tree. While the original TCPGen relies on grapheme-based encoding, we propose extending it with phoneme-aware encoding to better recognize words of unusua…

    Submitted 15 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP2024

  10. arXiv:2310.02973  [pdf, other]

    cs.CL cs.SD eess.AS

    UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

    Authors: Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

    Abstract: Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additio…

    Submitted 3 April, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted at NAACL 2024

  11. arXiv:2309.08876  [pdf, ps, other]

    eess.AS cs.SD

    Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

    Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

    Abstract: Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training schemes to use text-only data. Inspired by recent advances in decoder-only language models (LMs), such as GPT-3 and PaLM adopted for speech-processing tasks, we prop…

    Submitted 9 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

  12. arXiv:2307.12767  [pdf, ps, other]

    eess.AS cs.SD

    Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

    Authors: Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

    Abstract: Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention encoder-decoder mitigates this issue using soft attention to the input, while it tends to overestimate labels biased towards its training domain, unlike CTC. We exploi…

    Submitted 24 July, 2023; originally announced July 2023.

    Comments: Accepted for Interspeech 2023

  13. arXiv:2307.11005  [pdf, other]

    cs.CL cs.SD eess.AS

    Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

    Authors: Siddhant Arora, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: There has been an increased interest in the integration of pretrained speech recognition (ASR) and language models (LM) into the SLU framework. However, prior methods often struggle with a vocabulary mismatch between pretrained models, and LM cannot be directly utilized as they diverge from its NLU formulation. In this study, we propose a three-pass end-to-end (E2E) SLU system that effectively int…

    Submitted 20 July, 2023; originally announced July 2023.

    Comments: Accepted at INTERSPEECH 2023

  14. arXiv:2306.01247  [pdf, other]

    eess.AS

    Tensor decomposition for minimization of E2E SLU model toward on-device processing

    Authors: Yosuke Kashiwagi, Siddhant Arora, Hayato Futami, Jessica Huynh, Shih-Lun Wu, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: Spoken Language Understanding (SLU) is a critical speech recognition application and is often deployed on edge devices. Consequently, on-device processing plays a significant role in the practical implementation of SLU. This paper focuses on the end-to-end (E2E) SLU model due to its small latency property, unlike a cascade system, and aims to minimize the computational cost. We reduce the model si…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted by INTERSPEECH 2023

  15. arXiv:2305.01620  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

    Authors: Siddhant Arora, Hayato Futami, Shih-Lun Wu, Jessica Huynh, Yifan Peng, Yosuke Kashiwagi, Emiru Tsunoo, Brian Yan, Shinji Watanabe

    Abstract: Recently there have been efforts to introduce new benchmark tasks for spoken language understanding (SLU), like semantic parsing. In this paper, we describe our proposed spoken semantic parsing system for the quality track (Track 1) in Spoken Language Understanding Grand Challenge which is part of ICASSP Signal Processing Grand Challenge 2023. We experiment with both end-to-end and pipeline system…

    Submitted 6 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: First Place in Track 1 of STOP Challenge, which is part of ICASSP Signal Processing Grand Challenge 2023

  16. arXiv:2305.01194  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

    Authors: Hayato Futami, Jessica Huynh, Siddhant Arora, Shih-Lun Wu, Yosuke Kashiwagi, Yifan Peng, Brian Yan, Emiru Tsunoo, Shinji Watanabe

    Abstract: This paper describes our system for the low-resource domain adaptation track (Track 3) in Spoken Language Understanding Grand Challenge, which is a part of ICASSP Signal Processing Grand Challenge 2023. In the track, we adopt a pipeline approach of ASR and NLU. For ASR, we fine-tune Whisper for each domain with upsampling. For NLU, we fine-tune BART on all the Track3 data and then on low-resource…

    Submitted 11 May, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: To appear at ICASSP2023

  17. arXiv:2211.08726  [pdf, other]

    cs.CL cs.SD eess.AS

    Streaming Joint Speech Recognition and Disfluency Detection

    Authors: Hayato Futami, Emiru Tsunoo, Kentaro Shibata, Yosuke Kashiwagi, Takao Okuda, Siddhant Arora, Shinji Watanabe

    Abstract: Disfluency detection has mainly been solved in a pipeline approach, as post-processing of speech recognition. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, which work in a streaming manner. Compared to pipeline approaches, the joint models can leverage acoustic information that makes disfluency detection robust to…

    Submitted 11 May, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP2023

  18. arXiv:2206.07430  [pdf, ps, other]

    eess.AS cs.SD

    Residual Language Model for End-to-end Speech Recognition

    Authors: Emiru Tsunoo, Yosuke Kashiwagi, Chaitanya Narisetty, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition suffers from adaptation to unknown target domain speech despite being trained with a large amount of paired audio–text data. Recent studies estimate a linguistic bias of the model as the internal language model (LM). To effectively adapt to the target domain, the internal LM is subtracted from the posterior during inference and fused with an external target…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Accepted for Interspeech2022

  19. arXiv:2202.01405  [pdf, other]

    eess.AS cs.CL cs.SD

    Joint Speech Recognition and Audio Captioning

    Authors: Chaitanya Narisetty, Emiru Tsunoo, Xuankai Chang, Yosuke Kashiwagi, Michael Hentschel, Shinji Watanabe

    Abstract: Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources. Most end-to-end monaural speech recognition systems either remove these background sounds using speech enhancement or train noise-robust models. For better model interpretability and holistic understanding, we aim to bring together the growing field of automated audio captioning (AA…

    Submitted 2 February, 2022; originally announced February 2022.

    Comments: 5 pages, 2 figures. Accepted for ICASSP 2022

  20. arXiv:2201.10190  [pdf, ps, other]

    eess.AS cs.SD

    Run-and-back stitch search: novel block synchronous decoding for streaming encoder-decoder ASR

    Authors: Emiru Tsunoo, Chaitanya Narisetty, Michael Hentschel, Yosuke Kashiwagi, Shinji Watanabe

    Abstract: A streaming style inference of encoder-decoder automatic speech recognition (ASR) system is important for reducing latency, which is essential for interactive use cases. To this end, we propose a novel blockwise synchronous decoding algorithm with a hybrid approach that combines endpoint prediction and endpoint post-determination. In the endpoint prediction, we compute the expectation of the numbe…

    Submitted 25 January, 2022; originally announced January 2022.

    Comments: Accepted for ICASSP2022

  21. arXiv:2110.05968  [pdf, ps, other]

    eess.AS cs.AI

    Improving Character Error Rate Is Not Equal to Having Clean Speech: Speech Enhancement for ASR Systems with Black-box Acoustic Models

    Authors: Ryosuke Sawata, Yosuke Kashiwagi, Shusuke Takahashi

    Abstract: A deep neural network (DNN)-based speech enhancement (SE) aiming to maximize the performance of an automatic speech recognition (ASR) system is proposed in this paper. In order to optimize the DNN-based SE model in terms of the character error rate (CER), which is one of the metrics used to evaluate the ASR system and generally non-differentiable, our method uses two DNNs: one for speech processing and…

    Submitted 22 February, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: Accepted by ICASSP 2022

  22. arXiv:2106.03419  [pdf, ps, other]

    eess.AS cs.SD

    Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios

    Authors: Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe

    Abstract: Although end-to-end automatic speech recognition (E2E ASR) has achieved great performance in tasks that have numerous paired data, it is still challenging to make E2E ASR robust against noisy and low-resource conditions. In this study, we investigated data augmentation methods for E2E ASR in distant-talk scenarios. E2E ASR models are trained on the series of CHiME challenge datasets, which are sui…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted for Interspeech2021

  23. arXiv:2102.09168  [pdf, other]

    eess.AS cs.SD

    Gaussian Kernelized Self-Attention for Long Sequence Data and Its Application to CTC-based Speech Recognition

    Authors: Yosuke Kashiwagi, Emiru Tsunoo, Shinji Watanabe

    Abstract: Self-attention (SA) based models have recently achieved significant performance improvements in hybrid and end-to-end automatic speech recognition (ASR) systems owing to their flexible context modeling capability. However, it is also known that the accuracy degrades when applying SA to long sequence data. This is mainly due to the length mismatch between the inference and training data because the…

    Submitted 18 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP2021

  24. arXiv:2006.14941  [pdf, ps, other]

    eess.AS cs.SD

    Streaming Transformer ASR with Blockwise Synchronous Beam Search

    Authors: Emiru Tsunoo, Yosuke Kashiwagi, Shinji Watanabe

    Abstract: The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source–target attention. In this paper, we propose a novel blockwise synchronous beam search algorit…

    Submitted 17 November, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

    Comments: Accepted for SLT 2021

  25. arXiv:1910.11871  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Towards Online End-to-end Transformer Automatic Speech Recognition

    Authors: Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

    Abstract: The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the entire input sequence is required to compute self-attention. We have proposed a block processing method for the Transformer encoder by introducing a context-awar…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: arXiv admin note: text overlap with arXiv:1910.07204

  26. arXiv:1910.07204  [pdf, ps, other]

    eess.AS cs.CL

    Transformer ASR with Contextual Block Processing

    Authors: Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

    Abstract: The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute self-attention. In this paper, we propose a new block processing method for the Transformer encoder by in…

    Submitted 16 October, 2019; originally announced October 2019.

    Comments: Accepted for ASRU 2019

  27. arXiv:1905.07149  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System

    Authors: Emiru Tsunoo, Yosuke Kashiwagi, Satoshi Asakawa, Toshiyuki Kumakura

    Abstract: An on-device DNN-HMM speech recognition system efficiently works with a limited vocabulary in the presence of a variety of predictable noise. In such a case, vocabulary and environment adaptation is highly effective. In this paper, we propose a novel method of end-to-end (E2E) adaptation, which adjusts not only an acoustic model (AM) but also a weighted finite-state transducer (WFST). We convert a…

    Submitted 24 June, 2019; v1 submitted 17 May, 2019; originally announced May 2019.

    Comments: accepted for Interspeech 2019