Showing 1–38 of 38 results for author: Kawahara, T

  1. arXiv:2506.21191

    cs.CL cs.SD eess.AS

    Prompt-Guided Turn-Taking Prediction

    Authors: Koji Inoue, Mikey Elmers, Yahui Fu, Zi Haur Pang, Divesh Lala, Keiko Ochi, Tatsuya Kawahara

    Abstract: Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit cont…

    Submitted 3 July, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

    Comments: This paper has been accepted for presentation at SIGdial Meeting on Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author's version of the work

  2. arXiv:2505.13983

    cs.SD eess.AS

    Combining Deterministic Enhanced Conditions with Dual-Streaming Encoding for Diffusion-Based Speech Enhancement

    Authors: Hao Shi, Xugang Lu, Kazuki Shimada, Tatsuya Kawahara

    Abstract: Diffusion-based speech enhancement (SE) models need to incorporate correct prior knowledge as reliable conditions to generate accurate predictions. However, providing reliable conditions using noisy features is challenging. One solution is to use features enhanced by deterministic methods as conditions. However, the information distortion and loss caused by deterministic methods might affect the d…

    Submitted 20 May, 2025; originally announced May 2025.

  3. arXiv:2505.13978

    cs.SD eess.AS

    Bridging Speech Emotion Recognition and Personality: Dataset and Temporal Interaction Condition Network

    Authors: Yuan Gao, Hao Shi, Yahui Fu, Chenhui Chu, Tatsuya Kawahara

    Abstract: This study investigates the interaction between personality traits and emotional expression, exploring how personality information can improve speech emotion recognition (SER). We collected personality annotation for the IEMOCAP dataset, and the statistical analysis identified significant correlations between personality traits and emotional expressions. To extract finegrained personality features…

    Submitted 20 May, 2025; originally announced May 2025.

  4. arXiv:2501.16643

    cs.CL cs.AI cs.SD eess.AS

    An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue

    Authors: Koji Inoue, Divesh Lala, Mikey Elmers, Keiko Ochi, Tatsuya Kawahara

    Abstract: Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addresse…

    Submitted 18 March, 2025; v1 submitted 27 January, 2025; originally announced January 2025.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2025 (IWSDS 2025) and represents the author's version of the work

  5. arXiv:2410.15929

    cs.CL cs.HC cs.SD eess.AS

    Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

    Authors: Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara

    Abstract: In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel…

    Submitted 5 February, 2025; v1 submitted 21 October, 2024; originally announced October 2024.

    Comments: This paper has been accepted for presentation at the main conference of 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025) and represents the author's version of the work

  6. Efficient and Robust Long-Form Speech Recognition with Hybrid H3-Conformer

    Authors: Tomoki Honda, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Recently, Conformer has achieved state-of-the-art performance in many speech recognition tasks. However, the Transformer-based models show significant deterioration for long-form speech, such as lectures, because the self-attention mechanism becomes unreliable with the computation of the square order of the input length. To solve the problem, we incorporate a kind of state-space model, Hungry Hung…

    Submitted 5 October, 2024; originally announced October 2024.

    Comments: Submitted to InterSpeech2024, Sample code is available at https://github.com/mirrormouse/Hybrid-H3-Conformer

  7. arXiv:2410.01365

    eess.IV cs.CV

    Anti-biofouling Lensless Camera System with Deep Learning based Image Reconstruction

    Authors: Naoki Ide, Tomohiro Kawahara, Hiroshi Ueno, Daiki Yanagidaira, Susumu Takatsuka

    Abstract: In recent years, there has been an increasing demand for underwater cameras that monitor the condition of offshore structures and check the number of individuals in aqua culture environments with long-period observation. One of the significant issues with this observation is that biofouling sticks to the aperture and lens densely and prevents cameras from capturing clear images. This study examine…

    Submitted 2 October, 2024; originally announced October 2024.

    Comments: 9 pages, 8 figures, Ocean Optics 2024

  8. arXiv:2409.08039

    cs.SD eess.AS

    Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

    Authors: Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, Tatsuya Kawahara

    Abstract: This study presents an innovative Zero-Shot any-to-any Singing Voice Conversion (SVC) method, leveraging a novel clustering-based phoneme representation to effectively separate content, timbre, and singing style. This approach enables precise voice characteristic manipulation. We discovered that datasets with fewer recordings per artist are more susceptible to timbre leakage. Extensive testing on…

    Submitted 14 October, 2024; v1 submitted 12 September, 2024; originally announced September 2024.

  9. arXiv:2409.00815

    cs.SD cs.AI eess.AS

    Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

    Authors: Hao Shi, Yuan Gao, Zhaoheng Ni, Tatsuya Kawahara

    Abstract: Serialized output training (SOT) attracts increasing attention due to its convenience and flexibility for multi-speaker automatic speech recognition (ASR). However, it is not easy to train with attention loss only. In this paper, we propose the overlapped encoding separation (EncSep) to fully utilize the benefits of the connectionist temporal classification (CTC) and attention hybrid loss. This ad…

    Submitted 10 September, 2024; v1 submitted 1 September, 2024; originally announced September 2024.

  10. arXiv:2408.16180

    eess.AS cs.CL cs.SD

    Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

    Authors: Yuka Ko, Sheng Li, Chao-Han Huck Yang, Tatsuya Kawahara

    Abstract: With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text u…

    Submitted 11 October, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

  11. arXiv:2403.06487

    cs.CL cs.SD eess.AS

    Multilingual Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The re…

    Submitted 14 March, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

    Comments: This paper has been accepted for presentation at The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) and represents the author's version of the work

  12. arXiv:2402.18275

    cs.SD cs.CL eess.AS

    Exploration of Adapter for Noise Robust Automatic Speech Recognition

    Authors: Hao Shi, Tatsuya Kawahara

    Abstract: Adapting an automatic speech recognition (ASR) system to unseen noise environments is crucial. Integrating adapters into neural networks has emerged as a potent technique for transfer learning. This study thoroughly investigates adapter-based ASR adaptation in noisy environments. We conducted experiments using the CHiME-4 dataset. The results show that inserting the adapter in the shallow layer y…

    Submitted 4 June, 2024; v1 submitted 28 February, 2024; originally announced February 2024.

  13. arXiv:2401.13249

    eess.AS cs.MM

    MOS-FAD: Improving Fake Audio Detection Via Automatic Mean Opinion Score Prediction

    Authors: Wangjin Zhou, Zhengdong Yang, Chenhui Chu, Sheng Li, Raj Dabre, Yi Zhao, Tatsuya Kawahara

    Abstract: Automatic Mean Opinion Score (MOS) prediction is employed to evaluate the quality of synthetic speech. This study extends the application of predicted MOS to the task of Fake Audio Detection (FAD), as we expect that MOS can be used to assess how close synthesized speech is to the natural human voice. We propose MOS-FAD, where MOS can be leveraged at two key points in FAD: training data selection a…

    Submitted 24 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Accepted in ICASSP2024

  14. arXiv:2401.04868

    cs.CL cs.HC cs.SD eess.AS

    Real-time and Continuous Turn-taking Prediction Using Voice Activity Projection

    Authors: Koji Inoue, Bing'er Jiang, Erik Ekstedt, Tatsuya Kawahara, Gabriel Skantze

    Abstract: A demonstration of a real-time and continuous turn-taking prediction system is presented. The system is based on a voice activity projection (VAP) model, which directly maps dialogue stereo audio to future voice activities. The VAP model includes contrastive predictive coding (CPC) and self-attention transformers, followed by a cross-attention transformer. We examine the effect of the input contex…

    Submitted 9 January, 2024; originally announced January 2024.

    Comments: This paper has been accepted for presentation at International Workshop on Spoken Dialogue Systems Technology 2024 (IWSDS 2024) and represents the author's version of the work

  15. arXiv:2309.09223

    cs.SD eess.AS

    Zero- and Few-shot Sound Event Localization and Detection

    Authors: Kazuki Shimada, Kengo Uchida, Yuichiro Koyama, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji, Tatsuya Kawahara

    Abstract: Sound event localization and detection (SELD) systems estimate direction-of-arrival (DOA) and temporal activation for sets of target classes. Neural network (NN)-based SELD systems have performed well in various sets of target classes, but they only output the DOA and temporal activation of preset classes trained before inference. To customize target classes after training, we tackle zero- and few…

    Submitted 17 January, 2024; v1 submitted 17 September, 2023; originally announced September 2023.

    Comments: 5 pages, 4 figures, accepted for publication in IEEE ICASSP 2024

  16. arXiv:2305.10734

    cs.SD cs.CL eess.AS

    Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

    Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji

    Abstract: Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us…

    Submitted 28 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  17. arXiv:2303.14593

    cs.SD eess.AS

    Time-domain Speech Enhancement Assisted by Multi-resolution Frequency Encoder and Decoder

    Authors: Hao Shi, Masato Mimura, Longbiao Wang, Jianwu Dang, Tatsuya Kawahara

    Abstract: Time-domain speech enhancement (SE) has recently been intensively investigated. Among recent works, DEMUCS introduces multi-resolution STFT loss to enhance performance. However, some resolutions used for STFT contain non-stationary signals, and it is challenging to learn multi-resolution frequency losses simultaneously with only one output. For better use of multi-resolution frequency information,…

    Submitted 25 March, 2023; originally announced March 2023.

  18. arXiv:2303.00146

    cs.HC cs.RO cs.SD eess.AS

    I Know Your Feelings Before You Do: Predicting Future Affective Reactions in Human-Computer Dialogue

    Authors: Yuanchao Li, Koji Inoue, Leimin Tian, Changzeng Fu, Carlos Ishi, Hiroshi Ishiguro, Tatsuya Kawahara, Catherine Lai

    Abstract: Current Spoken Dialogue Systems (SDSs) often serve as passive listeners that respond only after receiving user speech. To achieve human-like dialogue, we propose a novel future prediction architecture that allows an SDS to anticipate future affective reactions based on its current behaviors before the user speaks. In this work, we investigate two scenarios: speech and laughter. In speech, we propo…

    Submitted 17 December, 2024; v1 submitted 28 February, 2023; originally announced March 2023.

    Comments: Accepted to CHI2023 Late-Breaking Work

  19. arXiv:2209.04062

    cs.CL cs.SD eess.AS

    Non-autoregressive Error Correction for CTC-based ASR with Phone-conditioned Masked LM

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive in automatic speech recognition (ASR) because of their non-autoregressive nature. To take advantage of text-only data, language model (LM) integration approaches such as rescoring and shallow fusion have been widely used for CTC. However, they lose CTC's non-autoregressive nature because of the need for beam search, which slo…

    Submitted 8 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  20. arXiv:2209.02030

    cs.CL cs.SD eess.AS

    Distilling the Knowledge of BERT for CTC-based ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Connectionist temporal classification (CTC)-based models are attractive because of their fast inference in automatic speech recognition (ASR). Language model (LM) integration approaches such as shallow fusion and rescoring can improve the recognition accuracy of CTC-based ASR by taking advantage of the knowledge in text corpora. However, they significantly slow down the inference of CTC. In this…

    Submitted 5 September, 2022; originally announced September 2022.

  21. arXiv:2207.03169

    eess.AS cs.CL cs.SD

    End-to-end Speech-to-Punctuated-Text Recognition

    Authors: Jumon Nozaki, Tatsuya Kawahara, Kenkichi Ishizuka, Taiichi Hashimoto

    Abstract: Conventional automatic speech recognition systems do not produce punctuation marks which are important for the readability of the speech recognition results. They are also needed for subsequent natural language processing tasks such as machine translation. There have been a lot of works on punctuation prediction models that insert punctuation marks into speech recognition results as post-processin…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: Accepted to INTERSPEECH2022

  22. arXiv:2110.01857

    cs.CL eess.AS

    ASR Rescoring and Confidence Estimation with ELECTRA

    Authors: Hayato Futami, Hirofumi Inaguma, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: In automatic speech recognition (ASR) rescoring, the hypothesis with the fewest errors should be selected from the n-best list using a language model (LM). However, LMs are usually trained to maximize the likelihood of correct word sequences, not to detect ASR errors. We propose an ASR rescoring method for directly detecting errors with ELECTRA, which is originally a pre-training method for NLP ta…

    Submitted 5 October, 2021; originally announced October 2021.

    Comments: Accepted in ASRU2021

  23. arXiv:2109.04411

    eess.AS cs.CL cs.SD

    Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems such as inference latency reduction. However, conventional AR decoding methods are not fast enough because each token is generated incrementally. NAR models, however, can accelera…

    Submitted 9 September, 2021; originally announced September 2021.

  24. arXiv:2107.07509

    eess.AS cs.CL cs.SD

    VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-…

    Submitted 15 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  25. arXiv:2107.00635

    eess.AS cs.CL cs.SD

    StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control t…

    Submitted 15 July, 2021; v1 submitted 1 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  26. arXiv:2104.06457

    cs.CL cs.SD eess.AS

    Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation

    Authors: Hirofumi Inaguma, Tatsuya Kawahara, Shinji Watanabe

    Abstract: A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus o…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL-HLT 2021 (short paper)

  27. arXiv:2103.00422

    eess.AS cs.CL cs.LG cs.SD

    Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition

    Authors: Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems. AED models have achieved competitive performance in offline scenarios by jointly optimizing all components. They have recently been extended to an online streaming framework via models such as monotonic chunkwise attention (MoChA). However, the…

    Submitted 22 August, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

  28. arXiv:2010.13047

    cs.CL cs.SD eess.AS

    Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based tr…

    Submitted 18 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted at IEEE ICASSP 2021

  29. arXiv:2008.12048

    eess.AS

    End-to-end Music-mixed Speech Recognition

    Authors: Jeongwoo Woo, Masato Mimura, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: Automatic speech recognition (ASR) in multimedia content is one of the promising applications, but speech data in this kind of content are frequently mixed with background music, which is harmful for the performance of ASR. In this study, we propose a method for improving ASR with background music based on time-domain source separation. We utilize Conv-TasNet as a separation network, which has ach…

    Submitted 27 August, 2020; originally announced August 2020.

    Comments: Submitted to APSIPA 2020

  30. arXiv:2008.03822

    cs.CL eess.AS

    Distilling the Knowledge of BERT for Sequence-to-Sequence ASR

    Authors: Hayato Futami, Hirofumi Inaguma, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Attention-based sequence-to-sequence (seq2seq) models have achieved promising results in automatic speech recognition (ASR). However, as these models decode in a left-to-right way, they do not have access to context on the right. We leverage both left and right context by applying BERT as an external language model to seq2seq ASR through knowledge distillation. In our proposed method, BERT generat…

    Submitted 9 August, 2020; originally announced August 2020.

    Comments: Accepted in INTERSPEECH2020

  31. arXiv:2005.09394

    eess.AS cs.CL cs.LG cs.SD

    Enhancing Monotonic Multihead Attention for Streaming ASR

    Authors: Hirofumi Inaguma, Masato Mimura, Tatsuya Kawahara

    Abstract: We investigate a monotonic multihead attention (MMA) by extending hard monotonic attention to Transformer-based automatic speech recognition (ASR) for online streaming applications. For streaming inference, all monotonic attention (MA) heads should learn proper alignments because the next token is not generated until all heads detect the corresponding token boundaries. However, we found not all MA…

    Submitted 30 September, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted to Interspeech 2020

  32. arXiv:2005.09256

    eess.AS cs.CL

    Generative Adversarial Training Data Adaptation for Very Low-resource Automatic Speech Recognition

    Authors: Kohei Matsuura, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: It is important to transcribe and archive speech data of endangered languages for preserving heritages of verbal culture and automatic speech recognition (ASR) is a powerful tool to facilitate this process. However, since endangered languages do not generally have large corpora with many speakers, the performance of ASR models trained on them are considerably poor in general. Nevertheless, we are…

    Submitted 31 July, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: Accepted for Interspeech 2020

  33. arXiv:2004.11419

    cs.SD cs.CL eess.AS

    End-to-end speech-to-dialog-act recognition

    Authors: Viet-Trung Dang, Tianyu Zhao, Sei Ueno, Hirofumi Inaguma, Tatsuya Kawahara

    Abstract: Spoken language understanding, which extracts intents and/or semantic concepts in utterances, is conventionally formulated as a post-processing of automatic speech recognition. It is usually trained with oracle transcripts, but needs to deal with errors by ASR. Moreover, there are acoustic features which are related with intents but not represented with the transcripts. In this paper, we present a…

    Submitted 28 July, 2020; v1 submitted 23 April, 2020; originally announced April 2020.

  34. arXiv:2002.06675

    cs.CL cs.SD eess.AS

    Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language

    Authors: Kohei Matsuura, Sei Ueno, Masato Mimura, Shinsuke Sakai, Tatsuya Kawahara

    Abstract: Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of…

    Submitted 16 May, 2020; v1 submitted 16 February, 2020; originally announced February 2020.

    Comments: Accepted in LREC 2020

  35. arXiv:1910.00254

    cs.CL eess.AS

    Multilingual End-to-End Speech Translation

    Authors: Hirofumi Inaguma, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: In this paper, we propose a simple yet effective framework for multilingual end-to-end speech translation (ST), in which speech utterances in source languages are directly translated to the desired target languages with a universal sequence-to-sequence architecture. While multilingual models have shown to be useful for automatic speech recognition (ASR) and machine translation (MT), this is the fi…

    Submitted 31 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

    Comments: Accepted to ASRU 2019

  36. arXiv:1907.05599

    eess.AS cs.CL cs.LG cs.SD

    Effective Incorporation of Speaker Information in Utterance Encoding in Dialog

    Authors: Tianyu Zhao, Tatsuya Kawahara

    Abstract: In dialog studies, we often encode a dialog using a hierarchical encoder where each utterance is converted into an utterance vector, and then a sequence of utterance vectors is converted into a dialog vector. Since knowing who produced which utterance is essential to understanding a dialog, conventional methods tried integrating speaker labels into utterance vectors. We found the method problemati…

    Submitted 12 July, 2019; originally announced July 2019.

    Comments: 8+1 pages, 3 figures, and 5 tables. Rejected by SIGDIAL 2019

  37. arXiv:1903.09341

    cs.SD cs.LG eess.AS stat.ML

    Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition

    Authors: Kazuki Shimada, Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper describes multichannel speech enhancement for improving automatic speech recognition (ASR) in noisy environments. Recently, the minimum variance distortionless response (MVDR) beamforming has widely been used because it works well if the steering vector of speech and the spatial covariance matrix (SCM) of noise are given. To estimating such spatial information, conventional studies take…

    Submitted 31 March, 2019; v1 submitted 21 March, 2019; originally announced March 2019.

  38. arXiv:1710.11439

    cs.SD cs.LG eess.AS stat.ML

    Statistical Speech Enhancement Based on Probabilistic Integration of Variational Autoencoder and Non-Negative Matrix Factorization

    Authors: Yoshiaki Bando, Masato Mimura, Katsutoshi Itoyama, Kazuyoshi Yoshii, Tatsuya Kawahara

    Abstract: This paper presents a statistical method of single-channel speech enhancement that uses a variational autoencoder (VAE) as a prior distribution on clean speech. A standard approach to speech enhancement is to train a deep neural network (DNN) to take noisy speech as input and output clean speech. Although this supervised approach requires a very large amount of pair data for training, it is not ro…

    Submitted 19 March, 2018; v1 submitted 31 October, 2017; originally announced October 2017.

    Comments: 5 pages, 3 figures, version that Eqs. (9), (19), and (20) in v2 (submitted to ICASSP 2018) are corrected. Samples available here: http://sap.ist.i.kyoto-u.ac.jp/members/yoshiaki/demo/vae-nmf/