Showing 1–23 of 23 results for author: Ashihara, T

  1. arXiv:2505.06660

    cs.CL cs.SD eess.AS

    TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

    Authors: Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký

    Abstract: Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the…

    Submitted 10 May, 2025; originally announced May 2025.

    Comments: Accepted at ICASSP 2025

  2. arXiv:2410.12182

    eess.AS cs.SD

    Guided Speaker Embedding

    Authors: Shota Horiguchi, Takafumi Moriya, Atsushi Ando, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a guided speaker embedding extraction system, which extracts speaker embeddings of the target speaker using speech activities of target and interference speakers as clues. Several methods for long-form overlapped multi-speaker audio processing are typically two-staged: i) segment-level processing and ii) inter-segment speaker matching. Speaker embeddings are often used for the…

    Submitted 1 January, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: Accepted to ICASSP 2025

  3. arXiv:2410.11243

    cs.SD cs.CL eess.AS

    Investigation of Speaker Representation for Target-Speaker Speech Processing

    Authors: Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato

    Abstract: Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker's speech even when it is corrupted by interfering speakers. While most studies have focused on training schemes or system architectures for each spec…

    Submitted 14 October, 2024; originally announced October 2024.

    Comments: Accepted at IEEE SLT 2024

  4. arXiv:2409.20313

    eess.AS cs.CL cs.SD

    Boosting Hybrid Autoregressive Transducer-based ASR with Internal Acoustic Model Training and Dual Blank Thresholding

    Authors: Takafumi Moriya, Takanori Ashihara, Masato Mimura, Hiroshi Sato, Kohei Matsuura, Ryo Masumura, Taichi Asami

    Abstract: A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only enhances…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: Accepted to Interspeech 2024

  5. arXiv:2409.20301

    eess.AS cs.CL cs.SD

    Alignment-Free Training for Transducer-based Multi-Talker ASR

    Authors: Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura

    Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a…

    Submitted 30 September, 2024; originally announced September 2024.

    Comments: Submitted to ICASSP 2025

  6. arXiv:2408.17142

    eess.AS cs.SD

    Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings

    Authors: Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

    Abstract: This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker's speech without pre…

    Submitted 30 August, 2024; originally announced August 2024.

    Comments: Accepted to IEEE SLT 2024

  7. arXiv:2408.00205

    cs.CL eess.AS

    Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Masato Mimura, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: This paper introduces a novel approach called sentence-wise speech summarization (Sen-SSum), which generates text summaries from a spoken document in a sentence-by-sentence manner. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Usin…

    Submitted 31 July, 2024; originally announced August 2024.

    Comments: Accepted to Interspeech 2024. Dataset: https://huggingface.co/datasets/komats/mega-ssum

  8. arXiv:2407.01857

    eess.AS cs.SD eess.SP

    SpeakerBeam-SS: Real-time Target Speaker Extraction with Lightweight Conv-TasNet and State Space Modeling

    Authors: Hiroshi Sato, Takafumi Moriya, Masato Mimura, Shota Horiguchi, Tsubasa Ochiai, Takanori Ashihara, Atsushi Ando, Kentaro Shinayama, Marc Delcroix

    Abstract: Real-time target speaker extraction (TSE) is intended to extract the desired speaker's voice from the observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging as the computational complexity must be reduced to provide real-time operation. This work introduces to Conv-TasNet-based TSE a new architecture based on state space modeling (SSM) that has been…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  9. arXiv:2407.01291

    cs.SD cs.CL cs.LG eess.AS

    Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

    Authors: Kenichi Fujita, Takanori Ashihara, Marc Delcroix, Yusuke Ijima

    Abstract: The advancements in zero-shot text-to-speech (TTS) methods, based on large-scale models, have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-a…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: 5 pages, 3 figures, Accepted to INTERSPEECH 2024

  10. arXiv:2406.18972

    eess.AS cs.CL

    Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over

    Authors: Atsunori Ogawa, Naoyuki Kamo, Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Takatomo Kano, Naohiro Tawara, Marc Delcroix

    Abstract: Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LL…

    Submitted 27 June, 2024; originally announced June 2024.

    Comments: 5 pages

  11. arXiv:2402.13200

    eess.AS cs.SD

    Probing Self-supervised Learning Models with Target Speech Extraction

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

    Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction c…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  12. arXiv:2401.17632

    cs.CL cs.SD eess.AS

    What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis

    Authors: Takanori Ashihara, Marc Delcroix, Takafumi Moriya, Kohei Matsuura, Taichi Asami, Yusuke Ijima

    Abstract: Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations. Speech SSL models, such as WavLM, employ masked prediction training to encode general-purpose representations. In contrast, speaker SSL models, exemplified by DINO-based models, adopt utterance-level training objectives primarily for speaker representation. Understanding how these model…

    Submitted 31 January, 2024; originally announced January 2024.

    Comments: Accepted at ICASSP 2024

  13. arXiv:2401.05111

    cs.SD cs.CL cs.LG eess.AS

    Noise-robust zero-shot text-to-speech synthesis conditioned on self-supervised speech-representation model with adapters

    Authors: Kenichi Fujita, Hiroshi Sato, Takanori Ashihara, Hiroki Kanagawa, Marc Delcroix, Takafumi Moriya, Yusuke Ijima

    Abstract: The zero-shot text-to-speech (TTS) method, based on speaker embeddings extracted from reference speech using self-supervised learning (SSL) speech representations, can reproduce speaker characteristics very accurately. However, this approach suffers from degradation in speech synthesis quality when the reference speech contains noise. In this paper, we propose a noise-robust zero-shot TTS method.…

    Submitted 10 January, 2024; originally announced January 2024.

    Comments: 5 pages, 3 figures, Accepted to IEEE ICASSP 2024

  14. arXiv:2306.08374

    cs.CL cs.SD eess.AS

    SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge?

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka, Yusuke Ijima, Taichi Asami, Marc Delcroix, Yukinori Honma

    Abstract: Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks, such as speech and speaker recognition. More recently, speech SSL models have also been shown to be beneficial in advancing spoken language understanding tasks, implying that the SSL models have the potential to learn not only acoustic but also linguistic information. In this paper,…

    Submitted 27 August, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

    Comments: Accepted at INTERSPEECH 2023. This paper has been extended in a subsequent journal paper, see https://ieeexplore.ieee.org/abstract/document/10597571

  15. arXiv:2306.04233

    cs.CL cs.SD eess.AS

    Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Takatomo Kano, Atsunori Ogawa, Marc Delcroix

    Abstract: End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. This approach is promising because, in contrast to the conventional cascade approach, it can utilize full acoustical information and mitigate the propagation of transcription errors. However, due to the high cost of collecting speech-summary pairs, an E2E SSum model…

    Submitted 7 June, 2023; originally announced June 2023.

    Comments: Accepted by Interspeech 2023

  16. arXiv:2305.14723

    eess.AS cs.SD

    Downstream Task Agnostic Speech Enhancement with Self-Supervised Representation Loss

    Authors: Hiroshi Sato, Ryo Masumura, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Kentaro Shinayama, Saki Mizuno, Mana Ihori, Tomohiro Tanaka, Nobukatsu Hojo

    Abstract: Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks by leveraging massive unlabeled audio data. The noise robustness of the SSL is one of the important challenges to expanding its application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits…

    Submitted 24 May, 2023; originally announced May 2023.

    Comments: 4 pages, 2 figures, Accepted to Interspeech 2023

  17. arXiv:2305.05201

    cs.CL cs.SD eess.AS

    Exploration of Language Dependency for Japanese Self-Supervised Speech Representation Models

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

    Abstract: Self-supervised learning (SSL) has been dramatically successful not only in monolingual but also in cross-lingual settings. However, since the two settings have been studied individually in general, there has been little research focusing on how effective a cross-lingual model is in comparison with a monolingual model. In this paper, we investigate this fundamental question empirically with Japane…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2023

  18. Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

    Authors: Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

    Abstract: This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style tokens still have a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to o…

    Submitted 24 April, 2023; originally announced April 2023.

    Comments: 5 pages, 3 figures, Accepted to IEEE ICASSP 2023 workshop Self-supervision in Audio, Speech and Beyond

    Journal ref: 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023, pp. 1-5

  19. arXiv:2303.00978

    cs.CL eess.AS

    Leveraging Large Text Corpora for End-to-End Speech Summarization

    Authors: Kohei Matsuura, Takanori Ashihara, Takafumi Moriya, Tomohiro Tanaka, Atsunori Ogawa, Marc Delcroix, Ryo Masumura

    Abstract: End-to-end speech summarization (E2E SSum) is a technique to directly generate summary sentences from speech. Compared with the cascade approach, which combines automatic speech recognition (ASR) and text summarization models, the E2E approach is more promising because it mitigates ASR errors, incorporates nonverbal information, and simplifies the overall system. However, since collecting a large…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023

  20. arXiv:2210.15937

    cs.CL cs.SD eess.AS

    On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

    Authors: Atsushi Ando, Ryo Masumura, Akihiko Takashima, Satoshi Suzuki, Naoki Makishima, Keita Suzuki, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato

    Abstract: This paper investigates the effectiveness and implementation of modality-specific large-scale pre-trained encoders for multimodal sentiment analysis (MSA). Although the effectiveness of pre-trained encoders in various fields has been reported, conventional MSA methods employ them for only linguistic modality, and their application has not been investigated. This paper compares the features yielded…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Accepted to SLT 2022

  21. arXiv:2207.06867

    cs.CL cs.SD eess.AS

    Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

    Authors: Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

    Abstract: Self-supervised learning (SSL) is seen as a very promising approach with high performance for several speech downstream tasks. Since the parameters of SSL models are generally so large that training and inference require a lot of memory and computational cost, it is desirable to produce compact SSL models without a significant performance degradation by applying compression methods such as knowled…

    Submitted 1 September, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  22. arXiv:2107.01569

    cs.CL cs.LG

    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition

    Authors: Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima

    Abstract: We propose a cross-modal transformer-based neural correction model that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the inp…

    Submitted 4 July, 2021; originally announced July 2021.

    Comments: Accepted to Interspeech 2021

  23. arXiv:2102.08154

    cs.CL cs.LG

    End-to-End Automatic Speech Recognition with Deep Mutual Learning

    Authors: Ryo Masumura, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Takanori Ashihara

    Abstract: This paper is the first study to apply deep mutual learning (DML) to end-to-end ASR models. In DML, multiple models are trained simultaneously and collaboratively by mimicking each other throughout the training process, which helps to attain the global optimum and prevent models from making over-confident predictions. While previous studies applied DML to simple multi-class classification problems…

    Submitted 16 February, 2021; originally announced February 2021.

    Comments: Accepted at Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp. 632-637