Showing 1–50 of 70 results for author: Takamichi, S

  1. arXiv:2505.17446  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Exploring the Effect of Segmentation and Vocabulary Size on Speech Tokenization for Speech Language Models

    Authors: Shunsuke Kando, Yusuke Miyao, Shinnosuke Takamichi

    Abstract: The purpose of speech tokenization is to transform a speech signal into a sequence of discrete representations, serving as the foundation for speech language models (SLMs). While speech tokenization has many options, their effect on the performance of SLMs remains unclear. This paper investigates two key aspects of speech tokenization: the segmentation width and the cluster size of discrete units.…

    Submitted 31 May, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

    Comments: Accepted to Interspeech2025

  2. arXiv:2410.23279  [pdf, other]

    cs.SD cs.AI eess.AS

    A Transformer Model for Segmentation, Classification, and Caller Identification of Marmoset Vocalization

    Authors: Bin Wu, Shinnosuke Takamichi, Sakriani Sakti, Satoshi Nakamura

    Abstract: Marmoset, a highly vocalized primate, has become a popular animal model for studying social-communicative behavior and its underlying mechanism comparing with human infant linguistic developments. In the study of vocal communication, it is vital to know the caller identities, call contents, and vocal exchanges. Previous work of a CNN has achieved a joint model for call segmentation, classification…

    Submitted 21 November, 2024; v1 submitted 30 October, 2024; originally announced October 2024.

  3. arXiv:2409.17285  [pdf, other]

    cs.SD cs.AI eess.AS

    SpoofCeleb: Speech Deepfake Detection and SASV In The Wild

    Authors: Jee-weon Jung, Yihan Wu, Xin Wang, Ji-Hoon Kim, Soumi Maiti, Yuta Matsunaga, Hye-jin Shim, Jinchuan Tian, Nicholas Evans, Joon Son Chung, Wangyou Zhang, Seyun Um, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: This paper introduces SpoofCeleb, a dataset designed for Speech Deepfake Detection (SDD) and Spoofing-robust Automatic Speaker Verification (SASV), utilizing source data from real-world conditions and spoofing attacks generated by Text-To-Speech (TTS) systems also trained on the same real-world data. Robust recognition systems require speech data recorded in varied acoustic environments with diffe…

    Submitted 15 April, 2025; v1 submitted 18 September, 2024; originally announced September 2024.

    Comments: IEEE OJSP. Official document lives at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10839331

  4. arXiv:2409.09988  [pdf, other]

    eess.AS cs.SD

    DNN-based ensemble singing voice synthesis with interactions between singers

    Authors: Hiroaki Hyodo, Shinnosuke Takamichi, Tomohiko Nakamura, Junya Koguchi, Hiroshi Saruwatari

    Abstract: We propose a singing voice synthesis (SVS) method for a more unified ensemble singing voice by modeling interactions between singers. Most existing SVS methods aim to synthesize a solo voice, and do not consider interactions between singers, i.e., adjusting one's own voice to the others' voices. Since the production of ensemble voices from solo singing voices ignores the interactions, it can degra…

    Submitted 16 September, 2024; originally announced September 2024.

  5. arXiv:2409.08711  [pdf, ps, other]

    eess.AS cs.AI

    Text-To-Speech Synthesis In The Wild

    Authors: Jee-weon Jung, Wangyou Zhang, Soumi Maiti, Yihan Wu, Xin Wang, Ji-Hoon Kim, Yuta Matsunaga, Seyun Um, Jinchuan Tian, Hye-jin Shim, Nicholas Evans, Joon Son Chung, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: Traditional Text-to-Speech (TTS) systems rely on studio-quality speech recorded in controlled settings. Recently, an effort known as noisy-TTS training has emerged, aiming to utilize in-the-wild data. However, the lack of dedicated datasets has been a significant limitation. We introduce the TTS In the Wild (TITW) dataset, which is publicly available, created through a fully automated pipeline ap…

    Submitted 1 June, 2025; v1 submitted 13 September, 2024; originally announced September 2024.

    Comments: 5 pages, Interspeech 2025

  6. arXiv:2409.05377  [pdf, other]

    eess.AS cs.SD

    BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec

    Authors: Detai Xin, Xu Tan, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present BigCodec, a low-bitrate neural speech codec. While recent neural speech codecs have shown impressive progress, their performance significantly deteriorates at low bitrates (around 1 kbps). Although a low bitrate inherently restricts performance, other factors, such as model capacity, also hinder further improvements. To address this problem, we scale up the model size to 159M parameters…

    Submitted 9 September, 2024; originally announced September 2024.

    Comments: 4 pages, 1 figure. Audio samples available at: https://aria-k-alethia.github.io/bigcodec-demo/

  7. arXiv:2408.06858  [pdf, other]

    eess.AS

    SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis

    Authors: Osamu Take, Shinnosuke Takamichi, Kentaro Seki, Yoshiaki Bando, Hiroshi Saruwatari

    Abstract: This paper presents SaSLaW, a spontaneous dialogue speech corpus containing synchronous recordings of what speakers speak, listen to, and watch. Humans consider the diverse environmental factors and then control the features of their utterances in face-to-face voice communications. Spoken dialogue systems capable of this adaptation to these audio environments enable natural and seamless communicat…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: 5 pages, accepted for INTERSPEECH 2024

  8. arXiv:2407.15828  [pdf, other]

    cs.CL cs.SD eess.AS

    J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling

    Authors: Wataru Nakata, Kentaro Seki, Hitomi Yanaka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: Spoken dialogue plays a crucial role in human-AI interactions, necessitating dialogue-oriented spoken language models (SLMs). To develop versatile SLMs, large-scale and diverse speech datasets are essential. Additionally, to ensure high-quality speech generation, the data must be spontaneous like in-wild data and must be acoustically clean with noise removed. Despite the critical need, no open-sou…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 8 pages, 6 figures

  9. arXiv:2407.10118  [pdf, other]

    cs.CL

    Textless Dependency Parsing by Labeled Sequence Prediction

    Authors: Shunsuke Kando, Yusuke Miyao, Jason Naradowsky, Shinnosuke Takamichi

    Abstract: Traditional spoken language processing involves cascading an automatic speech recognition (ASR) system into text processing models. In contrast, "textless" methods process speech representations without ASR systems, enabling the direct use of acoustic speech features. Although their effectiveness is shown in capturing acoustic features, it is unclear in capturing lexical knowledge. This paper prop…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Accepted to Interspeech 2024

  10. arXiv:2407.04270  [pdf, other]

    eess.AS cs.SD

    Who Finds This Voice Attractive? A Large-Scale Experiment Using In-the-Wild Data

    Authors: Hitoshi Suda, Aya Watanabe, Shinnosuke Takamichi

    Abstract: This paper introduces CocoNut-Humoresque, an open-source large-scale speech likability corpus that includes speech segments and their per-listener likability scores. Evaluating voice likability is essential to designing preferable voices for speech systems, such as dialogue or announcement systems. In this study, we let 885 listeners rate 1800 speech segments of a wide range of speakers regarding…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted at Interspeech 2024

  11. arXiv:2406.17722  [pdf, other]

    cs.SD eess.AS

    Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

    Authors: Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari

    Abstract: This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms, ignoring the stereo listening experience inherent in human hearing. Our baseline approach addresses this gap by integrating blind source separation (BSS), voice conve…

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  12. arXiv:2406.07280  [pdf, ps, other]

    cs.SD eess.AS

    Noise-Robust Voice Conversion by Conditional Denoising Training Using Latent Variables of Recording Quality and Environment

    Authors: Takuto Igarashi, Yuki Saito, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during the training. To this end, our proposed…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 5 pages, accepted for INTERSPEECH 2024, audio samples: http://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/IS2024_CDT_supplementary/demo_cdt.html

  13. arXiv:2406.07254  [pdf, ps, other]

    cs.SD eess.AS

    SRC4VC: Smartphone-Recorded Corpus for Voice Conversion Benchmark

    Authors: Yuki Saito, Takuto Igarashi, Kentaro Seki, Shinnosuke Takamichi, Ryuichi Yamamoto, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present SRC4VC, a new corpus containing 11 hours of speech recorded on smartphones by 100 Japanese speakers. Although high-quality multi-speaker corpora can advance voice conversion (VC) technologies, they are not always suitable for testing VC when low-quality speech recording is given as the input. To this end, we first asked 100 crowdworkers to record their voice samples using smartphones. T…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted for INTERSPEECH 2024, corpus project page: https://y-saito.sakura.ne.jp/sython/Corpus/SRC4VC/index.html

  14. arXiv:2406.00899  [pdf, other]

    cs.CL cs.SD eess.AS

    YODAS: Youtube-Oriented Dataset for Audio and Speech

    Authors: Xinjian Li, Shinnosuke Takamichi, Takaaki Saeki, William Chen, Sayaka Shiota, Shinji Watanabe

    Abstract: In this study, we introduce YODAS (YouTube-Oriented Dataset for Audio and Speech), a large-scale, multilingual dataset comprising currently over 500k hours of speech data in more than 100 languages, sourced from both labeled and unlabeled YouTube speech datasets. The labeled subsets, including manual or automatic subtitles, facilitate supervised model training. Conversely, the unlabeled subsets ar…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: ASRU 2023

  15. arXiv:2404.03204  [pdf, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

    Authors: Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li, Sheng Zhao

    Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. Th…

    Submitted 19 May, 2024; v1 submitted 4 April, 2024; originally announced April 2024.

  16. arXiv:2403.13353  [pdf, other]

    cs.SD eess.AS

    Building speech corpus with diverse voice characteristics for its prompt-based representation

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech synthesis, the ability to control voice characteristics is vital for various applications. By leveraging thriving text prompt-based generation techniques, it should be possible to enhance the nuanced control of voice characteristics. While previous research has explored the prompt-based manipulation of voice characteristics, most studies have used pre-recorded speech, which limit…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing. arXiv admin note: text overlap with arXiv:2309.13509

  17. arXiv:2401.16812  [pdf, other]

    cs.SD eess.AS

    SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

    Authors: Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

    Abstract: While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that are highly correlated with human subjective judgments due to their cost efficiency. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTS…

    Submitted 1 September, 2024; v1 submitted 30 January, 2024; originally announced January 2024.

    Comments: Accepted by Interspeech 2024. An extended version with Appendix. Code: https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics

  18. JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

    Authors: Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari

    Abstract: We present the JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs) that are essential expressions in spoken language to express emotions. We propose an automatic script generation method to…

    Submitted 9 October, 2023; originally announced October 2023.

  19. arXiv:2309.13509  [pdf, other]

    cs.SD eess.AS

    Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Wataru Nakata, Detai Xin, Hiroshi Saruwatari

    Abstract: In text-to-speech, controlling voice characteristics is important in achieving various-purpose speech synthesis. Considering the success of text-conditioned generation, such as text-to-image, free-form text instruction should be useful for intuitive and complicated control of voice characteristics. A sufficiently large corpus of high-quality and diverse voice samples with corresponding free-form d…

    Submitted 23 September, 2023; originally announced September 2023.

    Comments: Submitted to ASRU2023

  20. arXiv:2309.09690  [pdf, other]

    cs.CL cs.SD eess.AS

    Do learned speech symbols follow Zipf's law?

    Authors: Shinnosuke Takamichi, Hiroki Maeda, Joonyong Park, Daisuke Saito, Hiroshi Saruwatari

    Abstract: In this study, we investigate whether speech symbols, learned through deep learning, follow Zipf's law, akin to natural language symbols. Zipf's law is an empirical law that delineates the frequency distribution of words, forming fundamentals for statistical analysis in natural language processing. Natural language symbols, which are invented by humans to symbolize speech content, are recognized t…

    Submitted 18 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  21. arXiv:2309.08127  [pdf, other]

    cs.SD eess.AS

    Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for extracting a lightweight subset from a text-to-speech (TTS) corpus ensuring synthetic speech quality. In recent years, methods have been proposed for constructing large-scale TTS corpora by collecting diverse data from massive sources such as audiobooks and YouTube. Although these methods have gained significant attention for enhancing the expressive capabilities o…

    Submitted 14 September, 2023; originally announced September 2023.

  22. arXiv:2306.12169  [pdf, other]

    cs.HC

    HumanDiffusion: diffusion model using perceptual gradients

    Authors: Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari

    Abstract: We propose HumanDiffusion, a diffusion model trained from humans' perceptual gradients to learn an acceptable range of data for humans (i.e., human-acceptable distribution). Conventional HumanGAN aims to model the human-acceptable distribution wider than the real-data distribution by training a neural network-based generator with human-based discriminators. However, HumanGAN training tends t…

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: Proceedings of INTERSPEECH

  23. arXiv:2306.00697  [pdf, other]

    cs.CL cs.AI eess.AS

    How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics

    Authors: Joonyong Park, Shinnosuke Takamichi, Tomohiko Nakamura, Kentaro Seki, Detai Xin, Hiroshi Saruwatari

    Abstract: We examine the speech modeling potential of generative spoken language modeling (GSLM), which involves using learned symbols derived from data rather than phonemes for speech analysis and synthesis. Since GSLM facilitates textless spoken language processing, exploring its effectiveness is critical for paving the way for novel paradigms in spoken-language processing. This paper presents the finding…

    Submitted 1 June, 2023; originally announced June 2023.

    Comments: Accepted to INTERSPEECH 2023

  24. arXiv:2305.13724  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    ChatGPT-EDSS: Empathetic Dialogue Speech Synthesis Trained from ChatGPT-derived Context Word Embeddings

    Authors: Yuki Saito, Shinnosuke Takamichi, Eiji Iimori, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose ChatGPT-EDSS, an empathetic dialogue speech synthesis (EDSS) method using ChatGPT for extracting dialogue context. ChatGPT is a chatbot that can deeply understand the content and purpose of an input prompt and appropriately respond to the user's request. We focus on ChatGPT's reading comprehension and introduce it to EDSS, a task of synthesizing speech that can empathize with the interl…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH 2023

  25. arXiv:2305.13713  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    CALLS: Japanese Empathetic Dialogue Speech Corpus of Complaint Handling and Attentive Listening in Customer Center

    Authors: Yuki Saito, Eiji Iimori, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present CALLS, a Japanese speech corpus that considers phone calls in a customer center as a new domain of empathetic spoken dialogue. The existing STUDIES corpus covers only empathetic dialogue between a teacher and student in a school. To extend the application range of empathetic dialogue speech synthesis (EDSS), we designed our corpus to include the same female speaker as the STUDIES teache…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 5 pages, accepted for INTERSPEECH2023

  26. arXiv:2305.12445  [pdf, other]

    cs.SD eess.AS

    JNV Corpus: A Corpus of Japanese Nonverbal Vocalizations with Diverse Phrases and Emotions

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present JNV (Japanese Nonverbal Vocalizations) corpus, a corpus of Japanese nonverbal vocalizations (NVs) with diverse phrases and emotions. Existing Japanese NV corpora lack phrase or emotion diversity, which makes it difficult to analyze NVs and support downstream tasks like emotion recognition. We first propose a corpus-design method that contains two phases: (1) collecting NVs phrases based…

    Submitted 21 May, 2023; originally announced May 2023.

    Comments: 4 pages, 3 figures

  27. arXiv:2305.12442  [pdf, other]

    cs.SD eess.AS

    Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus

    Authors: Detai Xin, Shinnosuke Takamichi, Ai Morimatsu, Hiroshi Saruwatari

    Abstract: We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild corpus comprising 3.5 hours of laughter, which is to our best knowledge the largest laughter corpus designed for laughter synthesis. We then propo…

    Submitted 26 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  28. arXiv:2305.00302  [pdf, ps, other]

    cs.SD eess.AS

    Environmental sound synthesis from vocal imitations and sound event labels

    Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryotaro Nagase, Takahiro Fukumori, Yoichi Yamashita

    Abstract: One way of expressing an environmental sound is using vocal imitations, which involve the process of replicating or mimicking the rhythm and pitch of sounds by voice. We can effectively express the features of environmental sounds, such as rhythm and pitch, using vocal imitations, which cannot be expressed by conventional input information, such as sound event labels, images, or texts, in an envir…

    Submitted 14 September, 2023; v1 submitted 29 April, 2023; originally announced May 2023.

    Comments: Submitted to ICASSP2024

  29. arXiv:2304.12521  [pdf, other]

    cs.SD eess.AS

    Foley Sound Synthesis at the DCASE 2023 Challenge

    Authors: Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, Shinnosuke Takamichi

    Abstract: The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been produced by human Foley artists, which involves manual recording and mixing of sound. However, recent advances in sound synthesis and generative models have generated interest in machine-assisted or automatic F…

    Submitted 28 September, 2023; v1 submitted 24 April, 2023; originally announced April 2023.

    Comments: DCASE 2023 Challenge - Task 7 - Technical Report (Submitted to DCASE 2023 Workshop)

  30. arXiv:2301.12596  [pdf, other]

    eess.AS cs.CL

    Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

    Authors: Takaaki Saeki, Soumi Maiti, Xinjian Li, Shinji Watanabe, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: While neural text-to-speech (TTS) has achieved human-like natural synthetic speech, multilingual TTS systems are limited to resource-rich languages due to the need for paired text and studio-quality audio data. This paper proposes a method for zero-shot multilingual TTS using text-only data for the target language. The use of text-only data allows the development of TTS systems for low-resource la…

    Submitted 27 May, 2023; v1 submitted 29 January, 2023; originally announced January 2023.

    Comments: To appear in IJCAI 2023

  31. JaCappella Corpus: A Japanese a Cappella Vocal Ensemble Corpus

    Authors: Tomohiko Nakamura, Shinnosuke Takamichi, Naoko Tanji, Satoru Fukayama, Hiroshi Saruwatari

    Abstract: We construct a corpus of Japanese a cappella vocal ensembles (jaCappella corpus) for vocal ensemble separation and synthesis. It consists of 35 copyright-cleared vocal ensemble songs and their audio recordings of individual voice parts. These songs were arranged from out-of-copyright Japanese children's songs and have six voice parts (lead vocal, soprano, alto, tenor, bass, and vocal percussion).…

    Submitted 24 February, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

    Comments: Accepted for ICASSP2023

    Journal ref: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Jun. 2023, 5 pages

  32. arXiv:2211.02336  [pdf, other]

    cs.SD eess.AS

    Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts

    Authors: Detai Xin, Sharath Adavanne, Federico Ang, Ashish Kulkarni, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a multi-speaker Japanese audiobook text-to-speech (TTS) system that leverages multimodal context information of preceding acoustic context and bilateral textual context to improve the prosody of synthetic speech. Previous work either uses unilateral or single-modality context, which does not fully represent the context information. The proposed method uses an acoustic context encoder an…

    Submitted 4 November, 2022; originally announced November 2022.

  33. arXiv:2210.14850  [pdf, other]

    cs.SD eess.AS

    Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection

    Authors: Kentaro Seki, Shinnosuke Takamichi, Takaaki Saeki, Hiroshi Saruwatari

    Abstract: This paper proposes a method for selecting training data for text-to-speech (TTS) synthesis from dark data. TTS models are typically trained on high-quality speech corpora that cost much time and money for data collection, which makes it very challenging to increase speaker variation. In contrast, there is a large amount of data whose availability is unknown (a.k.a. "dark data"), such as YouTube v…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  34. arXiv:2210.09916  [pdf, other]

    cs.SD eess.AS

    Mid-attribute speaker generation using optimal-transport-based interpolation of Gaussian mixture models

    Authors: Aya Watanabe, Shinnosuke Takamichi, Yuki Saito, Detai Xin, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method for intermediating multiple speakers' attributes and diversifying their voice characteristics in "speaker generation," an emerging task that aims to synthesize a nonexistent speaker's naturally sounding voice. The conventional TacoSpawn-based speaker generation method represents the distributions of speaker embeddings by Gaussian mixture models (GMMs) condition…

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023. Demo: https://sarulab-speech.github.io/demo_mid-attribute-speaker-generation

  35. arXiv:2210.09815  [pdf, other]

    cs.SD eess.AS

    Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-…

    Submitted 19 September, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted to SSW12

  36. arXiv:2210.09173  [pdf, other]

    cs.SD eess.AS

    Visual onoma-to-wave: environmental sound synthesis from visual onomatopoeias and sound-source images

    Authors: Hien Ohnaka, Shinnosuke Takamichi, Keisuke Imoto, Yuki Okamoto, Kazuki Fujii, Hiroshi Saruwatari

    Abstract: We propose a method for synthesizing environmental sounds from visually represented onomatopoeias and sound sources. An onomatopoeia is a word that imitates a sound structure, i.e., the text representation of sound. From this perspective, onoma-to-wave has been proposed to synthesize environmental sounds from the desired onomatopoeia texts. Onomatopoeias have another representation: visual-text re…

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  37. arXiv:2210.07559  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbr…

    Submitted 19 September, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    Comments: Accepted to APSIPA ASC 2022

  38. arXiv:2208.07679  [pdf, ps, other]

    cs.SD eess.AS

    How Should We Evaluate Synthesized Environmental Sounds

    Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Takahiro Fukumori, Yoichi Yamashita

    Abstract: Although several methods of environmental sound synthesis have been proposed, there has been no discussion on how synthesized environmental sounds should be evaluated. Only either subjective or objective evaluations have been conducted in conventional evaluations, and it is not clear what type of evaluation should be carried out. In this paper, we investigate how to evaluate synthesized environmen…

    Submitted 16 August, 2022; originally announced August 2022.

    Comments: Submitted to APSIPA ASC 2022

  39. arXiv:2206.10695  [pdf, other]

    cs.SD eess.AS

    Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations

    Authors: Detai Xin, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present an emotion recognition system for nonverbal vocalizations (NVs) submitted to the ExVo Few-Shot track of the ICML Expressive Vocalizations Competition 2022. The proposed method uses self-supervised learning (SSL) models to extract features from NVs and uses a classifier chain to model the label dependency between emotions. Experimental results demonstrate that the proposed method can sig…

    Submitted 21 June, 2022; originally announced June 2022.

    Comments: Accepted by the ICML Expressive Vocalizations Workshop and Competition 2022

  40. arXiv:2206.08039  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    Acoustic Modeling for End-to-End Empathetic Dialogue Speech Synthesis Using Linguistic and Prosodic Contexts of Dialogue History

    Authors: Yuto Nishimura, Yuki Saito, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We propose an end-to-end empathetic dialogue speech synthesis (DSS) model that considers both the linguistic and prosodic contexts of dialogue history. Empathy is the active attempt by humans to get inside the interlocutor in dialogue, and empathetic DSS is a technology to implement this act in spoken dialogue systems. Our model is conditioned by the history of linguistic and prosody features for…

    Submitted 16 June, 2022; originally announced June 2022.

    Comments: 5 pages, 3 figures, Accepted for INTERSPEECH2022

  41. arXiv:2204.10561  [pdf, other]

    cs.SD eess.AS

    Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

    Authors: Detai Xin, Shinnosuke Takamichi, Takuma Okamoto, Hisashi Kawai, Hiroshi Saruwatari

    Abstract: This paper presents a speaking-rate-controllable HiFi-GAN neural vocoder. Original HiFi-GAN is a high-fidelity, computationally efficient, and tiny-footprint neural vocoder. We attempt to incorporate a speaking rate control function into HiFi-GAN for improving the accessibility of synthetic speech. The proposed method inserts a differentiable interpolation layer into the HiFi-GAN architecture. A s…

    Submitted 22 April, 2022; originally announced April 2022.

    Comments: submitted to INTERSPEECH 2022

  42. arXiv:2204.02152  [pdf, other]

    cs.SD eess.AS

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

    Authors: Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tes…

    Submitted 29 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  43. arXiv:2203.14757  [pdf, ps, other]

    cs.SD cs.AI cs.CL cs.HC cs.LG

    STUDIES: Corpus of Japanese Empathetic Dialogue Speech Towards Friendly Voice Agent

    Authors: Yuki Saito, Yuto Nishimura, Shinnosuke Takamichi, Kentaro Tachibana, Hiroshi Saruwatari

    Abstract: We present STUDIES, a new speech corpus for developing a voice agent that can speak in a friendly manner. Humans naturally control their speech prosody to empathize with each other. By incorporating this "empathetic dialogue" behavior into a spoken dialogue system, we can develop a voice agent that can respond to a user more naturally. We designed the STUDIES corpus to include a speaker who speaks…

    Submitted 16 June, 2022; v1 submitted 28 March, 2022; originally announced March 2022.

    Comments: 5 pages, 2 figures, Accepted for INTERSPEECH2022, project page: http://sython.org/Corpus/STUDIES

  44. arXiv:2203.14725  [pdf, other]

    cs.SD

    vTTS: visual-text to speech

    Authors: Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

    Abstract: This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text.…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: submitted to INTERSPEECH 2022

  45. arXiv:2203.12937  [pdf, other]

    cs.SD eess.AS

    SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Tomohiko Nakamura, Naoko Tanji, Hiroshi Saruwatari

    Abstract: We present a self-supervised speech restoration method without paired speech corpora. Because the previous general speech restoration method uses artificial paired data created by applying various distortions to high-quality speech corpora, it cannot sufficiently represent acoustic distortions of real data, limiting the applicability. Our model consists of analysis, synthesis, and channel modules…

    Submitted 27 June, 2022; v1 submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022

  46. arXiv:2203.09961  [pdf, other]

    cs.SD eess.AS

    Personalized Filled-pause Generation with Group-wise Prediction Models

    Authors: Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: In this paper, we propose a method to generate personalized filled pauses (FPs) with group-wise prediction models. Compared with fluent text generation, disfluent text generation has not been widely explored. To generate more human-like texts, we addressed disfluent text generation. The usage of disfluency, such as FPs, rephrases, and word fragments, differs from speaker to speaker, and thus, the…

    Submitted 22 April, 2022; v1 submitted 18 March, 2022; originally announced March 2022.

    Comments: Accepted to LREC 2022

  47. arXiv:2201.10896  [pdf, other]

    cs.SD eess.AS

    J-MAC: Japanese multi-speaker audiobook corpus for speech synthesis

    Authors: Shinnosuke Takamichi, Wataru Nakata, Naoko Tanji, Hiroshi Saruwatari

    Abstract: In this paper, we construct a Japanese audiobook speech corpus called "J-MAC" for speech synthesis research. With the success of reading-style speech synthesis, the research target is shifting to tasks that use complicated contexts. Audiobook speech synthesis is a good example that requires cross-sentence, expressiveness, etc. Unlike reading-style speech, speaker-specific expressiveness in audiobo…

    Submitted 26 January, 2022; originally announced January 2022.

  48. arXiv:2112.09323  [pdf, other]

    cs.SD eess.AS

    JTubeSpeech: corpus of Japanese speech collected from YouTube for speech recognition and speaker verification

    Authors: Shinnosuke Takamichi, Ludwig Kürzinger, Takaaki Saeki, Sayaka Shiota, Shinji Watanabe

    Abstract: In this paper, we construct a new Japanese speech corpus called "JTubeSpeech." Although recent end-to-end learning requires large-size speech corpora, open-sourced such corpora for languages other than English have not yet been established. In this paper, we describe the construction of a corpus from YouTube videos and subtitles for speech recognition and speaker verification. Our method can autom…

    Submitted 17 December, 2021; originally announced December 2021.

    Comments: Submitted to ICASSP2022

  49. arXiv:2110.07840  [pdf, other]

    cs.CL cs.SD eess.AS

    ESPnet2-TTS: Extending the Edge of TTS Research

    Authors: Tomoki Hayashi, Ryuichi Yamamoto, Takenori Yoshimura, Peter Wu, Jiatong Shi, Takaaki Saeki, Yooncheol Ju, Yusuke Yasuda, Shinnosuke Takamichi, Shinji Watanabe

    Abstract: This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance T…

    Submitted 14 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP2022. Demo HP: https://espnet.github.io/icassp2022-tts/

  50. arXiv:2109.10724  [pdf, other]

    cs.SD cs.CL eess.AS

    Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: Incremental text-to-speech (TTS) synthesis generates utterances in small linguistic units for the sake of real-time and low-latency applications. We previously proposed an incremental TTS method that leverages a large pre-trained language model to take unobserved future context into account without waiting for the subsequent segment. Although this method achieves comparable speech quality to that…

    Submitted 22 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU2021