Showing 1–13 of 13 results for author: Koriyama, T

  1. arXiv:2507.03912  [pdf, ps, other]

    eess.AS cs.SD

    Prosody Labeling with Phoneme-BERT and Speech Foundation Models

    Authors: Tomoki Koriyama

    Abstract: This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used for training a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation mode…

    Submitted 5 July, 2025; originally announced July 2025.

    Comments: Accepted to Speech Synthesis Workshop 2025 (SSW13)

  2. arXiv:2507.03382  [pdf, ps, other]

    cs.SD eess.AS

    Speaker-agnostic Emotion Vector for Cross-speaker Emotion Intensity Control

    Authors: Masato Murata, Koichi Miyazaki, Tomoki Koriyama

    Abstract: Cross-speaker emotion intensity control aims to generate emotional speech of a target speaker with desired emotion intensities using only their neutral speech. A recently proposed method, emotion arithmetic, achieves emotion intensity control using a single-speaker emotion vector. Although this prior method has shown promising results in the same-speaker setting, it lost speaker consistency in the…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: Accepted by INTERSPEECH 2025

  3. arXiv:2507.03377  [pdf, ps, other]

    cs.SD eess.AS

    Eigenvoice Synthesis based on Model Editing for Speaker Generation

    Authors: Masato Murata, Koichi Miyazaki, Tomoki Koriyama, Tomoki Toda

    Abstract: Speaker generation task aims to create unseen speaker voice without reference speech. The key to the task is defining a speaker space that represents diverse speakers to determine the generated speaker trait. However, the effective way to define this speaker space remains unclear. Eigenvoice synthesis is one of the promising approaches in the traditional parametric synthesis framework, such as HMM…

    Submitted 4 July, 2025; originally announced July 2025.

    Comments: Accepted by INTERSPEECH 2025

  4. VAE-based Phoneme Alignment Using Gradient Annealing and SSL Acoustic Features

    Authors: Tomoki Koriyama

    Abstract: This paper presents an accurate phoneme alignment model that aims for speech analysis and video content creation. We propose a variational autoencoder (VAE)-based alignment model in which a probable path is searched using encoded acoustic and linguistic embeddings in an unsupervised manner. Our proposed model is based on one TTS alignment (OTA) and extended to obtain phoneme boundaries. Specifical…

    Submitted 25 September, 2024; v1 submitted 2 July, 2024; originally announced July 2024.

    Comments: Proceedings of Interspeech 2024

  5. arXiv:2407.00766  [pdf, other]

    cs.SD eess.AS

    An Attribute Interpolation Method in Speech Synthesis by Model Merging

    Authors: Masato Murata, Koichi Miyazaki, Tomoki Koriyama

    Abstract: With the development of speech synthesis, recent research has focused on challenging tasks, such as speaker generation and emotion intensity control. Attribute interpolation is a common approach to these tasks. However, most previous methods for attribute interpolation require specific modules or training methods. We propose an attribute interpolation method in speech synthesis by model merging. M…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Accepted by INTERSPEECH 2024

  6. arXiv:2402.00288  [pdf, other]

    eess.AS cs.SD

    Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito

    Abstract: Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model using a large speech corpus and in…

    Submitted 14 June, 2024; v1 submitted 31 January, 2024; originally announced February 2024.

    Comments: Accepted by INTERSPEECH 2024

  7. arXiv:2302.13652  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD

    Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech

    Authors: Dong Yang, Tomoki Koriyama, Yuki Saito, Takaaki Saeki, Detai Xin, Hiroshi Saruwatari

    Abstract: Pause insertion, also known as phrase break prediction and phrasing, is an essential part of TTS systems because proper pauses with natural duration significantly enhance the rhythm and intelligibility of synthetic speech. However, conventional phrasing models ignore various speakers' different styles of inserting silent pauses, which can degrade the performance of the model trained on a multi-spe…

    Submitted 27 February, 2023; originally announced February 2023.

    Comments: Accepted by ICASSP 2023

  8. arXiv:2210.17098  [pdf, other]

    cs.SD cs.LG eess.AS

    Structured State Space Decoder for Speech Recognition and Synthesis

    Authors: Koichi Miyazaki, Masato Murata, Tomoki Koriyama

    Abstract: Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, a structured state space model (S4) has been recently proposed, producing promising results for various long-sequence modeling tasks, including raw speech classification…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  9. arXiv:2204.02152  [pdf, other]

    cs.SD eess.AS

    UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

    Authors: Takaaki Saeki, Detai Xin, Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: We present the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022. The challenge is to predict the MOS values of speech samples collected from previous Blizzard Challenges and Voice Conversion Challenges for two tracks: a main track for in-domain prediction and an out-of-domain (OOD) track for which there is less labeled data from different listening tes…

    Submitted 29 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Accepted to INTERSPEECH 2022

  10. arXiv:2008.02950  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes

    Authors: Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model. Although many approaches using deep neural networks (DNNs) have been proposed, DNNs are prone to overfitting when the amount of training data is limited. We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of Bayesian ker…

    Submitted 6 August, 2020; originally announced August 2020.

    Comments: 5 pages, accepted for INTERSPEECH 2020

  11. arXiv:2004.10823  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Utterance-level Sequential Modeling For Deep Gaussian Process Based Speech Synthesis Using Simple Recurrent Unit

    Authors: Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: This paper presents a deep Gaussian process (DGP) model with a recurrent architecture for speech sequence modeling. DGP is a Bayesian deep model that can be trained effectively with the consideration of model complexity and is a kernel regression model that can have high expressibility. In the previous studies, it was shown that the DGP-based speech synthesis outperformed neural network-based one,…

    Submitted 22 April, 2020; originally announced April 2020.

    Comments: 5 pages. Accepted by ICASSP 2020

  12. arXiv:1908.06248  [pdf, other]

    cs.SD eess.AS

    JVS corpus: free Japanese multi-speaker voice corpus

    Authors: Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, Hiroshi Saruwatari

    Abstract: Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered b…

    Submitted 17 August, 2019; originally announced August 2019.

  13. arXiv:1902.03389  [pdf, ps, other]

    cs.SD cs.AI cs.LG cs.MM cs.NE eess.AS

    Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking

    Authors: Hiroki Tamaru, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: This paper proposes a generative moment matching network (GMMN)-based post-filter that provides inter-utterance pitch variation for deep neural network (DNN)-based singing voice synthesis. The natural pitch variation of a human singing voice leads to a richer musical experience and is used in double-tracking, a recording method in which two performances of the same phrase are recorded and mixed to…

    Submitted 9 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: SLP-P22.11, Session: Speech Synthesis III)