
Showing 51–70 of 70 results for author: Takamichi, S

  1. arXiv:2102.05872  [pdf, ps, other]

    cs.SD eess.AS

    Onoma-to-wave: Environmental sound synthesis from onomatopoeic words

    Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, Yoichi Yamashita

    Abstract: In this paper, we propose a framework for environmental sound synthesis from onomatopoeic words. As one way of expressing an environmental sound, we can use an onomatopoeic word, which is a character sequence for phonetically imitating a sound. An onomatopoeic word is effective for describing diverse sound features. Therefore, using onomatopoeic words for environmental sound synthesis will enable…

    Submitted 7 February, 2022; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: Accepted to APSIPA Transactions on Signal and Information Processing

  2. arXiv:2102.04051  [pdf, other]

    cs.HC cs.LG cs.SD eess.AS

    HumanACGAN: conditional generative adversarial network with human-based auxiliary classifier and its evaluation in phoneme perception

    Authors: Yota Ueda, Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

    Abstract: We propose a conditional generative adversarial network (GAN) incorporating humans' perceptual evaluations. A deep neural network (DNN)-based generator of a GAN can represent a real-data distribution accurately but can never represent a human-acceptable distribution, which are ranges of data in which humans accept the naturalness regardless of whether the data are real or not. A HumanGAN was propo…

    Submitted 8 February, 2021; originally announced February 2021.

    Comments: 5 pages, 6 figures, to be published in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing

  3. Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model

    Authors: Takaaki Saeki, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: This letter presents an incremental text-to-speech (TTS) method that performs synthesis in small linguistic units while maintaining the naturalness of output speech. Incremental TTS is generally subject to a trade-off between latency and synthetic speech quality. It is challenging to produce high-quality speech with a low-latency setup that does not make much use of an unobserved future sentence (…

    Submitted 14 April, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

    Comments: Accepted for IEEE Signal Processing Letters

  4. arXiv:2010.01793  [pdf, other]

    eess.AS cs.SD

    JSSS: free Japanese speech corpus for summarization and simplification

    Authors: Shinnosuke Takamichi, Mamoru Komachi, Naoko Tanji, Hiroshi Saruwatari

    Abstract: In this paper, we construct a new Japanese speech corpus for speech-based summarization and simplification, "JSSS" (pronounced "j-triple-s"). Given the success of reading-style speech synthesis from short-form sentences, we aim to design more difficult tasks for delivering information to humans. Our corpus contains voices recorded for two tasks that have a role in providing information under const…

    Submitted 5 October, 2020; originally announced October 2020.

  5. arXiv:2007.04719  [pdf, ps, other]

    cs.SD eess.AS

    RWCP-SSD-Onomatopoeia: Onomatopoeic Word Dataset for Environmental Sound Synthesis

    Authors: Yuki Okamoto, Keisuke Imoto, Shinnosuke Takamichi, Ryosuke Yamanishi, Takahiro Fukumori, Yoichi Yamashita

    Abstract: Environmental sound synthesis is a technique for generating a natural environmental sound. Conventional work on environmental sound synthesis using sound event labels cannot finely control synthesized sounds, for example, the pitch and timbre. We consider that onomatopoeic words can be used for environmental sound synthesis. Onomatopoeic words are effective for explaining the feature of sounds. We…

    Submitted 9 July, 2020; originally announced July 2020.

    Comments: Submitted to DCASE2020 workshop

  6. arXiv:2006.02959  [pdf, other]

    cs.SD eess.AS

    PJS: phoneme-balanced Japanese singing voice corpus

    Authors: Junya Koguchi, Shinnosuke Takamichi

    Abstract: This paper presents a free Japanese singing voice corpus that can be used for highly applicable and reproducible singing voice synthesis research. A singing voice corpus helps develop singing voice synthesis, but existing corpora have two critical problems: data imbalance (singing voice corpora do not guarantee phoneme balance, unlike speaking-voice corpora) and copyright issues (cannot legally sh…

    Submitted 4 June, 2020; originally announced June 2020.

  7. arXiv:2002.06778  [pdf, other]

    cs.SD eess.AS

    Lifter Training and Sub-band Modeling for Computationally Efficient and High-Quality Voice Conversion Using Spectral Differentials

    Authors: Takaaki Saeki, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: In this paper, we propose computationally efficient and high-quality methods for statistical voice conversion (VC) with direct waveform modification based on spectral differentials. The conventional method with a minimum-phase filter achieves high-quality conversion but requires heavy computation in filtering. This is because the minimum phase using a fixed lifter of the Hilbert transform often re…

    Submitted 17 February, 2020; originally announced February 2020.

    Comments: 5 pages, to appear in IEEE International Conference on Acoustics, Speech, and Signal Processing 2020 (ICASSP 2020)

  8. arXiv:2001.07044  [pdf, ps, other]

    cs.SD eess.AS

    JVS-MuSiC: Japanese multispeaker singing-voice corpus

    Authors: Hiroki Tamaru, Shinnosuke Takamichi, Naoko Tanji, Hiroshi Saruwatari

    Abstract: Thanks to developments in machine learning techniques, it has become possible to synthesize high-quality singing voices of a single singer. An open multispeaker singing-voice corpus would further accelerate the research in singing-voice synthesis. However, conventional singing-voice corpora only consist of the singing voices of a single singer. We designed a Japanese multispeaker singing-voice cor…

    Submitted 20 January, 2020; originally announced January 2020.

  9. arXiv:1909.11391  [pdf, ps, other]

    cs.SD cs.NE eess.AS

    HumanGAN: generative adversarial network with human-based discriminator and its evaluation in speech perception modeling

    Authors: Kazuki Fujii, Yuki Saito, Shinnosuke Takamichi, Yukino Baba, Hiroshi Saruwatari

    Abstract: We propose the HumanGAN, a generative adversarial network (GAN) incorporating human perception as a discriminator. A basic GAN trains a generator to represent a real-data distribution by fooling the discriminator that distinguishes real and generated data. Therefore, the basic GAN cannot represent the outside of a real-data distribution. In the case of speech perception, humans can recognize not o…

    Submitted 25 September, 2019; originally announced September 2019.

    Comments: Submitted to IEEE ICASSP 2020

  10. arXiv:1908.10055  [pdf, ps, other]

    cs.SD eess.AS

    Overview of Tasks and Investigation of Subjective Evaluation Methods in Environmental Sound Synthesis and Conversion

    Authors: Yuki Okamoto, Keisuke Imoto, Tatsuya Komatsu, Shinnosuke Takamichi, Takumi Yagyu, Ryosuke Yamanishi, Yoichi Yamashita

    Abstract: Synthesizing and converting environmental sounds have the potential for many applications such as supporting movie and game production, data augmentation for sound event detection and scene classification. Conventional works on synthesizing and converting environmental sounds are based on a physical modeling or concatenative approach. However, there are a limited number of works that have addresse…

    Submitted 27 August, 2019; originally announced August 2019.

  11. arXiv:1908.06248  [pdf, other]

    cs.SD eess.AS

    JVS corpus: free Japanese multi-speaker voice corpus

    Authors: Shinnosuke Takamichi, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, Hiroshi Saruwatari

    Abstract: Thanks to improvements in machine learning techniques, including deep learning, speech synthesis is becoming a machine learning task. To accelerate speech synthesis research, we are developing Japanese voice corpora reasonably accessible from not only academic institutions but also commercial companies. In 2017, we released the JSUT corpus, which contains 10 hours of reading-style speech uttered b…

    Submitted 17 August, 2019; originally announced August 2019.

  12. arXiv:1908.01454  [pdf, ps, other]

    cs.SD cs.CR cs.LG eess.AS

    V2S attack: building DNN-based voice conversion from automatic speaker verification

    Authors: Taiki Nakamura, Yuki Saito, Shinnosuke Takamichi, Yusuke Ijima, Hiroshi Saruwatari

    Abstract: This paper presents a new voice impersonation attack using voice conversion (VC). Enrolling personal voices for automatic speaker verification (ASV) offers natural and flexible biometric authentication systems. Basically, the ASV systems do not include the users' voice data. However, if the ASV system is unexpectedly exposed and hacked by a malicious attacker, there is a risk that the attacker wil…

    Submitted 4 August, 2019; originally announced August 2019.

    Comments: 5 pages, 2 figures, accepted for The 10th ISCA Speech Synthesis Workshop (SSW10)

  13. arXiv:1907.08294  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    DNN-based Speaker Embedding Using Subjective Inter-speaker Similarity for Multi-speaker Modeling in Speech Synthesis

    Authors: Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: This paper proposes novel algorithms for speaker embedding using subjective inter-speaker similarity based on deep neural networks (DNNs). Although conventional DNN-based speaker embedding such as a $d$-vector can be applied to multi-speaker modeling in speech synthesis, it does not correlate with the subjective inter-speaker similarity and is not necessarily appropriate speaker representation for…

    Submitted 19 July, 2019; originally announced July 2019.

    Comments: 6 pages, 7 figures, accepted for The 10th ISCA Speech Synthesis Workshop (SSW10)

  14. arXiv:1902.03389  [pdf, ps, other]

    cs.SD cs.AI cs.LG cs.MM cs.NE eess.AS

    Generative Moment Matching Network-based Random Modulation Post-filter for DNN-based Singing Voice Synthesis and Neural Double-tracking

    Authors: Hiroki Tamaru, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: This paper proposes a generative moment matching network (GMMN)-based post-filter that provides inter-utterance pitch variation for deep neural network (DNN)-based singing voice synthesis. The natural pitch variation of a human singing voice leads to a richer musical experience and is used in double-tracking, a recording method in which two performances of the same phrase are recorded and mixed to…

    Submitted 9 February, 2019; originally announced February 2019.

    Comments: 5 pages, to appear in IEEE ICASSP 2019 (Paper Code: SLP-P22.11, Session: Speech Synthesis III)

  15. arXiv:1807.03474  [pdf, ps, other]

    cs.SD eess.AS

    Phase reconstruction from amplitude spectrograms based on von-Mises-distribution deep neural network

    Authors: Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Daichi Kitamura, Hiroshi Saruwatari

    Abstract: This paper presents a deep neural network (DNN)-based phase reconstruction from amplitude spectrograms. In audio signal and speech processing, the amplitude spectrogram is often used for processing, and the corresponding phase spectrogram is reconstructed from the amplitude spectrogram on the basis of the Griffin-Lim method. However, the Griffin-Lim method causes unnatural artifacts in synthetic s…

    Submitted 10 July, 2018; originally announced July 2018.

    Comments: To appear in the Proc. of IWAENC2018

  16. arXiv:1806.10307  [pdf, other]

    eess.AS cs.SD

    Independent Deeply Learned Matrix Analysis for Multichannel Audio Source Separation

    Authors: Shinichi Mogami, Hayato Sumino, Daichi Kitamura, Norihiro Takamune, Shinnosuke Takamichi, Hiroshi Saruwatari, Nobutaka Ono

    Abstract: In this paper, we address a multichannel audio source separation task and propose a new efficient method called independent deeply learned matrix analysis (IDLMA). IDLMA estimates the demixing matrix in a blind manner and updates the time-frequency structures of each source using a pretrained deep neural network (DNN). Also, we introduce a complex Student's t-distribution as a generalized source g…

    Submitted 27 June, 2018; originally announced June 2018.

    Comments: 5 pages, 4 figures, To appear in the Proceedings of the 26th European Signal Processing Conference (EUSIPCO 2018)

  17. arXiv:1711.00354  [pdf, ps, other]

    cs.CL

    JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

    Authors: Ryosuke Sonobe, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: Thanks to improvements in machine learning techniques including deep learning, a free large-scale speech corpus that can be shared between academic institutions and commercial companies has an important role. However, such a corpus for Japanese speech synthesis does not exist. In this paper, we designed a novel Japanese speech corpus, named the "JSUT corpus," that is aimed at achieving end-to-end…

    Submitted 28 October, 2017; originally announced November 2017.

    Comments: Submitted to ICASSP2018

  18. arXiv:1709.08041  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

    Authors: Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: A method for statistical parametric speech synthesis incorporating generative adversarial networks (GANs) is proposed. Although powerful deep neural networks (DNNs) techniques can be applied to artificially synthesize speech waveform, the synthetic speech quality is low compared with that of natural speech. One of the issues causing the quality degradation is an over-smoothing effect often observe…

    Submitted 23 September, 2017; originally announced September 2017.

    Comments: Preprint manuscript of IEEE/ACM Transactions on Audio, Speech and Language Processing

  19. arXiv:1704.03626  [pdf, ps, other]

    cs.SD cs.LG stat.ML

    Sampling-based speech parameter generation using moment-matching networks

    Authors: Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari

    Abstract: This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech even if we try to express the same linguistic and para-linguistic information, typical statistical speech synthesis produces completely the same speech, i.e., there is no inter-utterance variation i…

    Submitted 12 April, 2017; originally announced April 2017.

    Comments: Submitted to INTERSPEECH 2017

  20. arXiv:1704.02360  [pdf, ps, other]

    cs.SD cs.CL cs.LG

    Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

    Authors: Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

    Abstract: Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior probabilities estimated from the source speech parameters. Although conventional VC can be built from non-parallel data, it is difficult to convert speaker individuality su…

    Submitted 6 August, 2017; v1 submitted 10 April, 2017; originally announced April 2017.

    Comments: Accepted to INTERSPEECH 2017