-
QuickVC: Any-to-many Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion
Authors:
Houjian Guo,
Chaoran Liu,
Carlos Toshinori Ishi,
Hiroshi Ishiguro
Abstract:
With the development of automatic speech recognition (ASR) and text-to-speech (TTS) technology, high-quality voice conversion (VC) can be achieved by extracting source content information and target speaker information to reconstruct waveforms. However, current methods still require improvement in terms of inference speed. In this study, we propose a lightweight VITS-based VC model that uses the H…
▽ More
With the development of automatic speech recognition (ASR) and text-to-speech (TTS) technology, high-quality voice conversion (VC) can be achieved by extracting source content information and target speaker information to reconstruct waveforms. However, current methods still require improvement in terms of inference speed. In this study, we propose a lightweight VITS-based VC model that uses the HuBERT-Soft model to extract content information features without speaker information. Through subjective and objective experiments on synthesized speech, the proposed model demonstrates competitive results in terms of naturalness and similarity. Importantly, unlike the original VITS model, we use the inverse short-time Fourier transform (iSTFT) to replace the most computationally expensive part. Experimental results show that our model can generate samples at over 5000 kHz on the 3090 GPU and over 250 kHz on the i9-10900K CPU, achieving competitive speed for the same hardware configuration.
△ Less
Submitted 23 February, 2023; v1 submitted 16 February, 2023;
originally announced February 2023.
-
CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer
Authors:
Changzeng Fu,
Chaoran Liu,
Carlos Toshinori Ishi,
Hiroshi Ishiguro
Abstract:
In this study, we explore the transformer's ability to capture intra-relations among frames by augmenting the receptive field of models. Concretely, we propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task. In the training procedure, we adopt curriculum learning to gradually increase the frame length so that the model can see from th…
▽ More
In this study, we explore the transformer's ability to capture intra-relations among frames by augmenting the receptive field of models. Concretely, we propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task. In the training procedure, we adopt curriculum learning to gradually increase the frame length so that the model can see from the short segment till the entire speech. The proposed method was evaluated on the Japanese emotional speech dataset and compared to several baselines (ACVAE, CycleGAN) with objective and subjective evaluations. The results show that our proposed model is able to convert emotion with higher strength and quality.
△ Less
Submitted 30 November, 2021;
originally announced November 2021.
-
SeMemNN: A Semantic Matrix-Based Memory Neural Network for Text Classification
Authors:
Changzeng Fu,
Chaoran Liu,
Carlos Toshinori Ishi,
Yuichiro Yoshikawa,
Hiroshi Ishiguro
Abstract:
Text categorization is the task of assigning labels to documents written in a natural language, and it has numerous real-world applications including sentiment analysis as well as traditional topic assignment tasks. In this paper, we propose 5 different configurations for the semantic matrix-based memory neural network with end-to-end learning manner and evaluate our proposed method on two corpora…
▽ More
Text categorization is the task of assigning labels to documents written in a natural language, and it has numerous real-world applications including sentiment analysis as well as traditional topic assignment tasks. In this paper, we propose 5 different configurations for the semantic matrix-based memory neural network with end-to-end learning manner and evaluate our proposed method on two corpora of news articles (AG news, Sogou news). The best performance of our proposed method outperforms the baseline VDCNN models on the text classification task and gives a faster speed for learning semantics. Moreover, we also evaluate our model on small scale datasets. The results show that our proposed method can still achieve better results in comparison to VDCNN on the small scale dataset. This paper is to appear in the Proceedings of the 2020 IEEE 14th International Conference on Semantic Computing (ICSC 2020), San Diego, California, 2020.
△ Less
Submitted 3 March, 2020;
originally announced March 2020.