Showing 1–34 of 34 results for author: Tanaka, K

Searching in archive eess.

  1. arXiv:2412.08343  [pdf, other]

    cs.GR cs.SD eess.AS

    SyncViolinist: Music-Oriented Violin Motion Generation Based on Bowing and Fingering

    Authors: Hiroki Nishizawa, Keitaro Tanaka, Asuka Hirata, Shugo Yamaguchi, Qi Feng, Masatoshi Hamanaka, Shigeo Morishima

    Abstract: Automatically generating realistic musical performance motion can greatly enhance digital media production, often involving collaboration between professionals and musicians. However, capturing the intricate body, hand, and finger movements required for accurate musical performances is challenging. Existing methods often fall short due to the complex mapping between audio and motion, typically req…

    Submitted 11 December, 2024; originally announced December 2024.

    Comments: 10 pages, 7 figures, 6 tables, WACV 2025

  2. arXiv:2409.02245  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS stat.ML

    FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

    Abstract: Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to…

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted to Interspeech 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/

  3. arXiv:2403.16464  [pdf, other]

    cs.SD cs.LG eess.AS

    Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka

    Abstract: A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solutio…

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted to ICASSP 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/

  4. arXiv:2312.16852  [pdf, other]

    cs.LG cs.HC eess.SP

    Sensor Data Simulation for Anomaly Detection of the Elderly Living Alone

    Authors: Kai Tanaka, Mineichi Kudo, Keigo Kimura

    Abstract: With the increase of the number of elderly people living alone around the world, there is a growing demand for sensor-based detection of anomalous behaviors. Although smart homes with ambient sensors could be useful for detecting such anomalies, there is a problem of lack of sufficient real data for developing detection algorithms. For coping with this problem, several sensor data simulators have…

    Submitted 28 December, 2023; originally announced December 2023.

    Comments: 26 pages, 10 figures

    Journal ref: IEEE Internet of Things Journal, 11-19 (2024), pp. 31675-31686

  5. arXiv:2308.07117  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via t…

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet2/

  6. arXiv:2306.06495  [pdf, other]

    eess.AS cs.SD

    Audio-Visual Speech Enhancement With Selective Off-Screen Speech Extraction

    Authors: Tomoya Yoshinaga, Keitaro Tanaka, Shigeo Morishima

    Abstract: This paper describes an audio-visual speech enhancement (AV-SE) method that estimates from noisy input audio a mixture of the speech of the speaker appearing in an input video (on-screen target speech) and of a selected speaker not appearing in the video (off-screen target speech). Although conventional AV-SE methods have suppressed all off-screen sounds, it is necessary to listen to a specific pr…

    Submitted 10 June, 2023; originally announced June 2023.

    Comments: Accepted by EUSIPCO 2023

  7. Improving the Gap in Visual Speech Recognition Between Normal and Silent Speech Based on Metric Learning

    Authors: Sara Kashiwagi, Keitaro Tanaka, Qi Feng, Shigeo Morishima

    Abstract: This paper presents a novel metric learning approach to address the performance gap between normal and silent speech in visual speech recognition (VSR). The difference in lip movements between the two poses a challenge for existing VSR models, which exhibit degraded accuracy when applied to silent speech. To solve this issue and tackle the scarcity of training data for silent speech, we propose to…

    Submitted 16 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

    Comments: Accepted by INTERSPEECH 2023

  8. arXiv:2303.13909  [pdf, other]

    cs.SD cs.LG eess.AS eess.SP

    Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: In speech synthesis, a generative adversarial network (GAN), training a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) to scrutinize waveforms from multiple perspectives. Such discriminato…

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/

  9. arXiv:2302.12482  [pdf, other]

    eess.IV cs.CV

    Disease Severity Regression with Continuous Data Augmentation

    Authors: Shumpei Takezaki, Kiyohito Tanaka, Seiichi Uchida, Takeaki Kadota

    Abstract: Disease severity regression by a convolutional neural network (CNN) for medical images requires a sufficient number of image samples labeled with severity levels. Conditional generative adversarial network (cGAN)-based data augmentation (DA) is a possible solution, but it encounters two issues. The first issue is that existing cGANs cannot deal with real-valued severity levels as their conditions,…

    Submitted 24 February, 2023; originally announced February 2023.

    Comments: Accepted at ISBI2023

  10. arXiv:2206.06533  [pdf, other]

    cs.CV cs.RO eess.IV

    3D scene reconstruction from monocular spherical video with motion parallax

    Authors: Kenji Tanaka

    Abstract: In this paper, we describe a method to capture nearly entirely spherical (360 degree) depth information using two adjacent frames from a single spherical video with motion parallax. After illustrating a spherical depth information retrieval using two spherical cameras, we demonstrate monocular spherical stereo by using stabilized first-person video footage. Experiments demonstrated that the depth…

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: 13 pages, 18 figures

    ACM Class: I.4.1; I.4.5

  11. arXiv:2204.11789  [pdf, ps, other]

    quant-ph cs.MA cs.RO eess.SY stat.CO

    Travel time optimization on multi-AGV routing by reverse annealing

    Authors: Renichiro Haba, Masayuki Ohzeki, Kazuyuki Tanaka

    Abstract: Quantum annealing has been actively researched since D-Wave Systems produced the first commercial machine in 2011. Controlling a large fleet of automated guided vehicles is one of the real-world applications utilizing quantum annealing. In this study, we propose a formulation to control the traveling routes to minimize the travel time. We validate our formulation through simulation in a virtual pl…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: 11 pages, 5 figures, 1 table

    Journal ref: Scientific Reports, 12(1), 17753 (2022)

  12. arXiv:2203.02395  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

    Authors: Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki

    Abstract: In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-…

    Submitted 4 March, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/

  13. Realistic Endoscopic Image Generation Method Using Virtual-to-real Image-domain Translation

    Authors: Masahiro Oda, Kiyohito Tanaka, Hirotsugu Takabatake, Masaki Mori, Hiroshi Natori, Kensaku Mori

    Abstract: This paper proposes a realistic image generation method for visualization in endoscopic simulation systems. Endoscopic diagnosis and treatment are performed in many hospitals. To reduce complications related to endoscope insertions, endoscopic simulation systems are used for training or rehearsal of endoscope insertions. However, current simulation systems generate non-realistic virtual endoscopic…

    Submitted 13 January, 2022; originally announced January 2022.

    Comments: Accepted paper as an oral presentation at the Joint MICCAI workshop MIAR | AE-CAI | CARE 2019

    Journal ref: Healthcare Technology Letters, Vol.6, No.6, pp.214-219, 2019

  14. Depth Estimation from Single-shot Monocular Endoscope Image Using Image Domain Adaptation And Edge-Aware Depth Estimation

    Authors: Masahiro Oda, Hayato Itoh, Kiyohito Tanaka, Hirotsugu Takabatake, Masaki Mori, Hiroshi Natori, Kensaku Mori

    Abstract: We propose a depth estimation method from a single-shot monocular endoscopic image using Lambertian surface translation by domain adaptation and depth estimation using multi-scale edge loss. We employ a two-step estimation process including Lambertian surface translation from unpaired data and depth estimation. The texture and specular reflection on the surface of an organ reduce the accuracy of d…

    Submitted 12 January, 2022; originally announced January 2022.

    Comments: Accepted paper as an oral presentation at Joint MICCAI workshop 2021, AE-CAI/CARE/OR2.0

    Journal ref: Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 2021

  15. Order-Guided Disentangled Representation Learning for Ulcerative Colitis Classification with Limited Labels

    Authors: Shota Harada, Ryoma Bise, Hideaki Hayashi, Kiyohito Tanaka, Seiichi Uchida

    Abstract: Ulcerative colitis (UC) classification, which is an important task for endoscopic diagnosis, involves two main difficulties. First, endoscopic images with the annotation about UC (positive or negative) are usually limited. Second, they show a large variability in their appearance due to the location in the colon. Especially, the second difficulty prevents us from using existing semi-supervised lea…

    Submitted 2 March, 2023; v1 submitted 6 November, 2021; originally announced November 2021.

    Comments: Accepted by MICCAI 2021

  16. Manifold-Aware Deep Clustering: Maximizing Angles between Embedding Vectors Based on Regular Simplex

    Authors: Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi

    Abstract: This paper presents a new deep clustering (DC) method called manifold-aware DC (M-DC) that can enhance hyperspace utilization more effectively than the original DC. The original DC has a limitation in that a pair of two speakers has to be embedded having an orthogonal relationship due to its use of the one-hot vector-based loss function, while our method derives a unique loss function aimed at max…

    Submitted 16 October, 2023; v1 submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted by Interspeech 2021

  17. arXiv:2104.06900  [pdf, ps, other]

    cs.SD eess.AS

    FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko

    Abstract: This paper proposes a non-autoregressive extension of our previously proposed sequence-to-sequence (S2S) model-based voice conversion (VC) methods. S2S model-based VC methods have attracted particular attention in recent years for their flexibility in converting not only the voice identity but also the pitch contour and local duration of input speech, thanks to the ability of the encoder-decoder a…

    Submitted 14 April, 2021; originally announced April 2021.

  18. arXiv:2102.12841  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion d…

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

  19. arXiv:2010.11672  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-sp…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to Interspeech 2020. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html

  20. arXiv:2010.02977  [pdf, ps, other]

    cs.SD eess.AS

    VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

    Abstract: In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predic…

    Submitted 9 March, 2024; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: For more details on the baseline method used for comparison, please refer to our article in arXiv:2008.12604

  21. arXiv:2008.12604  [pdf, ps, other]

    eess.AS stat.ML

    Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: We previously proposed a method that allows for nonparallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. The main features of our method, called StarGAN-VC, are as follows: First, it requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training. Second, it can simultaneously learn mappings across mu…

    Submitted 10 November, 2020; v1 submitted 27 August, 2020; originally announced August 2020.

    Comments: Submitted to IEEE/ACM Trans. ASLP. This paper is an extended full-paper version of arXiv:1806.02169

  22. arXiv:2005.08445  [pdf, ps, other]

    eess.AS cs.SD stat.ML

    Many-to-Many Voice Transformer Network

    Authors: Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

    Abstract: This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a m…

    Submitted 6 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to IEEE/ACM Trans. ASLP. Please also refer to our related article: arXiv:1811.01609

  23. arXiv:2003.12913  [pdf, other]

    eess.SY eess.SP

    Beamformed mmWave System Propagation at 60GHz in an Office Environment

    Authors: Syed Hashim Ali Shah, Sarankumar Balakrishnan, Liangxiao Xin, Mohamed Abouelseoud, Kazuyuki Sakoda, Ken Tanaka, Christopher Slezak, Sundeep Rangan, Shivendra Panwar

    Abstract: Millimeter wave wireless systems rely heavily on directional communication in narrow steerable beams. Tools to measure the spatial and temporal nature of the channel are necessary to evaluate beamforming and related algorithms. This paper presents a novel 60 GHz phased-array based directional channel sounder and data analysis procedure that can accurately extract paths and their transmit and recei…

    Submitted 28 March, 2020; originally announced March 2020.

    Comments: This paper has been accepted for presentation at IEEE ICC 2020 in Dublin, Ireland, from June 7 to June 11, 2020

  24. arXiv:1911.12906  [pdf]

    eess.IV cs.CV

    Enhancing Passive Non-Line-of-Sight Imaging Using Polarization Cues

    Authors: Kenichiro Tanaka, Yasuhiro Mukaigawa, Achuta Kadambi

    Abstract: This paper presents a method of passive non-line-of-sight (NLOS) imaging using polarization cues. A key observation is that the oblique light has a different polarimetric signal. It turns out this effect is due to the polarization axis rotation, a phenomenon which can be used to better condition the light transport matrix for non-line-of-sight imaging. Our analysis and results show that the use of…

    Submitted 28 November, 2019; originally announced November 2019.

  25. arXiv:1911.01601  [pdf, other]

    eess.AS cs.CR cs.SD eess.SP

    ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

    Authors: Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Hector Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sebastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, et al. (15 additional authors not shown)

    Abstract: Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to imperso…

    Submitted 14 July, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

    Comments: Accepted, Computer Speech and Language. This manuscript version is made available under the CC-BY-NC-ND 4.0. For the published version on Elsevier website, please visit https://doi.org/10.1016/j.csl.2020.101114

  26. arXiv:1907.12279  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem only using a single generator. H…

    Submitted 7 August, 2019; v1 submitted 29 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html

  27. arXiv:1904.04631  [pdf, other]

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time ali…

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html

  28. arXiv:1904.02892  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive model and high-quality speech synthesis with the adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones…

    Submitted 8 April, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH2019

  29. arXiv:1811.04076  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper describes a method based on a sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerat…

    Submitted 9 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP2019

  30. arXiv:1811.01609  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is sui…

    Submitted 6 October, 2020; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/document/9113442

  31. arXiv:1809.10288  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

    Authors: Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Hirokazu Kameoka

    Abstract: We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework such as statistical parametric speech synthesis and voice conversion are convenient especially for a limited number of data because it is possible to represent and process interpretable acoustic features over a compact…

    Submitted 28 September, 2018; v1 submitted 25 September, 2018; originally announced September 2018.

    Comments: SLT2018

  32. arXiv:1808.05092  [pdf, ps, other]

    stat.ML cs.LG cs.SD eess.AS

    ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time depende…

    Submitted 10 October, 2020; v1 submitted 13 August, 2018; originally announced August 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/abstract/document/8718381. Please also refer to our related articles: arXiv:1806.02169, arXiv:2008.12604

  33. arXiv:1806.02169  [pdf, ps, other]

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across dif…

    Submitted 29 June, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

  34. arXiv:1804.02181  [pdf, ps, other]

    eess.SP cs.LG stat.ML

    Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms

    Authors: Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Hiroyasu Ando

    Abstract: In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. One widely used approach for dealing with the signal reconstruction problem was proposed by Griffin and Lim. This meth…

    Submitted 6 April, 2018; originally announced April 2018.