Skip to main content

Showing 1–28 of 28 results for author: Kaneko, T

Searching in archive eess. Search in all archives.
.
  1. arXiv:2409.02245  [pdf, other

    cs.SD cs.AI cs.LG eess.AS stat.ML

    FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo

    Abstract: Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by the multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to… ▽ More

    Submitted 3 September, 2024; originally announced September 2024.

    Comments: Accepted to Interspeech 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/

  2. arXiv:2403.16464  [pdf, other

    cs.SD cs.LG eess.AS

    Training Generative Adversarial Network-Based Vocoder with Limited Data Using Augmentation-Conditional Discriminator

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka

    Abstract: A generative adversarial network (GAN)-based vocoder trained with an adversarial discriminator is commonly used for speech synthesis because of its fast, lightweight, and high-quality characteristics. However, this data-driven model requires a large amount of training data incurring high data-collection costs. This fact motivates us to train a GAN-based vocoder on limited data. A promising solutio… ▽ More

    Submitted 25 March, 2024; originally announced March 2024.

    Comments: Accepted to ICASSP 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/augcondd/

  3. arXiv:2310.01821  [pdf, other

    cs.CV cs.AI cs.GR cs.LG eess.IV

    MIMO-NeRF: Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields

    Authors: Takuhiro Kaneko

    Abstract: Neural radiance fields (NeRFs) have shown impressive results for novel view synthesis. However, they depend on the repetitive use of a single-input single-output multilayer perceptron (SISO MLP) that maps 3D coordinates and view direction to the color and volume density in a sample-wise manner, which slows the rendering. We propose a multi-input multi-output NeRF (MIMO-NeRF) that reduces the numbe… ▽ More

    Submitted 3 October, 2023; originally announced October 2023.

    Comments: Accepted to ICCV 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/mimo-nerf/

  4. arXiv:2308.07117  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: The inverse short-time Fourier transform network (iSTFTNet) has garnered attention owing to its fast, lightweight, and high-fidelity speech synthesis. It obtains these characteristics using a fast and lightweight 1D CNN as the backbone and replacing some neural processes with iSTFT. Owing to the difficulty of a 1D CNN to model high-dimensional spectrograms, the frequency dimension is reduced via t… ▽ More

    Submitted 14 August, 2023; originally announced August 2023.

    Comments: Accepted to Interspeech 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet2/

  5. arXiv:2303.13909  [pdf, other

    cs.SD cs.LG eess.AS eess.SP

    Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Shogo Seki

    Abstract: In speech synthesis, a generative adversarial network (GAN), training a generator (speech synthesizer) and a discriminator in a min-max game, is widely used to improve speech quality. An ensemble of discriminators is commonly used in recent neural vocoders (e.g., HiFi-GAN) and end-to-end text-to-speech (TTS) systems (e.g., VITS) to scrutinize waveforms from multiple perspectives. Such discriminato… ▽ More

    Submitted 24 March, 2023; originally announced March 2023.

    Comments: Accepted to ICASSP 2023. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/waveunetd/

  6. arXiv:2209.14397  [pdf, ps, other

    eess.SP cs.CV cs.LG

    Variational Bayes for robust radar single object tracking

    Authors: Alp Sarı, Tak Kaneko, Lense H. M. Swaenen, Wouter M. Kouw

    Abstract: We address object tracking by radar and the robustness of the current state-of-the-art methods to process outliers. The standard tracking algorithms extract detections from radar image space to use it in the filtering stage. Filtering is performed by a Kalman filter, which assumes Gaussian distributed noise. However, this assumption does not account for large modeling errors and results in poor tr… ▽ More

    Submitted 28 September, 2022; originally announced September 2022.

    Comments: 6 pages, 8 figures. Published as part of the proceedings of the IEEE International Workshop on Signal Processing Systems 2022

  7. arXiv:2206.06100  [pdf, other

    cs.CV cs.AI cs.GR cs.LG eess.IV

    AR-NeRF: Unsupervised Learning of Depth and Defocus Effects from Natural Images with Aperture Rendering Neural Radiance Fields

    Authors: Takuhiro Kaneko

    Abstract: Fully unsupervised 3D representation learning has gained attention owing to its advantages in data collection. A successful approach involves a viewpoint-aware approach that learns an image distribution based on generative models (e.g., generative adversarial networks (GANs)) while generating various view images based on 3D-aware models (e.g., neural radiance fields (NeRFs)). However, they require… ▽ More

    Submitted 13 June, 2022; originally announced June 2022.

    Comments: Accepted to CVPR 2022. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/ar-nerf/

  8. arXiv:2203.02395  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

    Authors: Takuhiro Kaneko, Kou Tanaka, Hirokazu Kameoka, Shogo Seki

    Abstract: In recent text-to-speech synthesis and voice conversion systems, a mel-spectrogram is commonly applied as an intermediate representation, and the necessity for a mel-spectrogram vocoder is increasing. A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion. A typical convolutional mel-… ▽ More

    Submitted 4 March, 2022; originally announced March 2022.

    Comments: Accepted to ICASSP 2022. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/istftnet/

  9. arXiv:2106.13041  [pdf, other

    cs.CV cs.LG eess.IV stat.ML

    Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks

    Authors: Takuhiro Kaneko

    Abstract: Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoi… ▽ More

    Submitted 24 June, 2021; originally announced June 2021.

    Comments: Accepted to CVPR 2021 (Oral). Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/ar-gan/

  10. arXiv:2104.06900  [pdf, ps, other

    cs.SD eess.AS

    FastS2S-VC: Streaming Non-Autoregressive Sequence-to-Sequence Voice Conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko

    Abstract: This paper proposes a non-autoregressive extension of our previously proposed sequence-to-sequence (S2S) model-based voice conversion (VC) methods. S2S model-based VC methods have attracted particular attention in recent years for their flexibility in converting not only the voice identity but also the pitch contour and local duration of input speech, thanks to the ability of the encoder-decoder a… ▽ More

    Submitted 14 April, 2021; originally announced April 2021.

  11. arXiv:2102.12841  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However, owing to their insufficient ability to grasp time-frequency structures, their application is limited to mel-cepstrum conversion and not mel-spectrogram conversion d… ▽ More

    Submitted 25 February, 2021; originally announced February 2021.

    Comments: Accepted to ICASSP 2021. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/maskcyclegan-vc/index.html

  12. arXiv:2010.11672  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising results regarding this problem and have been widely used as benchmark methods. However, owing to the ambiguity of the effectiveness of CycleGAN-VC/VC2 for mel-sp… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted to Interspeech 2020. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc3/index.html

  13. arXiv:2010.02977  [pdf, ps, other

    cs.SD eess.AS

    VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki

    Abstract: In this paper, we propose a non-parallel any-to-many voice conversion (VC) method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel waveform generation method, VoiceGrad is based upon the concepts of score matching and Langevin dynamics. It uses weighted denoising score matching to train a score approximator, a fully convolutional network with a U-Net structure designed to predic… ▽ More

    Submitted 9 March, 2024; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: For more details on the baseline method used for comparison, please refer to our article in arXiv:2008.12604

  14. arXiv:2008.12604  [pdf, ps, other

    eess.AS stat.ML

    Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: We previously proposed a method that allows for nonparallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN. The main features of our method, called StarGAN-VC, are as follows: First, it requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training. Second, it can simultaneously learn mappings across mu… ▽ More

    Submitted 10 November, 2020; v1 submitted 27 August, 2020; originally announced August 2020.

    Comments: Submitted to IEEE/ACM Trans. ASLP. This paper is an extended full-paper version of arXiv:1806.02169

  15. arXiv:2006.05513  [pdf

    physics.med-ph cs.CV eess.IV

    A Deep Learning-Based Method for Automatic Segmentation of Proximal Femur from Quantitative Computed Tomography Images

    Authors: Chen Zhao, Joyce H. Keyak, Jinshan Tang, Tadashi S. Kaneko, Sundeep Khosla, Shreyasee Amin, Elizabeth J. Atkinson, Lan-Juan Zhao, Michael J. Serou, Chaoyang Zhang, Hui Shen, Hong-Wen Deng, Weihua Zhou

    Abstract: Purpose: Proximal femur image analyses based on quantitative computed tomography (QCT) provide a method to quantify the bone density and evaluate osteoporosis and risk of fracture. We aim to develop a deep-learning-based method for automatic proximal femur segmentation. Methods and Materials: We developed a 3D image segmentation method based on V-Net, an end-to-end fully convolutional neural netwo… ▽ More

    Submitted 1 July, 2020; v1 submitted 9 June, 2020; originally announced June 2020.

  16. arXiv:2005.08445  [pdf, ps, other

    eess.AS cs.SD stat.ML

    Many-to-Many Voice Transformer Network

    Authors: Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

    Abstract: This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously proposed an S2S-based VC method using a transformer network architecture called the voice transformer network (VTN). The original VTN was designed to learn only a m… ▽ More

    Submitted 6 November, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: submitted to IEEE/ACM Trans. ASLP. Please also refer to our related article: arXiv:1811.01609

  17. arXiv:2003.07849  [pdf, other

    cs.CV cs.LG eess.IV stat.ML

    Blur, Noise, and Compression Robust Generative Adversarial Networks

    Authors: Takuhiro Kaneko, Tatsuya Harada

    Abstract: Generative adversarial networks (GANs) have gained considerable attention owing to their ability to reproduce images. However, they can recreate training images faithfully despite image degradation in the form of blur, noise, and compression, generating similarly degraded images. To solve this problem, the recently proposed noise robust GAN (NR-GAN) provides a partial solution by demonstrating the… ▽ More

    Submitted 23 June, 2021; v1 submitted 17 March, 2020; originally announced March 2020.

    Comments: Accepted to CVPR 2021. Project page: https://takuhirok.github.io/BNCR-GAN/

  18. arXiv:1911.11776  [pdf, other

    cs.CV cs.LG eess.IV stat.ML

    Noise Robust Generative Adversarial Networks

    Authors: Takuhiro Kaneko, Tatsuya Harada

    Abstract: Generative adversarial networks (GANs) are neural networks that learn data distributions through adversarial training. In intensive studies, recent GANs have shown promising results for reproducing training images. However, in spite of noise, they reproduce images with fidelity. As an alternative, we propose a novel family of GANs called noise robust GANs (NR-GANs), which can learn a clean image g… ▽ More

    Submitted 31 March, 2020; v1 submitted 26 November, 2019; originally announced November 2019.

    Comments: Accepted to CVPR 2020. Project page: https://takuhirok.github.io/NR-GAN/

  19. arXiv:1907.12279  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC2: Rethinking Conditional Methods for StarGAN-Based Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem only using a single generator. H… ▽ More

    Submitted 7 August, 2019; v1 submitted 29 July, 2019; originally announced July 2019.

    Comments: Accepted to Interspeech 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html

  20. arXiv:1904.04631  [pdf, other

    cs.SD cs.LG eess.AS stat.ML

    CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion

    Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

    Abstract: Non-parallel voice conversion (VC) is a technique for learning the mapping from source to target speech without relying on parallel data. This is an important task, but it has been challenging due to the disadvantages of the training conditions. Recently, CycleGAN-VC has provided a breakthrough and performed comparably to a parallel VC method without relying on any extra data, modules, or time ali… ▽ More

    Submitted 9 April, 2019; originally announced April 2019.

    Comments: Accepted to ICASSP 2019. Project page: http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html

  21. arXiv:1904.02892  [pdf, ps, other

    cs.SD cs.LG eess.AS stat.ML

    WaveCycleGAN2: Time-domain Neural Post-filter for Speech Waveform Generation

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: WaveCycleGAN has recently been proposed to bridge the gap between natural and synthesized speech waveforms in statistical parametric speech synthesis and provides fast inference with a moving average model rather than an autoregressive model and high-quality speech synthesis with the adversarial training. However, the human ear can still distinguish the processed speech waveforms from natural ones… ▽ More

    Submitted 8 April, 2019; v1 submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH2019

  22. arXiv:1811.04076  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

    Authors: Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper describes a method based on a sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerat… ▽ More

    Submitted 9 November, 2018; originally announced November 2018.

    Comments: Submitted to ICASSP2019

  23. arXiv:1811.01609  [pdf, ps, other

    cs.SD cs.LG eess.AS stat.ML

    ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

    Authors: Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

    Abstract: This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed method, called ConvS2S-VC, has three key features. First, it uses a model with a fully convolutional architecture. This is particularly advantageous in that it is sui… ▽ More

    Submitted 6 October, 2020; v1 submitted 5 November, 2018; originally announced November 2018.

    Comments: Published in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/document/9113442

  24. arXiv:1809.10288  [pdf, ps, other

    eess.AS cs.LG cs.SD stat.ML

    WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

    Authors: Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Hirokazu Kameoka

    Abstract: We propose a learning-based filter that allows us to directly modify a synthetic speech waveform into a natural speech waveform. Speech-processing systems using a vocoder framework such as statistical parametric speech synthesis and voice conversion are convenient especially for a limited number of data because it is possible to represent and process interpretable acoustic features over a compact… ▽ More

    Submitted 28 September, 2018; v1 submitted 25 September, 2018; originally announced September 2018.

    Comments: SLT2018

  25. arXiv:1808.05092  [pdf, ps, other

    stat.ML cs.LG cs.SD eess.AS

    ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a non-parallel many-to-many voice conversion (VC) method using a variant of the conditional variational autoencoder (VAE) called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, it adopts fully convolutional architectures to construct the encoder and decoder networks so that the networks can learn conversion rules that capture time depende… ▽ More

    Submitted 10 October, 2020; v1 submitted 13 August, 2018; originally announced August 2018.

    Comments: Publised in IEEE/ACM Trans. ASLP https://ieeexplore.ieee.org/abstract/document/8718381 Please also refer to our related articles: arXiv:1806.02169, arXiv:2008.12604

  26. arXiv:1806.02169  [pdf, ps, other

    cs.SD cs.LG eess.AS stat.ML

    StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks

    Authors: Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo

    Abstract: This paper proposes a method that allows non-parallel many-to-many voice conversion (VC) by using a variant of a generative adversarial network (GAN) called StarGAN. Our method, which we call StarGAN-VC, is noteworthy in that it (1) requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training, (2) simultaneously learns many-to-many mappings across dif… ▽ More

    Submitted 29 June, 2018; v1 submitted 6 June, 2018; originally announced June 2018.

  27. arXiv:1804.02181  [pdf, ps, other

    eess.SP cs.LG stat.ML

    Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms

    Authors: Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Hiroyasu Ando

    Abstract: In this paper, we address the problem of reconstructing a time-domain signal (or a phase spectrogram) solely from a magnitude spectrogram. Since magnitude spectrograms do not contain phase information, we must restore or infer phase information to reconstruct a time-domain signal. One widely used approach for dealing with the signal reconstruction problem was proposed by Griffin and Lim. This meth… ▽ More

    Submitted 6 April, 2018; originally announced April 2018.

  28. arXiv:1711.11293  [pdf, ps, other

    stat.ML cs.SD eess.AS

    Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks

    Authors: Takuhiro Kaneko, Hirokazu Kameoka

    Abstract: We propose a parallel-data-free voice-conversion (VC) method that can learn a mapping from source to target speech without relying on parallel data. The proposed method is general purpose, high quality, and parallel-data free and works without any extra data, modules, or alignment procedure. It also avoids over-smoothing, which occurs in many conventional statistical model-based VC methods. Our me… ▽ More

    Submitted 20 December, 2017; v1 submitted 30 November, 2017; originally announced November 2017.