VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

Yang, Jinhyeok; Lee, Junmo; Kim, Youngik; Cho, Hoonyoung; Kim, Injung

arXiv:2007.15256 (eess)

[Submitted on 30 Jul 2020]

Abstract: We present VocGAN, a novel high-fidelity real-time neural vocoder. MelGAN, a recently developed GAN-based vocoder, produces speech waveforms in real time. However, its output is often insufficient in quality or inconsistent with the acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN but significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster than real time on a GTX 1080Ti GPU and 3.24x faster than real time on a CPU. Compared with MelGAN, it also exhibits significantly improved quality on multiple evaluation metrics, including mean opinion score (MOS), with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and achieves a higher MOS.
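To make the speed figures in the abstract concrete, the short sketch below converts each "x faster than real time" factor into the wall-clock time needed to synthesize one second of audio. The function and variable names are illustrative and do not come from the paper. The Parallel WaveGAN CPU figure is only an inference from the two ratios quoted in the abstract (3.24 / 6.98), not a number the abstract reports directly.

```python
def seconds_per_audio_second(speedup: float) -> float:
    """Wall-clock seconds needed to synthesize one second of audio,
    given a speedup factor relative to real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


# Speedup factors quoted in the abstract.
GPU_SPEEDUP = 416.7            # GTX 1080Ti GPU, relative to real time
CPU_SPEEDUP = 3.24             # CPU, relative to real time
CPU_ADVANTAGE_OVER_PWG = 6.98  # VocGAN vs. Parallel WaveGAN on a CPU

# Time to generate one second of speech, in milliseconds.
gpu_ms = seconds_per_audio_second(GPU_SPEEDUP) * 1000.0
cpu_ms = seconds_per_audio_second(CPU_SPEEDUP) * 1000.0

# Assumption: both CPU figures were measured under comparable conditions,
# so dividing them gives Parallel WaveGAN's implied CPU speed
# relative to real time.
pwg_cpu_speedup = CPU_SPEEDUP / CPU_ADVANTAGE_OVER_PWG

print(f"GPU: {gpu_ms:.2f} ms per second of audio")
print(f"CPU: {cpu_ms:.1f} ms per second of audio")
print(f"Implied Parallel WaveGAN CPU speed: {pwg_cpu_speedup:.2f}x real time")
```

Under these assumptions the GPU needs roughly 2.4 ms and the CPU roughly 309 ms per second of speech. The implied Parallel WaveGAN factor is below 1, meaning slower than real time on the same CPU.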

Comments: Accepted to INTERSPEECH 2020
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Cite as: arXiv:2007.15256 [eess.AS] (or arXiv:2007.15256v1 [eess.AS] for this version)
DOI: https://doi.org/10.48550/arXiv.2007.15256

Submission history

From: Jinhyeok Yang
[v1] Thu, 30 Jul 2020 06:33:53 UTC (540 KB)
