Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Xie, Yuying; Kuhlmann, Michael; Rautenberg, Frederik; Tan, Zheng-Hua; Haeb-Umbach, Reinhold

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2409.03520 (eess)

[Submitted on 5 Sep 2024]

Title:Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Authors:Yuying Xie, Michael Kuhlmann, Frederik Rautenberg, Zheng-Hua Tan, Reinhold Haeb-Umbach

View PDF HTML (experimental)

Abstract:Speech signals encompass various information across multiple levels including content, speaker, and style. Disentanglement of these information, although challenging, is important for applications such as voice conversion. The contrastive predictive coding supported factorized variational autoencoder achieves unsupervised disentanglement of a speech signal into speaker and content embeddings by assuming speaker info to be temporally more stable than content-induced variations. However, this assumption may introduce other temporal stable information into the speaker embeddings, like environment or emotion, which we call style. In this work, we propose a method to further disentangle non-content features into distinct speaker and style features, notably by leveraging readily accessible and well-defined speaker labels without the necessity for style labels. Experimental results validate the proposed method's effectiveness on extracting disentangled features, thereby facilitating speaker, style, or combined speaker-style conversion.

Comments:	Accepted by EUSIPCO 2024
Subjects:	Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Cite as:	arXiv:2409.03520 [eess.AS]
	(or arXiv:2409.03520v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2409.03520

Submission history

From: Yuying Xie [view email]
[v1] Thu, 5 Sep 2024 13:33:04 UTC (1,161 KB)

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Electrical Engineering and Systems Science > Audio and Speech Processing

Title:Speaker and Style Disentanglement of Speech Based on Contrastive Predictive Coding Supported Factorized Variational Autoencoder

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators