Towards Disentangled Speech Representations

Peyser, Cal; Huang, Ronny; Rosenberg, Andrew; Sainath, Tara N.; Picheny, Michael; Cho, Kyunghyun

arXiv:2208.13191 (cs)

[Submitted on 28 Aug 2022]

Abstract: The careful construction of audio representations has become a dominant feature in the design of approaches to many speech tasks. Increasingly, such approaches have emphasized "disentanglement", where a representation contains only parts of the speech signal relevant to transcription while discarding irrelevant information. In this paper, we construct a representation learning task based on joint modeling of ASR and TTS, and seek to learn a representation of audio that disentangles that part of the speech signal that is relevant to transcription from that part which is not. We present empirical evidence that successfully finding such a representation is tied to the randomness inherent in training. We then make the observation that these desired, disentangled solutions to the optimization problem possess unique statistical properties. Finally, we show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task. These observations motivate a novel approach to learning effective audio representations.
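
The 24.5% figure in the abstract is a relative WER improvement, not an absolute one. The sketch below illustrates how such a number is typically computed, assuming the standard definition of WER as word-level edit distance divided by the number of reference words. The function names and the example WER values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Before row i is processed, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline at 10.0% WER and a new system at 7.55% WER
# differ by 2.45 points absolute, which is a 24.5% relative reduction.
example_reduction = relative_wer_reduction(10.0, 7.55)
```

Under this definition, the same relative improvement can correspond to very different absolute gains depending on the baseline, which is why the abstract qualifies the figure as "relative".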

Subjects:	Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2208.13191 [cs.SD]
	(or arXiv:2208.13191v1 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2208.13191

Submission history

From: Cal Peyser
[v1] Sun, 28 Aug 2022 10:03:55 UTC (2,534 KB)
