Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation

Miao, Xiaoxiao; Zhang, Yuxiang; Wang, Xin; Tomashenko, Natalia; Soh, Donny Cheng Lock; Mcloughlin, Ian

arXiv:2408.05928 (cs)

[Submitted on 12 Aug 2024 (v1), last revised 23 Apr 2025 (this version, v2)]

Abstract: A general disentanglement-based speaker anonymization system typically separates speech into content, speaker, and prosody features using individual encoders. This paper explores how to adapt such a system when a new speech attribute, for example emotion, needs to be preserved to a greater extent. While existing systems are good at anonymizing speaker embeddings, they are not designed to preserve emotion. We examine two strategies for doing so. First, we show that integrating emotion embeddings from a pre-trained emotion encoder can help preserve emotional cues, even though this approach slightly compromises privacy protection. Alternatively, we propose an emotion compensation strategy as a post-processing step applied to anonymized speaker embeddings. This conceals the original speaker's identity and reintroduces the emotional traits lost during speaker embedding anonymization. Specifically, we model the emotion attribute using support vector machines (SVMs) to learn a separate boundary for each emotion. During inference, the original speaker embedding is processed in two ways: first, by an emotion indicator that predicts the emotion and selects the emotion-matched SVM; and second, by a speaker anonymizer that conceals speaker characteristics. The anonymized speaker embedding is then shifted along the corresponding SVM boundary in an emotion-enhancing direction to preserve the emotional cues. The proposed strategies are also expected to be useful for adapting a general disentanglement-based speaker anonymization system to preserve other target paralinguistic attributes, with potential for a range of downstream tasks.
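
The emotion compensation step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so everything below is an assumption made for illustration: each emotion is modelled by a linear SVM with weight vector `w` and bias `b`, the "emotion indicator" is replaced by a stand-in that picks the SVM with the highest decision value, the anonymized embedding is shifted along the unit normal of the selected hyperplane, and `alpha` is a hypothetical step size. The embeddings and SVM parameters are random placeholders, not outputs of the authors' system.

```python
import numpy as np


def svm_score(x, w, b):
    """Signed decision value of a linear SVM: w . x + b."""
    return float(np.dot(w, x) + b)


def predict_emotion(orig_emb, svms):
    """Stand-in emotion indicator (assumption): choose the emotion whose
    linear SVM gives the largest decision value on the original embedding."""
    return max(svms, key=lambda name: svm_score(orig_emb, *svms[name]))


def compensate_emotion(anon_emb, w, alpha=1.0):
    """Shift the anonymized embedding by alpha along the unit normal of the
    selected SVM hyperplane, i.e. towards the positive (emotion) side."""
    normal = w / np.linalg.norm(w)
    return anon_emb + alpha * normal


# Placeholder data: random SVM parameters and embeddings (illustrative only).
rng = np.random.default_rng(0)
dim = 8
svms = {name: (rng.normal(size=dim), 0.0) for name in ("angry", "happy", "sad")}
orig_emb = rng.normal(size=dim)   # original speaker embedding
anon_emb = rng.normal(size=dim)   # output of a speaker anonymizer (placeholder)

emotion = predict_emotion(orig_emb, svms)   # branch 1: pick the matching SVM
w, b = svms[emotion]
alpha = 2.0                                 # hypothetical step size
new_emb = compensate_emotion(anon_emb, w, alpha)  # branch 2 output, compensated
```

Under these assumptions, the shift raises the selected SVM's decision value by exactly `alpha * ||w||`, so a larger `alpha` moves the anonymized embedding further into the region associated with the predicted emotion. How far it can be moved before speaker privacy degrades is an empirical question that this sketch does not address.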

Comments:	Accepted by Computer Speech and Language
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2408.05928 [cs.SD]
	(or arXiv:2408.05928v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2408.05928

Submission history

From: Xiaoxiao Miao
[v1] Mon, 12 Aug 2024 05:40:21 UTC (2,793 KB)
[v2] Wed, 23 Apr 2025 00:44:18 UTC (6,077 KB)
