Multi-Utterance Speech Separation and Association Trained on Short Segments

Wang, Yuzhu; Politis, Archontis; Drossos, Konstantinos; Virtanen, Tuomas

arXiv:2507.02562 (eess)

[Submitted on 3 Jul 2025]

Abstract: Current deep neural network (DNN) based speech separation faces a fundamental challenge: models must be trained on short segments because of computational constraints, yet real-world applications typically require processing recordings that are significantly longer than those seen during training and that contain multiple utterances per speaker. In this paper, we investigate how existing approaches perform in this challenging scenario and propose a frequency-temporal recurrent neural network (FTRNN) that effectively bridges this gap. Our FTRNN employs a full-band module that models frequency dependencies within each time frame and a sub-band module that models temporal patterns within each frequency band. Despite being trained on short fixed-length segments of 10 s, our model demonstrates robust separation on signals significantly longer than the training segments (21-121 s) and preserves speaker association across utterance gaps exceeding those seen during training. Unlike the conventional segment-separation-stitch paradigm, our lightweight approach (0.9 M parameters) performs inference on long audio without segmentation, eliminating segment boundary distortions while simplifying deployment. Experimental results demonstrate the generalization ability of FTRNN for multi-utterance speech separation and speaker association.
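
The full-band / sub-band split described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' FTRNN: the plain tanh recurrence, the residual connections, the unidirectional time recurrence, the layer sizes, and the names `simple_rnn`, `ft_block`, and `init_params` are all assumptions made for illustration. It shows only the axis handling: one recurrence runs along frequency inside each time frame, another runs along time inside each frequency band, and the same weights apply to an input of any length without segmentation.

```python
import numpy as np


def simple_rnn(x, w_in, w_rec):
    """Run a tanh recurrence over axis 1 of x, shaped (batch, steps, features)."""
    batch, steps, _ = x.shape
    hidden = np.zeros((batch, w_rec.shape[0]))
    outputs = []
    for step in range(steps):
        hidden = np.tanh(x[:, step, :] @ w_in + hidden @ w_rec)
        outputs.append(hidden)
    return np.stack(outputs, axis=1)


def ft_block(spec, params):
    """Apply one hypothetical full-band + sub-band block to spec, shaped (T, F, C)."""
    # Full-band step: every time frame is a batch item and the recurrence
    # runs along the frequency axis.
    full = simple_rnn(spec, params["full_in"], params["full_rec"])
    spec = spec + full  # residual connection (an assumption of this sketch)

    # Sub-band step: every frequency band is a batch item and the recurrence
    # runs along the time axis, so the number of frames T is unconstrained.
    per_band = spec.transpose(1, 0, 2)  # (F, T, C)
    sub = simple_rnn(per_band, params["sub_in"], params["sub_rec"])
    return spec + sub.transpose(1, 0, 2)


def init_params(channels, seed=0):
    """Create small random square weight matrices (illustrative values only)."""
    rng = np.random.default_rng(seed)
    names = ["full_in", "full_rec", "sub_in", "sub_rec"]
    return {name: 0.1 * rng.standard_normal((channels, channels)) for name in names}


params = init_params(channels=8)
rng = np.random.default_rng(1)
short_spec = rng.standard_normal((100, 65, 8))   # stand-in for a short training segment
long_spec = rng.standard_normal((1200, 65, 8))   # stand-in for a much longer recording

print(ft_block(short_spec, params).shape)
print(ft_block(long_spec, params).shape)
```

Because no weight in this sketch depends on the number of frames, the long input is processed in a single pass, which is the property the abstract contrasts with the segment-separation-stitch paradigm. Whether a trained model also separates well at such lengths is an empirical question that the sketch does not address.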

Comments:	5 pages, accepted by WASPAA 2025
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2507.02562 [eess.AS]
	(or arXiv:2507.02562v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2507.02562

Submission history

From: Yuzhu Wang
[v1] Thu, 3 Jul 2025 12:12:34 UTC (19,131 KB)
