VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

Li, Junjie; Ge, Meng; Pan, Zexu; Wang, Longbiao; Dang, Jianwu

arXiv:2210.06177 (cs)

[Submitted on 9 Oct 2022]

Abstract: Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such reference can be auditory, i.e., a pre-recorded speech, visual, i.e., lip movements, or contextual, i.e., phonetic sequence. References in different modalities provide distinct and complementary information that could be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of every modality. In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms other state-of-the-art baselines.
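
The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustration of the wiring, not the authors' implementation: the function names (`two_stage_extract`, `toy_stage1`, `toy_stage2`), the list-based signal types, and the element-wise toy stages are all assumptions made for the example. In particular, whether the second stage also receives the original mixture is not stated in the abstract, so the sketch passes only the pre-extracted speech and the contextual cue.

```python
from typing import Callable, List, Tuple

Signal = List[float]  # hypothetical stand-in for a time-domain waveform
Cue = List[float]     # hypothetical stand-in for a per-step cue sequence


def two_stage_extract(
    mixture: Signal,
    visual_cue: Cue,
    stage1: Callable[[Signal, Cue], Tuple[Signal, Cue]],
    stage2: Callable[[Signal, Cue], Signal],
) -> Tuple[Signal, Signal]:
    """Run two stages in sequence and return (pre_extracted, refined)."""
    # Stage 1: use the visual cue to pre-extract the target speech and to
    # estimate a contextual (phonetic) sequence from it.
    pre_extracted, contextual_cue = stage1(mixture, visual_cue)
    # Stage 2: refine the pre-extracted speech with the self-enrolled
    # contextual cue produced by stage 1 (assumed inputs, see lead-in).
    refined = stage2(pre_extracted, contextual_cue)
    return pre_extracted, refined


def toy_stage1(mixture: Signal, visual_cue: Cue) -> Tuple[Signal, Cue]:
    # Placeholder, NOT a real network: gate each sample by the visual cue,
    # then derive a fake 0/1 "phonetic" sequence by thresholding.
    pre = [m * v for m, v in zip(mixture, visual_cue)]
    phonetic = [1.0 if abs(p) > 0.5 else 0.0 for p in pre]
    return pre, phonetic


def toy_stage2(pre_extracted: Signal, contextual_cue: Cue) -> Signal:
    # Placeholder, NOT a real network: keep samples where the cue is active.
    return [p * c for p, c in zip(pre_extracted, contextual_cue)]


if __name__ == "__main__":
    mixture = [1.0, 0.2, -0.8, 0.4]
    visual_cue = [1.0, 0.0, 1.0, 1.0]
    pre, refined = two_stage_extract(mixture, visual_cue, toy_stage1, toy_stage2)
    print(pre)
    print(refined)
```

The point of passing the stages in as callables is that the contextual cue is never supplied by the caller; it is produced by the first stage, which is what "self-enrolled" suggests in the abstract.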

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2210.06177 [cs.CV]
	(or arXiv:2210.06177v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2210.06177

Submission history

From: Junjie Li
[v1] Sun, 9 Oct 2022 12:29:38 UTC (167 KB)
