Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing

Tong, Haonan; Li, Haopeng; Du, Hongyang; Yang, Zhaohui; Yin, Changchuan; Niyato, Dusit

Computer Science > Multimedia

arXiv:2410.22112 (cs)

[Submitted on 29 Oct 2024]

Abstract: This paper studies an efficient multimodal data communication scheme for video conferencing. In the considered system, a speaker gives a talk to an audience, and both the talking-head video and the audio are transmitted. Because the speaker rarely changes posture while the audio (speech and music) must be transmitted with high fidelity, much of the video data is redundant and can be removed by generating the video from the audio. To this end, we propose a wave-to-video (Wav2Vid) system, an efficient video transmission framework that reduces the transmitted data by generating talking-head video from audio. In particular, full-duration audio and short-duration video are transmitted synchronously over a wireless channel, with neural networks (NNs) extracting and encoding the audio and video semantics. The receiver then combines the decoded audio and video data and uses a generative adversarial network (GAN) based model to generate the speaker's lip-movement video. Simulation results show that the proposed Wav2Vid system can reduce the amount of transmitted data by up to 83% while maintaining the perceptual quality of the generated conferencing video.
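
The saving from sending full-duration audio with only short-duration video can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' evaluation method: the function name, the constant-bitrate model, and all numbers in the usage example are hypothetical assumptions chosen for illustration, and it ignores the semantic encoding and channel effects described in the abstract.

```python
def transmitted_data_reduction(duration_s, video_sent_s, audio_kbps, video_kbps):
    """Fraction of data saved when only part of the video is transmitted.

    Baseline: audio and video are both sent for the full duration.
    Reduced:  audio is sent for the full duration, video only for
              `video_sent_s` seconds (the rest is generated at the receiver).
    Constant bitrates are an illustrative assumption, not from the paper.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if not 0 <= video_sent_s <= duration_s:
        raise ValueError("video_sent_s must lie in [0, duration_s]")
    if audio_kbps < 0 or video_kbps < 0 or audio_kbps + video_kbps == 0:
        raise ValueError("bitrates must be non-negative and not both zero")

    baseline_kbit = duration_s * (audio_kbps + video_kbps)
    reduced_kbit = duration_s * audio_kbps + video_sent_s * video_kbps
    return 1.0 - reduced_kbit / baseline_kbit


if __name__ == "__main__":
    # Hypothetical values: 60 s talk, 6 s of video sent,
    # 64 kbit/s audio, 512 kbit/s video.
    saving = transmitted_data_reduction(60, 6, 64, 512)
    print(f"data saved: {saving:.0%}")
```

Under this simple model the saving grows as the transmitted video gets shorter and as video dominates the total bitrate; the audio stream sets a floor on how much data must always be sent.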

Comments:	Accepted by IEEE Wireless Communications Letters
Subjects:	Multimedia (cs.MM)
Cite as:	arXiv:2410.22112 [cs.MM]
	(or arXiv:2410.22112v1 [cs.MM] for this version)
	https://doi.org/10.48550/arXiv.2410.22112

Submission history

From: Haonan Tong
[v1] Tue, 29 Oct 2024 15:11:45 UTC (18,576 KB)
