Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis

Sun, Licai; Lian, Zheng; Liu, Bin; Tao, Jianhua

doi:10.1109/TAFFC.2023.3274829

arXiv:2208.07589 (cs)

[Submitted on 16 Aug 2022 (v1), last revised 22 May 2023 (this version, v2)]

Abstract: With the proliferation of user-generated online videos, Multimodal Sentiment Analysis (MSA) has attracted increasing attention. Despite significant progress, two major challenges remain on the way towards robust MSA: 1) inefficiency when modeling cross-modal interactions in unaligned multimodal data; and 2) vulnerability to randomly missing modality features, which typically occur in realistic settings. In this paper, we propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR). Concretely, EMT employs utterance-level representations from each modality as the global multimodal context, which interacts with local unimodal features so that the two mutually promote each other. It not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also leads to better performance. To improve model robustness in the incomplete modality setting, on the one hand, DLFR performs low-level feature reconstruction to implicitly encourage the model to learn semantic information from incomplete data. On the other hand, it innovatively regards complete and incomplete data as two different views of one sample and utilizes siamese representation learning to explicitly attract their high-level representations. Comprehensive experiments on three popular datasets demonstrate that our method achieves superior performance in both complete and incomplete modality settings.
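
The scaling claim in the abstract can be illustrated with a rough count of attention-score entries. The sketch below is not taken from the paper or its released code. It assumes one utterance-level token per modality forms the global context, and it counts only pairwise scores, ignoring feature dimension, heads, and layers. The function names and the sequence lengths are hypothetical.

```python
from itertools import permutations


def local_local_pairs(lengths):
    """Score entries when each modality's sequence attends to every other
    modality's full sequence (ordered pairs of modalities)."""
    return sum(la * lb for la, lb in permutations(lengths, 2))


def global_local_pairs(lengths):
    """Score entries when a small global context (assumed: one utterance-level
    token per modality) exchanges information with each local sequence."""
    num_global = len(lengths)
    # Two directions per modality: global -> local and local -> global.
    return sum(2 * num_global * length for length in lengths)


# Hypothetical unaligned sequence lengths for text, audio, and video.
lengths = [50, 375, 500]
print(local_local_pairs(lengths))   # 462500
print(global_local_pairs(lengths))  # 5550

# Doubling every sequence length quadruples the local-local count
# but only doubles the global-local count.
doubled = [2 * n for n in lengths]
print(local_local_pairs(doubled) // local_local_pairs(lengths))    # 4
print(global_local_pairs(doubled) // global_local_pairs(lengths))  # 2
```

Under these assumptions the local-local count grows with the product of sequence lengths, while the global-local count grows linearly in their sum, which is the efficiency argument the abstract makes.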

Comments:	Accepted by TAC. The code is available at this https URL
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2208.07589 [cs.LG]
	(or arXiv:2208.07589v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2208.07589
Journal reference:	IEEE Transactions on Affective Computing, 2023
Related DOI:	https://doi.org/10.1109/TAFFC.2023.3274829

Submission history

From: Licai Sun
[v1] Tue, 16 Aug 2022 08:02:30 UTC (1,009 KB)
[v2] Mon, 22 May 2023 02:27:07 UTC (1,981 KB)
