MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

Ahire, Vrushank; Shah, Kunal; Khan, Mudasir Nazir; Pakhale, Nikhil; Sookha, Lownish Rai; Ganaie, M. A.; Dhall, Abhinav

arXiv:2503.12623 (cs)

[Submitted on 16 Mar 2025 (v1), last revised 2 May 2025 (this version, v2)]

Abstract: Dynamic emotion recognition in the wild remains challenging because emotional expressions are transient and multi-modal cues are temporally misaligned. Traditional approaches predict valence and arousal but often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. Evaluated on the Aff-Wild2 dataset, MAVEN achieved a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline model, which has a CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world situations. The code is available at: this https URL
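
For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch, not the authors' implementation. It computes the standard (Lin's) concordance correlation coefficient and converts a valence-arousal pair to polar form and back. The function names, the use of population variance, and the specific polar parameterization (radius as intensity, angle measured from the valence axis) are assumptions made for illustration; the abstract does not specify how MAVEN parameterizes its polar outputs.

```python
import math


def concordance_cc(pred, target):
    """Lin's concordance correlation coefficient for two equal-length sequences.

    Uses population (divide-by-n) variance and covariance. This is an assumption:
    the abstract does not say which estimator the paper uses.
    """
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and the same length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


def to_polar(valence, arousal):
    """Map a (valence, arousal) point to (radius, angle in radians).

    Hypothetical parameterization: angle is measured from the positive valence axis.
    """
    return math.hypot(valence, arousal), math.atan2(arousal, valence)


def from_polar(radius, angle):
    """Inverse of to_polar: recover (valence, arousal) from (radius, angle)."""
    return radius * math.cos(angle), radius * math.sin(angle)
```

CCC penalizes both low correlation and systematic offset between predictions and labels, which is why it is preferred over plain Pearson correlation for continuous affect labels. In the polar form, the radius can be read as emotional intensity and the angle as the position on the circumplex, so a single point couples valence and arousal rather than treating them as unrelated targets.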

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Cite as:	arXiv:2503.12623 [cs.LG]
	(or arXiv:2503.12623v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2503.12623

Submission history

From: Vrushank Ahire
[v1] Sun, 16 Mar 2025 19:32:32 UTC (346 KB)
[v2] Fri, 2 May 2025 07:17:44 UTC (2,407 KB)