MMFformer: Multimodal Fusion Transformer Network for Depression Detection

Haque, Md Rezwanul; Islam, Md. Milon; Raju, S M Taslim Uddin; Altaheri, Hamdi; Nassar, Lobna; Karray, Fakhri

arXiv:2508.06701 (cs)

[Submitted on 8 Aug 2025]

Abstract: Depression is a serious mental illness that significantly affects an individual's well-being and quality of life, making early detection crucial for adequate care and treatment. Detecting depression is often difficult because diagnosis relies primarily on subjective evaluations during clinical interviews. Hence, early diagnosis of depression from social network content has become a prominent research area. The extensive and diverse nature of user-generated information poses a significant challenge, limiting both the accurate extraction of relevant temporal information and the effective fusion of data across multiple modalities. This paper introduces MMFformer, a multimodal depression detection network designed to extract high-level spatio-temporal patterns of depression from multimodal social media information. A transformer network with residual connections captures spatial features from videos, and a transformer encoder models the important temporal dynamics in audio. Moreover, the fusion architecture fuses the extracted features through late and intermediate fusion strategies to identify the most relevant intermodal correlations among them. Finally, the proposed network is assessed on two large-scale depression detection datasets, and the results show that it surpasses existing state-of-the-art approaches, improving the F1-Score by 13.92% on the D-Vlog dataset and 7.74% on the LMVD dataset. The code is made publicly available at this https URL.
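To make the distinction between the two fusion strategies named in the abstract concrete, the following is a minimal, generic sketch and not the authors' implementation. The abstract does not specify how MMFformer combines features, so everything here is an assumption for illustration: the function names, the score averaging used for late fusion, the concatenation used for intermediate fusion, and the toy embeddings and weights.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(features, weights, bias=0.0):
    """Dot product plus bias: a stand-in for a trained classification head."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def late_fusion(video_feat, audio_feat, w_video, w_audio):
    """Classify each modality separately, then average the two probabilities.

    Assumption: equal-weight averaging; real systems may learn the weights.
    """
    p_video = sigmoid(linear(video_feat, w_video))
    p_audio = sigmoid(linear(audio_feat, w_audio))
    return 0.5 * (p_video + p_audio)


def intermediate_fusion(video_feat, audio_feat, w_joint):
    """Concatenate the modality features first, then apply one joint head.

    Assumption: plain concatenation; attention-based mixing is also common.
    """
    joint = list(video_feat) + list(audio_feat)
    return sigmoid(linear(joint, w_joint))


# Hypothetical 2-dimensional embeddings and all-zero weights, illustration only.
video_feat, audio_feat = [0.5, -1.0], [1.0, 2.0]
p_late = late_fusion(video_feat, audio_feat, [0.0, 0.0], [0.0, 0.0])
p_mid = intermediate_fusion(video_feat, audio_feat, [0.0, 0.0, 0.0, 0.0])
```

In this sketch, late fusion lets each modality reach its own decision before the scores are merged, while intermediate fusion gives a single head access to both feature vectors at once, which is what allows it to pick up cross-modal correlations.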

Comments:	Accepted for the 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Vienna, Austria
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2508.06701 [cs.CV]
	(or arXiv:2508.06701v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2508.06701

Submission history

From: Md Rezwanul Haque
[v1] Fri, 8 Aug 2025 21:03:29 UTC (285 KB)
