Multimodal Dual Attention Memory for Video Story Question Answering

Kim, Kyung-Min; Choi, Seong-Ho; Kim, Jin-Hwa; Zhang, Byoung-Tak


arXiv:1809.07999 (cs)

[Submitted on 21 Sep 2018]



Abstract: We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM first uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM then applies a second attention over these latent concepts. Multimodal fusion is performed only after both attention processes (late fusion). With this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which have large-scale QA annotations on cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results by significant margins over the runner-up models. Ablation studies confirm that the dual attention mechanism combined with late fusion performs best. We also perform a qualitative analysis by visualizing the inference mechanisms of MDAM.
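
The pipeline described in the abstract (self-attention within each modality, a question-guided second attention, then late fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the unparameterized dot-product attention, the element-wise product used as the fusion step, the dot-product answer scoring, and all function names and dimensions below are assumptions chosen only to show the order of operations.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, rows):
    # Combine row vectors into one vector using the given weights.
    dim = len(rows[0])
    return [sum(w * r[j] for w, r in zip(weights, rows)) for j in range(dim)]


def self_attention(rows):
    # First attention: each frame (or caption) embedding attends to all
    # embeddings of the same modality. No learned projections (assumption).
    scale = math.sqrt(len(rows[0]))
    out = []
    for r in rows:
        weights = softmax([dot(r, other) / scale for other in rows])
        out.append(weighted_sum(weights, rows))
    return out


def question_attention(q, memory):
    # Second attention: the question vector weights the memory rows and
    # yields a single summary vector for that modality.
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, m) / scale for m in memory])
    return weighted_sum(weights, memory)


def answer_probs(q, frames, captions, answers):
    v = question_attention(q, self_attention(frames))
    c = question_attention(q, self_attention(captions))
    # Late fusion: modalities are combined only after both attention steps.
    # The element-wise product is a placeholder, not the paper's fusion method.
    fused = [a * b for a, b in zip(v, c)]
    return softmax([dot(fused, a) for a in answers])


def random_rows(n, dim):
    # Hypothetical stand-in for real frame, caption, or answer embeddings.
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
DIM = 8
frames = random_rows(5, DIM)
captions = random_rows(5, DIM)
answers = random_rows(4, DIM)
question = random_rows(1, DIM)[0]

probs = answer_probs(question, frames, captions, answers)
print(probs)
```

In this sketch the two modalities never interact until `fused` is computed, which is the property the abstract refers to as late fusion; an early-fusion variant would instead combine frame and caption embeddings before either attention step.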

Comments:	Accepted for ECCV 2018
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:1809.07999 [cs.CV]
	(or arXiv:1809.07999v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1809.07999

Submission history

From: Kyungmin Kim
[v1] Fri, 21 Sep 2018 09:19:12 UTC (8,056 KB)
