Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective

Rajput, Krishna Singh; Anvekar, Tejas; Baral, Chitta; Gupta, Vivek

arXiv:2505.20816 (cs)

[Submitted on 27 May 2025]

Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy and overlook the unique characteristics of each modality, which ultimately limits both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
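The three-stage flow described in the abstract (decompose and retrieve, synthesize, integrate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Agent` callable type, the `run_pipeline` function, the `Trace` and `PartialAnswer` records, the prompt wording, and the `stub_agent` stand-in are all hypothetical names introduced here, and each modality is assumed to be passed as a plain string placeholder. A real system would call actual VLM and LLM backends in place of the stub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: an agent is any callable from prompt text to reply text.
Agent = Callable[[str], str]


@dataclass
class PartialAnswer:
    modality: str
    sub_question: str
    answer: str


@dataclass
class Trace:
    """Keeps every intermediate step so the reasoning can be inspected."""
    sub_questions: List[str]
    partials: List[PartialAnswer]
    refined: str
    answer: str


def run_pipeline(question: str, sources: Dict[str, str], decomposer: Agent,
                 synthesizer: Agent, integrator: Agent) -> Trace:
    # Stage 1a: the first agent splits the query into sub-questions
    # (assumed here to come back one per line).
    raw = decomposer(f"Decompose: {question}")
    sub_questions = [line.strip() for line in raw.splitlines() if line.strip()]

    # Stage 1b: the same agent visits each modality in turn and
    # collects a partial answer for every sub-question.
    partials: List[PartialAnswer] = []
    for modality, content in sources.items():
        for sq in sub_questions:
            reply = decomposer(f"Source ({modality}): {content}\nAnswer: {sq}")
            partials.append(PartialAnswer(modality, sq, reply))

    # Stage 2: the second agent refines the partial answers across modalities.
    evidence = "\n".join(
        f"{p.modality} | {p.sub_question} -> {p.answer}" for p in partials
    )
    refined = synthesizer(f"Refine across modalities:\n{evidence}")

    # Stage 3: the text-only agent integrates the findings into one answer.
    answer = integrator(f"Question: {question}\nFindings: {refined}")
    return Trace(sub_questions, partials, refined, answer)


def stub_agent(prompt: str) -> str:
    # Deterministic stand-in for a model call, for demonstration only.
    if prompt.startswith("Decompose:"):
        return "sub-question A\nsub-question B"
    return f"stub reply to a {len(prompt)}-character prompt"


trace = run_pipeline(
    "Example question spanning text, a table, and an image",
    {"text": "placeholder passage", "table": "placeholder rows",
     "image": "placeholder image reference"},
    stub_agent, stub_agent, stub_agent,
)
for p in trace.partials:
    print(p.modality, "|", p.sub_question, "->", p.answer)
print(trace.answer)
```

Returning a `Trace` rather than only the final string mirrors the abstract's interpretability claim: each sub-question and per-modality partial answer remains available for inspection after the run.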

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2505.20816 [cs.CL]
	(or arXiv:2505.20816v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2505.20816

Submission history

From: Tejas Anvekar
[v1] Tue, 27 May 2025 07:23:38 UTC (498 KB)
