Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective
Authors:
Krishna Singh Rajput,
Tejas Anvekar,
Chitta Baral,
Vivek Gupta
Abstract:
Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy and overlook the unique characteristics of each modality, which ultimately limits both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
Submitted 27 May, 2025;
originally announced May 2025.
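The three-agent flow described in the MAMMQA abstract above (decompose and retrieve, cross-modal synthesis, final integration) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `run_pipeline`, `Trace`, and `MODALITIES`, the prompt wording, and the choice to pass each modality as a plain string are all assumptions made for the example. Each agent is modeled as a plain callable so that any VLM or LLM client could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical agent interface: prompt in, text out. A real VLM or LLM
# client would be wrapped to match this signature.
Agent = Callable[[str], str]

# Modalities named in the abstract; the string keys are an assumption.
MODALITIES = ("text", "table", "image")


@dataclass
class Trace:
    """Intermediate outputs, kept so the reasoning chain can be inspected."""
    sub_questions: List[str]
    partial_answers: Dict[str, List[str]]
    synthesis: str
    final_answer: str


def run_pipeline(question: str, context: Dict[str, str],
                 vlm_1: Agent, vlm_2: Agent, llm: Agent) -> Trace:
    # Agent 1 (first VLM): split the query into sub-questions.
    raw = vlm_1(f"Decompose into sub-questions, one per line:\n{question}")
    sub_questions = [line.strip() for line in raw.splitlines() if line.strip()]

    # Agent 1 again: visit the modalities one after another and collect
    # a partial answer for every sub-question.
    partial_answers: Dict[str, List[str]] = {}
    for modality in MODALITIES:
        source = context.get(modality, "")
        partial_answers[modality] = [
            vlm_1(f"[{modality}] {source}\nQ: {sq}") for sq in sub_questions
        ]

    # Agent 2 (second VLM): reconcile the partial answers across modalities.
    evidence = "\n".join(
        f"{modality}: {ans}"
        for modality, answers in partial_answers.items()
        for ans in answers
    )
    synthesis = vlm_2(f"Reconcile these findings across modalities:\n{evidence}")

    # Agent 3 (text-only LLM): write the final answer from the synthesis.
    final_answer = llm(f"Question: {question}\nFindings: {synthesis}\nAnswer:")
    return Trace(sub_questions, partial_answers, synthesis, final_answer)
```

Returning the whole `Trace` rather than only the final string is one way to reflect the transparency the abstract emphasizes: every sub-question and per-modality answer remains available for inspection.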
On Arrhythmia Detection by Deep Learning and Multidimensional Representation
Authors:
K. S. Rajput,
S. Wibowo,
C. Hao,
M. Majmudar
Abstract:
An electrocardiogram (ECG) is a time-series signal that is represented by one-dimensional (1-D) data. A higher-dimensional representation contains more information that is accessible for feature extraction. Hidden variables such as frequency relations and segment morphology are not directly accessible in the time domain. In this paper, 1-D time-series data is converted into a multi-dimensional representation in the form of multichannel 2-D images. Following that, deep learning is used to train a deep neural network based classifier to detect arrhythmias. Simulation results on the testing database demonstrate the effectiveness of the proposed methodology, showing outstanding classification performance compared to other existing methods and to hand-crafted annotations made by certified cardiologists.
Submitted 11 April, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
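The conversion step mentioned in the arrhythmia abstract above, turning a 1-D ECG trace into a multichannel 2-D image, can be illustrated with a short sketch. The abstract does not say which transforms make up the channels, so everything here is an assumption chosen for illustration: a log-magnitude short-time Fourier transform as the first channel, plus its derivatives along time and frequency as two further channels. The function names and the window and hop sizes are hypothetical as well.

```python
import numpy as np


def stft_log_magnitude(signal, win=64, hop=16):
    """Log-magnitude short-time Fourier transform.

    Returns an array of shape (win // 2 + 1, n_frames).
    """
    signal = np.asarray(signal, dtype=float)
    if signal.size < win + hop:
        raise ValueError("signal too short for at least two frames")
    n_frames = 1 + (signal.size - win) // hop
    window = np.hanning(win)
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop:i * hop + win] * window for i in range(n_frames)]
    )
    spectrum = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)
    return np.log1p(spectrum).T


def to_multichannel_image(signal, win=64, hop=16):
    """Stack three assumed channels into a (3, freq_bins, n_frames) image."""
    mag = stft_log_magnitude(signal, win, hop)
    d_time = np.gradient(mag, axis=1)  # change between neighbouring frames
    d_freq = np.gradient(mag, axis=0)  # change between neighbouring bins
    return np.stack([mag, d_time, d_freq])


# Usage: one second of a synthetic 45 Hz tone sampled at 360 Hz.
fs = 360
t = np.arange(fs) / fs
image = to_multichannel_image(np.sin(2 * np.pi * 45 * t))
print(image.shape)
```

The resulting array has the channel-first layout that image classifiers commonly accept, which is the point of the representation: frequency content that is hidden in the raw trace becomes a spatial pattern a convolutional network can pick up.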