Towards a Unified Multimodal Reasoning Framework

Arun, Abhinav; Mal, Dipendra Singh; Soni, Mehul; Sawada, Tomohiro

arXiv:2312.15021 (cs)

[Submitted on 22 Dec 2023]

Abstract: Recent advances in deep learning have produced powerful language models (LMs) that excel at a wide range of tasks. There is still room for improvement, particularly in strengthening reasoning abilities and incorporating multimodal data. This report investigates whether combining Chain-of-Thought (CoT) reasoning with Visual Question Answering (VQA) techniques improves an LM's accuracy on multiple-choice questions. Using the TextVQA and ScienceQA datasets, we assessed the effectiveness of three text embedding methods and three visual embedding approaches. Our experiments aim to fill a gap in current research by studying the combined impact of CoT and VQA, and to contribute to the understanding of how these techniques can improve the reasoning capabilities of state-of-the-art models such as GPT-4. The results demonstrate the potential of these approaches to enhance LMs' reasoning and question-answering capabilities. They provide insights for further research and development, and pave the way for more accurate and reliable AI systems that can handle complex reasoning tasks across multiple modalities.

Comments:	6 pages, 11 figures
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2312.15021 [cs.CL]
	(or arXiv:2312.15021v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.15021

Submission history

From: Abhinav Arun
[v1] Fri, 22 Dec 2023 19:07:00 UTC (3,397 KB)
