IllusionVQA: A Challenging Optical Illusion Dataset for Vision Language Models

Shahgir, Haz Sameen; Sayeed, Khondker Salman; Bhattacharjee, Abhik; Ahmad, Wasi Uddin; Dong, Yue; Shahriyar, Rifat

arXiv:2403.15952 (cs)

[Submitted on 23 Mar 2024 (v1), last revised 9 Aug 2024 (this version, v3)]

Abstract: The advent of Vision Language Models (VLMs) has allowed researchers to investigate the visual understanding of a neural network using natural language. Beyond object classification and detection, VLMs are capable of visual comprehension and common-sense reasoning. This naturally leads to the question: how do VLMs respond when the image itself is inherently unreasonable? To this end, we present IllusionVQA: a diverse dataset of challenging optical illusions and hard-to-interpret scenes to test the capability of VLMs in two distinct multiple-choice VQA tasks: comprehension and soft localization. GPT4V, the best-performing VLM, achieves 62.99% accuracy (4-shot) on the comprehension task and 49.7% on the localization task (4-shot and Chain-of-Thought). Human evaluation reveals that humans achieve 91.03% and 100% accuracy on comprehension and localization, respectively. We discover that In-Context Learning (ICL) and Chain-of-Thought reasoning substantially degrade the performance of Gemini-Pro on the localization task. Tangentially, we discover a potential weakness in the ICL capabilities of VLMs: they fail to locate optical illusions even when the correct answer is in the context window as a few-shot example.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2403.15952 [cs.CV]
	(or arXiv:2403.15952v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.15952

Submission history

From: Haz Sameen Shahgir
[v1] Sat, 23 Mar 2024 23:06:32 UTC (3,101 KB)
[v2] Sat, 30 Mar 2024 13:21:42 UTC (3,235 KB)
[v3] Fri, 9 Aug 2024 14:26:02 UTC (2,855 KB)
