VGRP-Bench: Visual Grid Reasoning Puzzle Benchmark for Large Vision-Language Models

Ren, Yufan; Tertikas, Konstantinos; Maiti, Shalini; Han, Junlin; Zhang, Tong; Süsstrunk, Sabine; Kokkinos, Filippos

arXiv:2503.23064 (cs)

[Submitted on 29 Mar 2025 (v1), last revised 2 Apr 2025 (this version, v2)]

Abstract: Large Vision-Language Models (LVLMs) struggle with puzzles, which require precise perception, rule comprehension, and logical reasoning. Assessing and enhancing their performance in this domain is crucial, as it reflects their ability to engage in structured reasoning, an essential skill for real-world problem-solving. However, existing benchmarks primarily evaluate pre-trained models without additional training or fine-tuning, often lack a dedicated focus on reasoning, and fail to establish a systematic evaluation framework. To address these limitations, we introduce VGRP-Bench, a Visual Grid Reasoning Puzzle Benchmark featuring 20 diverse puzzles. VGRP-Bench spans multiple difficulty levels and includes extensive experiments not only on existing chat LVLMs (e.g., GPT-4o), but also on reasoning LVLMs (e.g., Gemini-Thinking). Our results reveal that even state-of-the-art LVLMs struggle with these puzzles, highlighting fundamental limitations in their puzzle-solving capabilities. Most importantly, through systematic experiments, we identify and analyze key factors influencing LVLMs' puzzle-solving performance, including the number of clues, grid size, and rule complexity. Furthermore, we explore two Supervised Fine-Tuning (SFT) strategies that can be used in post-training: SFT on solutions (S-SFT) and SFT on synthetic reasoning processes (R-SFT). While both methods significantly improve performance on trained puzzles, they exhibit limited generalization to unseen ones. We will release VGRP-Bench to facilitate further research on LVLMs for complex, real-world problem-solving. Project page: this https URL.

Comments:	8 pages; Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2503.23064 [cs.CV]
	(or arXiv:2503.23064v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2503.23064

Submission history

From: Yufan Ren
[v1] Sat, 29 Mar 2025 12:50:38 UTC (5,634 KB)
[v2] Wed, 2 Apr 2025 07:10:05 UTC (5,634 KB)
