Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models

Mamaghan, Amir Mohammad Karimi; Papa, Samuele; Johansson, Karl Henrik; Bauer, Stefan; Dittadi, Andrea

arXiv:2407.15589 (cs)

[Submitted on 22 Jul 2024 (v1), last revised 3 Mar 2025 (this version, v5)]

Abstract: Object-centric (OC) representations, which model visual scenes as compositions of discrete objects, have the potential to be used in various downstream tasks to achieve systematic compositional generalization and facilitate reasoning. However, these claims have yet to be thoroughly validated empirically. Recently, foundation models have demonstrated unparalleled capabilities across diverse domains, from language to computer vision, positioning them as a potential cornerstone of future research for a wide range of computational tasks. In this paper, we conduct an extensive empirical study on representation learning for downstream Visual Question Answering (VQA), which requires an accurate compositional understanding of the scene. We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches including large pre-trained foundation models on both synthetic and real-world data, ultimately identifying a promising path to leverage the strengths of both paradigms. The extensiveness of our study, encompassing over 600 downstream VQA models and 15 different types of upstream representations, also provides several additional insights that we believe will be of interest to the community at large.

Comments:	Published at ICLR 2025
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2407.15589 [cs.CV]
	(or arXiv:2407.15589v5 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.15589

Submission history

From: Amir Mohammad Karimi Mamaghan
[v1] Mon, 22 Jul 2024 12:26:08 UTC (3,518 KB)
[v2] Fri, 13 Sep 2024 10:47:25 UTC (3,518 KB)
[v3] Sat, 19 Oct 2024 03:59:31 UTC (3,517 KB)
[v4] Fri, 28 Feb 2025 17:32:26 UTC (2,031 KB)
[v5] Mon, 3 Mar 2025 11:48:03 UTC (2,031 KB)
