Towards Foundation Models for 3D Vision: How Close Are We?

Zuo, Yiming; Kayan, Karhan; Wang, Maggie; Jeon, Kevin; Deng, Jia; Griffiths, Thomas L.

arXiv:2410.10799v2 (cs)

[Submitted on 14 Oct 2024 (v1), last revised 9 Dec 2024 (this version, v2)]

Abstract: Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models and to identify the gaps between these models and humans. We therefore construct a new 3D visual understanding benchmark named UniQA-3D. UniQA-3D covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision. Code is available at this https URL.

Comments: Accepted to 3DV 2025. Update 12/09/24: changed the benchmark name to UniQA-3D and added a link to the code.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2410.10799 [cs.CV] (or arXiv:2410.10799v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2410.10799

Submission history

From: Karhan Kayan
[v1] Mon, 14 Oct 2024 17:57:23 UTC (9,847 KB)
[v2] Mon, 9 Dec 2024 18:58:03 UTC (9,847 KB)
