Building and better understanding vision-language models: insights and future directions

Laurençon, Hugo; Marafioti, Andrés; Sanh, Victor; Tronchon, Léo

arXiv:2408.12637 (cs)

[Submitted on 22 Aug 2024]

Abstract: The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.12637 [cs.CV]
	(or arXiv:2408.12637v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.12637

Submission history

From: Hugo Laurençon
[v1] Thu, 22 Aug 2024 17:47:24 UTC (2,832 KB)
