
Showing 1–4 of 4 results for author: Tronchon, L

Searching in archive cs.
  1. arXiv:2408.12637

    cs.CV cs.AI

    Building and better understanding vision-language models: insights and future directions

    Authors: Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon

    Abstract: The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approache…

    Submitted 22 August, 2024; originally announced August 2024.

  2. arXiv:2405.02246

    cs.CV cs.AI

    What matters when building vision-language models?

    Authors: Hugo Laurençon, Léo Tronchon, Matthieu Cord, Victor Sanh

    Abstract: The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices im…

    Submitted 3 May, 2024; originally announced May 2024.

  3. arXiv:2403.09029

    cs.HC cs.AI cs.CV

    Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

    Authors: Hugo Laurençon, Léo Tronchon, Victor Sanh

    Abstract: Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML h…

    Submitted 13 March, 2024; originally announced March 2024.

  4. arXiv:2306.16527

    cs.IR cs.CV

    OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

    Authors: Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, Victor Sanh

    Abstract: Large multimodal models trained on natural documents, which interleave images and text, outperform models trained on image-text pairs on various multimodal benchmarks. However, the datasets used to train these models have not been released, and the collection process has not been fully specified. We introduce the OBELICS dataset, an open web-scale filtered dataset of interleaved image-text documen…

    Submitted 21 August, 2023; v1 submitted 21 June, 2023; originally announced June 2023.