-
SmolVLM: Redefining small and efficient multimodal models
Authors:
Andrés Marafioti,
Orr Zohar,
Miquel Farré,
Merve Noyan,
Elie Bakouch,
Pedro Cuenca,
Cyril Zakka,
Loubna Ben Allal,
Anton Lozhkov,
Nouamane Tazi,
Vaibhav Srivastav,
Joshua Lochner,
Hugo Larcher,
Mathieu Morlon,
Lewis Tunstall,
Leandro von Werra,
Thomas Wolf
Abstract:
Large Vision-Language Models (VLMs) deliver exceptional performance but require significant computational resources, limiting their deployment on mobile and edge devices. Smaller VLMs typically mirror design choices of larger models, such as extensive image tokenization, leading to inefficient GPU memory usage and constrained practicality for on-device applications.
We introduce SmolVLM, a series of compact multimodal models specifically engineered for resource-efficient inference. We systematically explore architectural configurations, tokenization strategies, and data curation optimized for low computational overhead. Through this, we identify key design choices that yield substantial performance gains on image and video tasks with minimal memory footprints.
Our smallest model, SmolVLM-256M, uses less than 1GB GPU memory during inference and outperforms the 300-times larger Idefics-80B model, despite an 18-month development gap. Our largest model, at 2.2B parameters, rivals state-of-the-art VLMs consuming twice the GPU memory. SmolVLM models extend beyond static images, demonstrating robust video comprehension capabilities.
Our results emphasize that strategic architectural optimizations, aggressive yet efficient tokenization, and carefully curated training data significantly enhance multimodal performance, facilitating practical, energy-efficient deployments at significantly smaller scales.
Submitted 7 April, 2025;
originally announced April 2025.
-
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Authors:
Ahmed Nassar,
Andres Marafioti,
Matteo Omenetti,
Maksym Lysak,
Nikolaos Livathinos,
Christoph Auer,
Lucas Morin,
Rafael Teixeira de Lima,
Yusik Kim,
A. Said Gurbuz,
Michele Dolfi,
Miquel Farré,
Peter W. J. Staar
Abstract:
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; datasets will be publicly available soon.
Submitted 14 March, 2025;
originally announced March 2025.
-
CinePile: A Long Video Question Answering Dataset and Benchmark
Authors:
Ruchit Rawal,
Khalid Saifullah,
Miquel Farré,
Ronen Basri,
David Jacobs,
Gowthami Somepalli,
Tom Goldstein
Abstract:
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The findings indicate that although current models underperform compared to humans, fine-tuning these models can lead to significant improvements in their performance.
Submitted 20 October, 2024; v1 submitted 14 May, 2024;
originally announced May 2024.
-
Towards an Interoperability Roadmap for the Energy Transition
Authors:
Valerie Reif,
Thomas I. Strasser,
Joseba Jimeno,
Marjolaine Farre,
Oliver Genest,
Amélie Gyrard,
Mark McGranaghan,
Gianluca Lipari,
Johann Schütz,
Mathias Uslar,
Sebastian Vogel,
Arsim Bytyqi,
Rita Dornmair,
Andreas Corusa,
Gaurav Roy,
Ferdinanda Ponci,
Alberto Dognini,
Antonello Monti
Abstract:
Smart grid interoperability is the means to achieve the twin green and digital transition but remains heterogeneous and fragmented to date. This work presents the first ideas and cornerstones of an Interoperability Roadmap for the Energy Transition that is being developed by the Horizon Europe int:net project. This roadmap builds on four cornerstones that address open interoperability issues. These are a knowledge base to address the lack of convergence among existing initiatives, a maturity model and a network of testing and certification facilities to address the lack of practical tools for the industry, and a governance process to address the gap between standards-related approaches of Standards Development Organisations and Research and Innovation projects. A community of practice will be set up to ensure the continuity of the ongoing activities related to smart grid interoperability. To outlive the duration of the int:net project, the aim is to formalise the community of practice as a legal entity.
Submitted 15 September, 2023;
originally announced September 2023.