-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Authors:
Loubna Ben Allal,
Anton Lozhkov,
Elie Bakouch,
Gabriel Martín Blázquez,
Guilherme Penedo,
Lewis Tunstall,
Andrés Marafioti,
Hynek Kydlíček,
Agustín Piqueres Lajarín,
Vaibhav Srivastav,
Joshua Lochner,
Caleb Fahlgren,
Xuan-Son Nguyen,
Clémentine Fourrier,
Ben Burtenshaw,
Hugo Larcher,
Haojun Zhao,
Cyril Zakka,
Mathieu Morlon,
Colin Raffel,
Leandro von Werra,
Thomas Wolf
Abstract:
While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release SmolLM2 along with all of the datasets we prepared in the course of this project.
Submitted 4 February, 2025;
originally announced February 2025.
-
The Future of Open Human Feedback
Authors:
Shachar Don-Yehiya,
Ben Burtenshaw,
Ramon Fernandez Astudillo,
Cailean Osborne,
Mimansa Jaiswal,
Tzu-Sheng Kuo,
Wenting Zhao,
Idan Shenfeld,
Andi Peng,
Mikhail Yurochkin,
Atoosa Kasirzadeh,
Yangsibo Huang,
Tatsunori Hashimoto,
Yacine Jernite,
Daniel Vila-Suero,
Omri Abend,
Jennifer Ding,
Sara Hooker,
Hannah Rose Kirk,
Leshem Choshen
Abstract:
Human feedback on conversations with large language models (LLMs) is central to how these systems learn about the world, improve their capabilities, and are steered toward desirable and safe behaviors. However, this feedback is mostly collected by frontier AI labs and kept behind closed doors. In this work, we bring together interdisciplinary experts to assess the opportunities and challenges to realizing an open ecosystem of human feedback for AI. We first look for successful practices in peer production, open source, and citizen science communities. We then characterize the main challenges for open human feedback. For each, we survey current approaches and offer recommendations. We end by envisioning the components needed to underpin a sustainable and open human feedback ecosystem. At the center of this ecosystem are mutually beneficial feedback loops between users and specialized models, incentivizing a diverse stakeholder community of model trainers and feedback providers to support a general open feedback pool.
Submitted 4 September, 2024; v1 submitted 15 August, 2024;
originally announced August 2024.
-
AI Stories: An Interactive Narrative System for Children
Authors:
Ben Burtenshaw
Abstract:
AI Stories is a proposed interactive dialogue system that lets children co-create narrative worlds through conversation. Over the next three years this system will be developed and tested within pediatric wards, where it offers a useful resource in the gap between education and play. Telling and making stories is a fundamental part of language play, and its chatty and nonsensical qualities are important; therefore, the prolonged usage an automated system offers is a benefit to children. In this paper I will present the current state of this project, in its more experimental and general guise. Conceptually, storytelling through dialogue relates to the pre-print interpretation of story, beyond the static and linear medium, where stories were performative, temporal, and social.
Submitted 9 November, 2020;
originally announced November 2020.