Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

Kang, Feiyang; Just, Hoang Anh; Sun, Yifan; Jahagirdar, Himanshu; Zhang, Yuanzhi; Du, Rongxing; Sahu, Anit Kumar; Jia, Ruoxi

arXiv:2405.02774 (cs)

[Submitted on 5 May 2024]

Abstract: This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired performance levels. While many data selection algorithms have been designed for small-scale applications, rendering them unsuitable for our context, some emerging methods do cater to language data scales. However, they often prioritize data that aligns with the target distribution. While this strategy may be effective when training a model from scratch, it can yield limited results when the model has already been pre-trained on a different distribution. Differing from prior work, our key idea is to select data that nudges the pre-training distribution closer to the target distribution. We show the optimality of this approach for fine-tuning tasks under certain conditions. We demonstrate the efficacy of our methodology across a diverse array of tasks (NLU, NLG, zero-shot) with models up to 2.7B, showing that it consistently surpasses other selection methods. Moreover, our proposed method is significantly faster than existing techniques, scaling to millions of samples within a single GPU hour. Our code is open-sourced (Code repository: this https URL). While fine-tuning offers significant potential for enhancing performance across diverse tasks, its associated costs often limit its widespread adoption; with this work, we hope to lay the groundwork for cost-effective fine-tuning, making its benefits more accessible.
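
The contrast the abstract draws between selecting data that matches the target and selecting data that nudges the pre-training distribution toward the target can be illustrated with a toy calculation. The sketch below is not the authors' algorithm. It assumes, purely for illustration, that a pre-fine-tuned model behaves as if trained on a mixture `(1 - lam) * pretrain + lam * selected` over a few topic categories. The topic names, the mixing weight `lam`, and the residual-based rule in `select_nudge` are all hypothetical.

```python
# Toy illustration only: distributions over made-up topics, not real data.

def tv_distance(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def effective_distribution(pretrain, selected, lam):
    """Assumed mixture model: (1 - lam) * pretrain + lam * selected."""
    keys = set(pretrain) | set(selected)
    return {
        k: (1 - lam) * pretrain.get(k, 0.0) + lam * selected.get(k, 0.0)
        for k in keys
    }

def select_match_target(pretrain, target):
    """Baseline: choose data distributed like the target itself."""
    return dict(target)

def select_nudge(pretrain, target):
    """Hypothetical rule: choose data only from topics that the
    pre-training distribution under-represents relative to the target."""
    gap = {k: max(target.get(k, 0.0) - pretrain.get(k, 0.0), 0.0)
           for k in set(pretrain) | set(target)}
    total = sum(gap.values())
    if total == 0.0:  # pretrain already covers the target
        return dict(target)
    return {k: v / total for k, v in gap.items() if v > 0.0}

# Made-up example distributions and selection budget.
pretrain = {"news": 0.6, "code": 0.3, "bio": 0.1}
target = {"news": 0.2, "code": 0.3, "bio": 0.5}
lam = 0.2

match_mix = effective_distribution(
    pretrain, select_match_target(pretrain, target), lam)
nudge_mix = effective_distribution(
    pretrain, select_nudge(pretrain, target), lam)

print("no selection :", round(tv_distance(pretrain, target), 3))
print("match target :", round(tv_distance(match_mix, target), 3))
print("nudge        :", round(tv_distance(nudge_mix, target), 3))
```

Under these assumptions, the nudging rule leaves the effective distribution closer to the target than target matching does, because none of the selection budget goes to topics the pre-training data already covers. The paper's actual selection criterion and its optimality conditions are given in the full text, not here.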

Comments:	Published as a conference paper at ICLR 2024
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2405.02774 [cs.LG]
	(or arXiv:2405.02774v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2405.02774

Submission history

From: Feiyang Kang
[v1] Sun, 5 May 2024 00:08:00 UTC (32,818 KB)
