Paloma: A Benchmark for Evaluating Language Model Fit

Magnusson, Ian; Bhagia, Akshita; Hofmann, Valentin; Soldaini, Luca; Jha, Ananya Harsh; Tafjord, Oyvind; Schwenk, Dustin; Walsh, Evan Pete; Elazar, Yanai; Lo, Kyle; Groeneveld, Dirk; Beltagy, Iz; Hajishirzi, Hannaneh; Smith, Noah A.; Richardson, Kyle; Dodge, Jesse

arXiv:2312.10523 (cs)

[Submitted on 16 Dec 2023 (v1), last revised 7 Dec 2024 (this version, v2)]

Abstract: Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming that perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best, along with code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings, for example that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains, or that loss is dominated by the most frequently occurring strings in the vocabulary.
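To make the contrast between monolithic and per-domain perplexity concrete, here is a minimal sketch. It is not Paloma's evaluation code: the function names and the toy per-token probabilities are hypothetical, and it assumes perplexity is defined as the exponential of the mean negative log-likelihood per token (natural log).

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def per_domain_perplexity(logprobs_by_domain: Dict[str, List[float]]) -> Dict[str, float]:
    """Report one perplexity per domain instead of a single pooled number."""
    return {domain: perplexity(lps) for domain, lps in logprobs_by_domain.items()}


def pooled_perplexity(logprobs_by_domain: Dict[str, List[float]]) -> float:
    """Monolithic perplexity: pool every token from every domain together."""
    pooled = [lp for lps in logprobs_by_domain.values() for lp in lps]
    return perplexity(pooled)


# Hypothetical toy data: the model assigns probability 0.5 to each token in one
# domain and 0.125 to each token in the other. These are not real results.
toy_logprobs = {
    "r/depression": [math.log(0.5)] * 4,
    "Java": [math.log(0.125)] * 4,
}

by_domain = per_domain_perplexity(toy_logprobs)
pooled = pooled_perplexity(toy_logprobs)
print(by_domain)
print(pooled)
```

In this toy setup the pooled value falls between the two domain values, so a single reported number would hide how much worse the model fits one domain than the other.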

Comments:	Conference: NeurIPS 2024, Project Page: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2312.10523 [cs.CL]
	(or arXiv:2312.10523v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2312.10523

Submission history

From: Ian Magnusson
[v1] Sat, 16 Dec 2023 19:12:45 UTC (5,893 KB)
[v2] Sat, 7 Dec 2024 20:22:22 UTC (5,895 KB)
