Paloma: A Benchmark for Evaluating Language Model Fit

Authors: Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge

Abstract: Evaluations of language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains--varying distributions of language. We introduce Perplexity Analysis for Language Model Assessment (Paloma), a benchmark to measure LM fit to 546 English and code domains, instead of assuming perplexity on one distribution extrapolates to others. We include two new datasets of the top 100 subreddits (e.g., r/depression on Reddit) and programming languages (e.g., Java on GitHub), both sources common in contemporary LMs. With our benchmark, we release 6 baseline 1B LMs carefully controlled to provide fair comparisons about which pretraining corpus is best and code for others to apply those controls to their own experiments. Our case studies demonstrate how the fine-grained results from Paloma surface findings such as that models pretrained without data beyond Common Crawl exhibit anomalous gaps in LM fit to many domains or that loss is dominated by the most frequently occurring strings in the vocabulary.

Submitted 7 December, 2024; v1 submitted 16 December, 2023; originally announced December 2023.

Comments: Conference: NeurIPS 2024, Project Page: https://paloma.allen.ai/
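
The contrast the Paloma abstract draws between monolithic and per-domain perplexity can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names, the domain labels, and the toy token log-probabilities are all assumptions made up for the example.

```python
import math
from collections import defaultdict


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def perplexity_by_domain(records):
    """Group token log-probabilities by domain and score each domain separately.

    `records` is an iterable of (domain, token_log_probs) pairs (hypothetical format).
    """
    grouped = defaultdict(list)
    for domain, log_probs in records:
        grouped[domain].extend(log_probs)
    return {domain: perplexity(lps) for domain, lps in grouped.items()}


# Toy data (made up): the model assigns probability 0.5 to every token in one
# domain and 0.25 to every token in the other.
records = [
    ("reddit", [math.log(0.5), math.log(0.5)]),
    ("github", [math.log(0.25), math.log(0.25)]),
]

per_domain = perplexity_by_domain(records)
# Monolithic perplexity pools all tokens and hides the gap between domains.
monolithic = perplexity([lp for _, lps in records for lp in lps])
```

In this toy case the pooled score falls between the two per-domain scores, which is the kind of gap a single held-out number cannot reveal.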

arXiv:2307.09701 [pdf, other]

Efficiency Pentathlon: A Standardized Arena for Efficiency Evaluation

Authors: Hao Peng, Qingqing Cao, Jesse Dodge, Matthew E. Peters, Jared Fernandez, Tom Sherborne, Kyle Lo, Sam Skjonsberg, Emma Strubell, Darrell Plessas, Iz Beltagy, Evan Pete Walsh, Noah A. Smith, Hannaneh Hajishirzi

Abstract: Rising computational demands of modern natural language processing (NLP) systems have increased the barrier to entry for cutting-edge research while posing serious environmental concerns. Yet, progress on model efficiency has been impeded by practical challenges in model evaluation and comparison. For example, hardware is challenging to control due to disparate levels of accessibility across different institutions. Moreover, improvements in metrics such as FLOPs often fail to translate to progress in real-world applications. In response, we introduce Pentathlon, a benchmark for holistic and realistic evaluation of model efficiency. Pentathlon focuses on inference, which accounts for a majority of the compute in a model's lifecycle. It offers a strictly-controlled hardware platform, and is designed to mirror real-world application scenarios. It incorporates a suite of metrics that target different aspects of efficiency, including latency, throughput, memory overhead, and energy consumption. Pentathlon also comes with a software library that can be seamlessly integrated into any codebase and enable evaluation. As a standardized and centralized evaluation platform, Pentathlon can drastically reduce the workload to make fair and reproducible efficiency comparisons. While initially focused on natural language processing (NLP) models, Pentathlon is designed to allow flexible extension to other fields. We envision Pentathlon will stimulate algorithmic innovations in building efficient models, and foster an increased awareness of the social and environmental implications in the development of future-generation NLP models.

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2305.14864 [pdf, other]

Just CHOP: Embarrassingly Simple LLM Compression

Authors: Ananya Harsh Jha, Tom Sherborne, Evan Pete Walsh, Dirk Groeneveld, Emma Strubell, Iz Beltagy

Abstract: Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint. A growing assortment of methods for compression promises to reduce the computational burden of LLMs in deployment, but so far, only quantization approaches have been demonstrated to be effective for LLM compression while maintaining zero-shot performance. A critical step in the compression process, the pretrain-then-finetune paradigm, has largely been overlooked when adapting existing pruning strategies to LLMs or proposing new ones. In this work, we show that embarrassingly simple layer pruning coupled with an extended language model pretraining as the finetuning phase produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale while being more inference efficient. We call this method LayerChop, where we deterministically remove layers from a model followed by task-agnostic finetuning of the remaining weights by continued self-supervised pretraining. At this scale, we also show how distillation, which has been super effective in task-agnostic compression of smaller BERT-style models, becomes inefficient against our simple pruning technique.

Submitted 9 July, 2024; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 13 pages, 6 figures, 6 tables
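
The deterministic layer-removal step mentioned in the abstract above can be illustrated with a small sketch. The abstract does not say which layers are removed, so dropping the last layers here is purely an assumption for illustration; `chop_layers` and the `block_i` names are hypothetical, and the subsequent finetuning by continued pretraining is not shown.

```python
def chop_layers(layers, num_to_remove):
    """Return a new layer stack with `num_to_remove` layers dropped.

    Assumption for illustration only: layers are dropped from the end of the
    stack. The rule is deterministic, so the same input always yields the
    same pruned model.
    """
    if not 0 <= num_to_remove < len(layers):
        raise ValueError("num_to_remove must leave at least one layer")
    keep = len(layers) - num_to_remove
    return list(layers[:keep])


# Stand-in for a stack of transformer blocks (names are made up).
layers = [f"block_{i}" for i in range(8)]
pruned = chop_layers(layers, 2)
```

The original list is left unchanged, so the full and pruned stacks can be compared side by side before any finetuning.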

arXiv:2211.14959 [pdf, other]

Method for in-solution, high-throughput T1 relaxometry using fluorescent nanodiamonds

Authors: Erin S. Grant, Mina Barzegar Amiri Olia, Ella P. Walsh, Liam T. Hall, Gawain McColl, David A. Simpson

Abstract: Fluorescent nanodiamonds (FNDs) have been exploited as sensitive quantum probes for nanoscale chemical and biological sensing applications, with the majority of demonstrations to date relying on the detection of single FNDs. This places significant limits on the measurement time, throughput and statistical significance of a measured result as there is usually marked inhomogeneity within FND samples. Here we have developed a measurement platform that can report the T1 spin relaxation time from a large ensemble of FNDs in solution. We first describe a refined sensing protocol for this modality and then use it to identify the optimal FND size for the detection of paramagnetic targets. Our approach is simple to set up, robust and can be used for rapid material characterisation or a variety of in-situ quantum sensing applications.

Submitted 27 November, 2022; originally announced November 2022.

Comments: 8 pages, 3 figures

Showing 1–4 of 4 results for author: Walsh, E P