Showing 1–50 of 77 results for author: Choshen, L

  1. arXiv:2504.11442  [pdf, other]

    cs.CL cs.AI cs.LG cs.MA

    TextArena

    Authors: Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan

    Abstract: TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scor…

    Submitted 24 May, 2025; v1 submitted 15 April, 2025; originally announced April 2025.

    Comments: Work in progress; 5 pages, 3 figures

  2. Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

    Authors: Alex Warstadt, Aaron Mueller, Leshem Choshen, Ethan Wilcox, Chengxu Zhuang, Juan Ciro, Rafael Mosquera, Bhargavi Paranjape, Adina Williams, Tal Linzen, Ryan Cotterell

    Abstract: Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data and still do not perform as well as humans on many evaluations. These intensive resource demands limit the ability of researchers to train new models and use existing models as developmentally plausible cognitive mod…

    Submitted 10 April, 2025; originally announced April 2025.

    Comments: Published in Proceedings of BabyLM. Please cite the published version on ACL anthology: http://aclanthology.org/2023.conll-babylm.1/

    Journal ref: 2023. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1-34, Singapore. Association for Computational Linguistics

  3. arXiv:2504.05523  [pdf, other]

    cs.CL

    Pretraining Language Models for Diachronic Linguistic Change Discovery

    Authors: Elisabeth Fittschen, Sabrina Li, Tom Lippincott, Leshem Choshen, Craig Messner

    Abstract: Large language models (LLMs) have shown potential as tools for scientific discovery. This has engendered growing interest in their use in humanistic disciplines, such as historical linguistics and literary studies. These fields often construct arguments on the basis of delineations like genre, or more inflexibly, time period. Although efforts have been made to restrict inference to specific domain…

    Submitted 9 April, 2025; v1 submitted 7 April, 2025; originally announced April 2025.

  4. arXiv:2503.13507  [pdf, other]

    cs.CL cs.AI

    NeurIPS 2023 LLM Efficiency Fine-tuning Competition

    Authors: Mark Saroufim, Yotam Perlitz, Leshem Choshen, Luca Antiga, Greg Bowyer, Christian Puhrsch, Driss Guessous, Supriya Rao, Geeta Chauhan, Ashvini Kumar Jindal, Pawan Kumar Rajpoot, Ankur Parikh, Joe Isaacson, Weiwei Yang

    Abstract: Our analysis of the NeurIPS 2023 large language model (LLM) fine-tuning competition revealed two trends: top-performing models exhibit significant overfitting on benchmark datasets, mirroring the broader issue of benchmark overfitting on popular leaderboards, and data curation is essential for obtaining a high-performing LLM. The competition, which consisted of two stages - an open…

    Submitted 13 March, 2025; originally announced March 2025.

    Comments: 11 pages, 10 figures

  5. arXiv:2503.01622  [pdf, ps, other]

    cs.CL

    DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

    Authors: Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky

    Abstract: Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to pr…

    Submitted 3 June, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  6. arXiv:2502.19412  [pdf, other]

    cs.CL

    The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

    Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

    Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of…

    Submitted 2 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  7. arXiv:2502.10645  [pdf, other]

    cs.CL

    BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop

    Authors: Lucas Charpentier, Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Michael Hu, Jaap Jumelet, Tal Linzen, Jing Liu, Aaron Mueller, Candace Ross, Raj Sanjay Shah, Alex Warstadt, Ethan Wilcox, Adina Williams

    Abstract: BabyLM aims to dissolve the boundaries between cognitive modeling and language modeling. We call for both workshop papers and for researchers to join the 3rd BabyLM competition. As in previous years, we call for participants in the data-efficient pretraining challenge in the general track. This year, we also offer a new track: INTERACTION. This new track encourages interactive behavior, learning f…

    Submitted 24 February, 2025; v1 submitted 14 February, 2025; originally announced February 2025.

    Comments: EMNLP 2025 BabyLM Workshop. arXiv admin note: text overlap with arXiv:2404.06214

  8. arXiv:2412.06540  [pdf, other]

    cs.LG cs.AI stat.ML

    Sloth: scaling laws for LLM skills to predict multi-benchmark performance across families

    Authors: Felipe Maia Polo, Seamus Somerstep, Leshem Choshen, Yuekai Sun, Mikhail Yurochkin

    Abstract: Scaling laws for large language models (LLMs) predict model performance based on parameters like size and training data. However, differences in training configurations and data processing across model families lead to significant variations in benchmark performance, making it difficult for a single scaling law to generalize across all LLMs. On the other hand, training family-specific scaling laws…

    Submitted 4 February, 2025; v1 submitted 9 December, 2024; originally announced December 2024.

  9. arXiv:2412.05149  [pdf, other]

    cs.CL

    Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

    Authors: Michael Y. Hu, Aaron Mueller, Candace Ross, Adina Williams, Tal Linzen, Chengxu Zhuang, Ryan Cotterell, Leshem Choshen, Alex Warstadt, Ethan Gotlieb Wilcox

    Abstract: The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language mo…

    Submitted 6 December, 2024; originally announced December 2024.

  10. arXiv:2412.03304  [pdf, other]

    cs.CL

    Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

    Authors: Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David I. Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Wei-Yin Ko, Sebastian Ruder, Madeline Smith, Antoine Bosselut, Alice Oh, Andre F. T. Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, Sara Hooker

    Abstract: Cultural biases in multilingual datasets pose significant challenges for their effectiveness as global benchmarks. These biases stem not only from differences in language but also from the cultural knowledge required to interpret questions, reducing the practical utility of translated datasets like MMLU. Furthermore, translation often introduces artefacts that can distort the meaning or clarity of…

    Submitted 19 February, 2025; v1 submitted 4 December, 2024; originally announced December 2024.

  11. arXiv:2411.05239  [pdf, ps, other]

    cs.LG cs.IT

    ZipNN: Lossless Compression for AI Models

    Authors: Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Ilias Ennmouri, Michal Malka, Peter Chin, Swaminathan Sundararaman, Danny Harnik

    Abstract: With the growth of model sizes and the scale of their deployment, their sheer size burdens the infrastructure, requiring more network and more storage to accommodate them. While there is a vast model compression literature deleting parts of the model weights for faster inference, we investigate a more traditional type of compression - one that represents the model in a compact form and is coupled…

    Submitted 4 June, 2025; v1 submitted 7 November, 2024; originally announced November 2024.

    Comments: IEEE Cloud. arXiv admin note: text overlap with arXiv:2404.15198

  12. arXiv:2410.19735  [pdf, other]

    cs.CV

    Model merging with SVD to tie the Knots

    Authors: George Stoica, Pratik Ramesh, Boglarka Ecsedi, Leshem Choshen, Judy Hoffman

    Abstract: Recent model merging methods demonstrate that the parameters of fully-finetuned models specializing in distinct tasks can be combined into one model capable of solving all tasks without retraining. Yet, this success does not transfer well when merging LoRA finetuned models. We study this phenomenon and observe that the weights of LoRA finetuned models showcase a lower degree of alignment compared…

    Submitted 25 October, 2024; originally announced October 2024.

  13. arXiv:2410.11840  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    A Hitchhiker's Guide to Scaling Law Estimation

    Authors: Leshem Choshen, Yang Zhang, Jacob Andreas

    Abstract: Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language mode…

    Submitted 2 June, 2025; v1 submitted 15 October, 2024; originally announced October 2024.

    Comments: ICML

  14. arXiv:2408.16961  [pdf, other]

    cs.HC cs.AI

    The Future of Open Human Feedback

    Authors: Shachar Don-Yehiya, Ben Burtenshaw, Ramon Fernandez Astudillo, Cailean Osborne, Mimansa Jaiswal, Tzu-Sheng Kuo, Wenting Zhao, Idan Shenfeld, Andi Peng, Mikhail Yurochkin, Atoosa Kasirzadeh, Yangsibo Huang, Tatsunori Hashimoto, Yacine Jernite, Daniel Vila-Suero, Omri Abend, Jennifer Ding, Sara Hooker, Hannah Rose Kirk, Leshem Choshen

    Abstract: Human feedback on conversations with large language models (LLMs) is central to how these systems learn about the world, improve their capabilities, and are steered toward desirable and safe behaviors. However, this feedback is mostly collected by frontier AI labs and kept behind closed doors. In this work, we bring together interdisciplinary experts to assess the opportunities and challenges t…

    Submitted 4 September, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

  15. arXiv:2408.12259  [pdf, other]

    cs.AI

    How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability

    Authors: Ora Nova Fandina, Leshem Choshen, Eitan Farchi, George Kour, Yotam Perlitz, Orna Raz

    Abstract: Consider a scenario where a harmfulness evaluation metric is intended to filter unsafe responses from a Large Language Model. When applied to individual harmful prompt-response pairs, it correctly flags them as unsafe by assigning a high-risk score. Yet, if those same pairs are concatenated, the metric's decision unexpectedly reverses - labelling the combined content as safe with a low score, allowing…

    Submitted 12 February, 2025; v1 submitted 22 August, 2024; originally announced August 2024.

    MSC Class: 68T50

  16. arXiv:2408.10646  [pdf, other]

    cs.CL cs.AI

    Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

    Authors: Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, Omri Abend

    Abstract: The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model's ability to answer a query consistently across langu…

    Submitted 20 August, 2024; originally announced August 2024.

  17. arXiv:2408.08291  [pdf, other]

    cs.CL

    The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community

    Authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend

    Abstract: Human-model conversations provide a window into users' real-world scenarios, behavior, and needs, and thus are a valuable resource for model development and research. While for-profit companies collect user data through the APIs of their models, using it internally to improve their own models, the open source and research community lags behind. We introduce the ShareLM collection, a unified set…

    Submitted 3 March, 2025; v1 submitted 15 August, 2024; originally announced August 2024.

  18. arXiv:2408.07057  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning

    Authors: Prateek Yadav, Colin Raffel, Mohammed Muqeeth, Lucas Caccia, Haokun Liu, Tianlong Chen, Mohit Bansal, Leshem Choshen, Alessandro Sordoni

    Abstract: The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. Model MoErging methods aim to recycle expert models to create an aggregate system with improved performance or generalization. A key component of MoErging methods is the creation of a router that decides which expert model(s) to use for a par…

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: 26 pages

  19. arXiv:2407.21530  [pdf, other]

    cs.CL cs.LG

    Data Contamination Report from the 2024 CONDA Shared Task

    Authors: Oscar Sainz, Iker García-Ferrero, Alon Jacovi, Jon Ander Campos, Yanai Elazar, Eneko Agirre, Yoav Goldberg, Wei-Lin Chen, Jenny Chim, Leshem Choshen, Luca D'Amico-Wong, Melissa Dell, Run-Ze Fan, Shahriar Golchin, Yucheng Li, Pengfei Liu, Bhavish Pahwa, Ameya Prabhu, Suryansh Sharma, Emily Silcock, Kateryna Solonko, David Stap, Mihai Surdeanu, Yu-Min Tseng, Vishaal Udandarao, et al. (3 additional authors not shown)

    Abstract: The 1st Workshop on Data Contamination (CONDA 2024) focuses on all relevant aspects of data contamination in natural language processing, where data contamination is understood as situations where evaluation data is included in pre-training corpora used to train large scale models, compromising evaluation results. The workshop fostered a shared task to collect evidence on data contamination in cur…

    Submitted 4 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

    Comments: https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database

  20. arXiv:2407.13696  [pdf, other]

    cs.CL

    Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

    Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank c…

    Submitted 12 September, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Under Review

  21. arXiv:2407.10944  [pdf, other]

    cs.CL

    Naturally Occurring Feedback is Common, Extractable and Useful

    Authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend

    Abstract: Human feedback data is a critical component in developing language models. However, collecting this feedback is costly and ultimately not scalable. Inspired by the way human interlocutors provide spontaneous unsolicited feedback to each other, we propose to extract feedback that users naturally include when interacting with chat models. We manually annotated conversations to confirm the presence o…

    Submitted 3 March, 2025; v1 submitted 15 July, 2024; originally announced July 2024.

  22. arXiv:2407.00066  [pdf, ps, other]

    cs.DC cs.AI cs.CL cs.LG

    Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

    Authors: Rickard Brüel-Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen, Kristjan Greenewald, Mikhail Yurochkin, Justin Solomon

    Abstract: Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and…

    Submitted 29 May, 2025; v1 submitted 17 June, 2024; originally announced July 2024.

  23. arXiv:2405.17202  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    Efficient multi-prompt evaluation of LLMs

    Authors: Felipe Maia Polo, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, Mikhail Yurochkin

    Abstract: Most popular benchmarks for comparing LLMs rely on a limited set of prompt templates, which may not fully capture the LLMs' abilities and can affect the reproducibility of results on leaderboards. Many recent works empirically verify prompt sensitivity and advocate for changes in LLM evaluation. In this paper, we consider the problem of estimating the performance distribution across many prompt va…

    Submitted 30 October, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

    Comments: NeurIPS 2024

  24. arXiv:2405.09605  [pdf, other]

    cs.CL cs.AI cs.LG

    Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models

    Authors: Anna A. Ivanova, Aalok Sathe, Benjamin Lipkin, Unnathi Kumar, Setayesh Radkani, Thomas H. Clark, Carina Kauf, Jennifer Hu, R. T. Pramod, Gabriel Grand, Vivian Paulun, Maria Ryskina, Ekin Akyürek, Ethan Wilcox, Nafisa Rashid, Leshem Choshen, Roger Levy, Evelina Fedorenko, Joshua Tenenbaum, Jacob Andreas

    Abstract: The ability to build and leverage world models is essential for a general-purpose AI agent. Testing such capabilities is hard, in part because the building blocks of world models are ill-defined. We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in language models by testing their ability to use knowledge of a concept to match a target text with a plausible/i…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: 21 pages (11 main), 7 figures. Authors Anna Ivanova, Aalok Sathe, Benjamin Lipkin contributed equally

  25. arXiv:2404.18923  [pdf, other]

    cs.CL

    Holmes: A Benchmark to Assess the Linguistic Competence of Language Models

    Authors: Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych

    Abstract: We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence - their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from o…

    Submitted 22 October, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

  26. arXiv:2404.15198  [pdf, other]

    cs.LG cs.IT

    Lossless and Near-Lossless Compression for Foundation Models

    Authors: Moshik Hershcovitch, Leshem Choshen, Andrew Wood, Ilias Enmouri, Peter Chin, Swaminathan Sundararaman, Danny Harnik

    Abstract: With the growth of model sizes and scale of their deployment, their sheer size burdens the infrastructure, requiring more network and more storage to accommodate them. While there is a vast literature about reducing model sizes, we investigate a more traditional type of compression -- one that compresses the model to a smaller form and is coupled with a decompression algorithm that returns it to i…

    Submitted 5 April, 2024; originally announced April 2024.

  27. arXiv:2404.06214  [pdf, other]

    cs.CL

    [Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

    Authors: Leshem Choshen, Ryan Cotterell, Michael Y. Hu, Tal Linzen, Aaron Mueller, Candace Ross, Alex Warstadt, Ethan Wilcox, Adina Williams, Chengxu Zhuang

    Abstract: After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-…

    Submitted 27 July, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

  28. arXiv:2404.00459  [pdf, other]

    cs.CL

    NumeroLogic: Number Encoding for Enhanced LLMs' Numerical Reasoning

    Authors: Eli Schwartz, Leshem Choshen, Joseph Shtok, Sivan Doveh, Leonid Karlinsky, Assaf Arbelle

    Abstract: Language models struggle with handling numerical data and performing arithmetic operations. We hypothesize that this limitation can be partially attributed to non-intuitive textual numbers representation. When a digit is read or generated by a causal language model it does not know its place value (e.g. thousands vs. hundreds) until the entire number is processed. To address this issue, we propose…

    Submitted 26 September, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

  29. arXiv:2402.16842  [pdf, other]

    cs.LG

    Asymmetry in Low-Rank Adapters of Foundation Models

    Authors: Jiacheng Zhu, Kristjan Greenewald, Kimia Nadjahi, Haitz Sáez de Ocáriz Borde, Rickard Brüel Gabrielsson, Leshem Choshen, Marzyeh Ghassemi, Mikhail Yurochkin, Justin Solomon

    Abstract: Parameter-efficient fine-tuning optimizes large, pre-trained foundation models by updating a subset of parameters; in this class, Low-Rank Adaptation (LoRA) is particularly effective. Inspired by an effort to investigate the different roles of LoRA matrices during fine-tuning, this paper characterizes and leverages unexpected asymmetry in the importance of low-rank adapter matrices. Specifically,…

    Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

    Comments: 17 pages, 2 figures, 9 tables

  30. arXiv:2402.14992  [pdf, other]

    cs.CL cs.AI cs.LG stat.ML

    tinyBenchmarks: evaluating LLMs with fewer examples

    Authors: Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

    Abstract: The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very expensive. In this paper, we investigate strategies to reduce the number of evaluations needed to assess the performance of an LLM on several key benchmarks. F…

    Submitted 26 May, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: Proceedings of the 41st International Conference on Machine Learning (ICML)

  31. arXiv:2402.07891  [pdf, other]

    cs.CL cs.LG

    Label-Efficient Model Selection for Text Generation

    Authors: Shir Ashury-Tahan, Ariel Gera, Benjamin Sznajder, Leshem Choshen, Liat Ein-Dor, Eyal Shnarch

    Abstract: Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluatio…

    Submitted 6 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL (main conference)

  32. arXiv:2401.14367  [pdf, other]

    cs.CL cs.AI cs.LG

    Genie: Achieving Human Parity in Content-Grounded Datasets Generation

    Authors: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen

    Abstract: The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation, (b) Generation: creating task-specific examples from the content (e.g., question-answer pairs…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Accepted to ICLR24

  33. arXiv:2401.14019  [pdf, other]

    cs.CL cs.AI

    Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

    Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehiya, Dafna Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz

    Abstract: In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we p…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Submitted to NAACL demo track

  34. arXiv:2401.08574  [pdf, other]

    cs.CL

    Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability

    Authors: Afra Feyza Akyürek, Ekin Akyürek, Leshem Choshen, Derry Wijaya, Jacob Andreas

    Abstract: While language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. As a consequence, current LMs also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. We present a method called Deductive Closure Training (DCT) that use…

    Submitted 26 June, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

    Comments: ACL Findings

  35. arXiv:2311.13171  [pdf, other]

    cs.LG cs.AI cs.CL

    ComPEFT: Compression for Communicating Parameter Efficient Updates via Sparsification and Quantization

    Authors: Prateek Yadav, Leshem Choshen, Colin Raffel, Mohit Bansal

    Abstract: Parameter-efficient fine-tuning (PEFT) techniques make it possible to efficiently adapt a language model to create "expert" models that specialize to new tasks or domains. Recent techniques in model merging and compositional generalization leverage these expert models by dynamically composing modules to improve zero/few-shot generalization. Despite the efficiency of PEFT methods, the size of exper…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 25 Pages, 6 Figures, 16 Tables

  36. arXiv:2311.12131  [pdf, other]

    cs.CL

    Human Learning by Model Feedback: The Dynamics of Iterative Prompting with Midjourney

    Authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend

    Abstract: Generating images with a Text-to-Image model often requires multiple trials, where human users iteratively update their prompt based on feedback, namely the output image. Taking inspiration from cognitive work on reference games and dialogue alignment, this paper analyzes the dynamics of the user prompts along such iterations. We compile a dataset of iterative interactions of human users with Midj…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: EMNLP23

  37. arXiv:2311.07682  [pdf, other]

    cs.CL cs.AI cs.LG

    Fuse to Forget: Bias Reduction and Selective Memorization through Model Fusion

    Authors: Kerem Zaman, Leshem Choshen, Shashank Srivastava

    Abstract: Model fusion research aims to aggregate the knowledge of multiple individual models to enhance performance by combining their weights. In this work, we study the inverse problem: investigating whether model fusion can be used to reduce unwanted knowledge. We investigate the effects of model fusion in three scenarios: the learning of shortcuts, social biases, and memorization of training data in fi…

    Submitted 9 October, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

    Comments: 21 pages, 12 figures, 7 tables; To appear at EMNLP 2024

  38. arXiv:2308.11696  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Efficient Benchmarking of Language Models

    Authors: Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present t…

    Submitted 1 April, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted to NAACL main track

  39. arXiv:2306.01708  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    TIES-Merging: Resolving Interference When Merging Models

    Authors: Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, Mohit Bansal

    Abstract: Transfer learning - i.e., further fine-tuning a pre-trained model on a downstream task - can confer significant advantages, including improved downstream performance, faster convergence, and better sample efficiency. These advantages have led to a proliferation of task-specific fine-tuned models, which typically can only perform a single task and do not benefit from one another. Recently, model me…

    Submitted 26 October, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023, 23 Pages, 13 Figures, 14 Tables

  40. arXiv:2305.14991

    cs.CL cs.AI

    MuLER: Detailed and Scalable Reference-based Evaluation

    Authors: Taelin Karidi, Leshem Choshen, Gal Patel, Omri Abend

    Abstract: We propose a novel methodology (namely, MuLER) that transforms any reference-based evaluation metric for text generation, such as machine translation (MT), into a fine-grained analysis tool. Given a system and a metric, MuLER quantifies how much the chosen metric penalizes specific error types (e.g., errors in translating names of locations). MuLER thus enables a detailed error analysis which can l…

    Submitted 29 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  41. arXiv:2303.09435

    cs.CL

    Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

    Authors: Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva

    Abstract: Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its intermediate representations. One way to elucidate this is to cast the hidden representations as final representations, bypassing the transformer computation in…

    Submitted 18 June, 2024; v1 submitted 16 March, 2023; originally announced March 2023.

    Journal ref: LREC-COLING 2024

  42. arXiv:2302.04863

    cs.LG cs.AI cs.CL

    Knowledge is a Region in Weight Space for Fine-tuned Language Models

    Authors: Almog Gueta, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, Leshem Choshen

    Abstract: Research on neural networks has focused on understanding a single model trained on a single dataset. However, relatively little is known about the relationships between different models, particularly those trained or tested on different datasets. We address this by studying how the weight space and the underlying loss landscape of different models are interconnected. Specifically, we demonstrate…

    Submitted 12 October, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

  43. arXiv:2301.11796

    cs.CL

    Call for Papers -- The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

    Authors: Alex Warstadt, Leshem Choshen, Aaron Mueller, Adina Williams, Ethan Wilcox, Chengxu Zhuang

    Abstract: We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size…

    Submitted 27 January, 2023; originally announced January 2023.

  44. arXiv:2212.01378

    cs.LG cs.CL cs.DC

    ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

    Authors: Shachar Don-Yehiya, Elad Venezian, Colin Raffel, Noam Slonim, Yoav Katz, Leshem Choshen

    Abstract: We propose a new paradigm to continually evolve pretrained models, denoted ColD Fusion. It provides the benefits of multitask learning but leverages distributed computation with limited communication and eliminates the need for shared data. Consequently, ColD Fusion can give rise to a synergistic loop, where finetuned models can be recycled to continually improve the pretrained model they are b…

    Submitted 13 September, 2023; v1 submitted 2 December, 2022; originally announced December 2022.

    Comments: ACL 23

  45. arXiv:2211.05655

    cs.CL cs.AI cs.LG

    DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

    Authors: Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend

    Abstract: Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA mo…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: 12 pages, 2 figures

  46. arXiv:2211.00107

    cs.CL cs.AI cs.LG

    Where to start? Analyzing the potential value of intermediate models

    Authors: Leshem Choshen, Elad Venezian, Shachar Don-Yehiya, Noam Slonim, Yoav Katz

    Abstract: Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this intertraining scheme, over a wide range of English classification tasks. Surprisingly, our analysis su…

    Submitted 10 November, 2022; v1 submitted 31 October, 2022; originally announced November 2022.

    Comments: https://ibm.github.io/model-recycling/

  47. arXiv:2210.03053

    cs.CL cs.AI cs.LG

    Reinforcement Learning with Large Action Spaces for Neural Machine Translation

    Authors: Asaf Yehudai, Leshem Choshen, Lior Fox, Omri Abend

    Abstract: Applying reinforcement learning (RL) following maximum likelihood estimation (MLE) pre-training is a versatile method for enhancing neural machine translation (NMT) performance. However, recent work has argued that the gains produced by RL for NMT are mostly due to promoting tokens that have already received a fairly high probability in pre-training. We hypothesize that the large action space is a…

    Submitted 6 October, 2022; originally announced October 2022.

    Comments: Accepted for Coling

  48. arXiv:2208.01483

    cs.CL cs.HC

    Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

    Authors: Eyal Shnarch, Alon Halfon, Ariel Gera, Marina Danilevsky, Yannis Katsis, Leshem Choshen, Martin Santillan Cooper, Dina Epelboim, Zheng Zhang, Dakuo Wang, Lucy Yip, Liat Ein-Dor, Lena Dankin, Ilya Shnayderman, Ranit Aharonov, Yunyao Li, Naftali Liberman, Philip Levin Slesarev, Gwilym Newton, Shila Ofek-Koifman, Noam Slonim, Yoav Katz

    Abstract: Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) be…

    Submitted 31 October, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

    Comments: 7 pages, 2 figures; to be published at EMNLP 2022

  49. arXiv:2205.09178

    cs.CL cs.LG

    PreQuEL: Quality Estimation of Machine Translation Outputs in Advance

    Authors: Shachar Don-Yehiya, Leshem Choshen, Omri Abend

    Abstract: We present the task of PreQuEL, Pre-(Quality-Estimation) Learning. A PreQuEL system predicts how well a given sentence will be translated, without recourse to the actual translation, thus eschewing unnecessary resource allocation when translation quality is bound to be low. PreQuEL can be defined relative to a given MT system (e.g., some industry service) or generally relative to the state-of-the-…

    Submitted 4 December, 2022; v1 submitted 18 May, 2022; originally announced May 2022.

    Comments: Accepted to the main conference of EMNLP 2022

  50. arXiv:2205.05730

    cs.CL cs.AI cs.CY

    Some Grammatical Errors are Frequent, Others are Important

    Authors: Leshem Choshen, Ofir Shifman, Omri Abend

    Abstract: In Grammatical Error Correction, systems are evaluated by the number of errors they correct. However, no one has assessed whether all error types are equally important. We provide and apply a method to quantify the importance of different grammatical error types to humans. We show that some rare errors are considered disturbing while other common ones are not. This affects possible directions to i…

    Submitted 11 May, 2022; originally announced May 2022.