Showing 1–17 of 17 results for author: Shmueli-Scheuer, M

  1. arXiv:2505.19621  [pdf, ps, other]

    cs.AI cs.CL

    Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models

    Authors: George Kour, Itay Nakash, Ateret Anaby-Tavor, Michal Shmueli-Scheuer

    Abstract: As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially rein…

    Submitted 26 May, 2025; originally announced May 2025.

  2. arXiv:2503.16416  [pdf, other]

    cs.AI cs.CL cs.LG

    Survey on Evaluation of LLM-based Agents

    Authors: Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer

    Abstract: The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensio…

    Submitted 20 March, 2025; originally announced March 2025.

  3. arXiv:2503.01622  [pdf, ps, other]

    cs.CL

    DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation

    Authors: Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky

    Abstract: Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to pr…

    Submitted 3 June, 2025; v1 submitted 3 March, 2025; originally announced March 2025.

  4. arXiv:2502.19412  [pdf, other]

    cs.CL

    The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

    Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

    Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of…

    Submitted 2 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  5. arXiv:2407.18990  [pdf, other]

    cs.LG cs.AI cs.CL

    Stay Tuned: An Empirical Study of the Impact of Hyperparameters on LLM Tuning in Real-World Applications

    Authors: Alon Halfon, Shai Gretz, Ofir Arviv, Artem Spector, Orith Toledo-Ronen, Yoav Katz, Liat Ein-Dor, Michal Shmueli-Scheuer, Noam Slonim

    Abstract: Fine-tuning Large Language Models (LLMs) is an effective method to enhance their performance on downstream tasks. However, choosing the appropriate setting of tuning hyperparameters (HPs) is a labor-intensive and computationally expensive process. Here, we provide recommended HP configurations for practical use-cases that represent a better starting point for practitioners, when considering two SO…

    Submitted 7 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

  6. arXiv:2407.13696  [pdf, other]

    cs.CL

    Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

    Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank c…

    Submitted 12 September, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Under Review

  7. arXiv:2401.14019  [pdf, other]

    cs.CL cs.AI

    Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

    Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz

    Abstract: In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we p…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Submitted to NAACL demo track

  8. arXiv:2308.11696  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Efficient Benchmarking of Language Models

    Authors: Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present t…

    Submitted 1 April, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted to NAACL main track

  9. arXiv:2305.15040  [pdf, other]

    cs.CL

    Active Learning for Natural Language Generation

    Authors: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor

    Abstract: The field of Natural Language Generation (NLG) suffers from a severe shortage of labeled data due to the extremely expensive and time-consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. Howe…

    Submitted 17 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP2023 as a long paper

  10. arXiv:2211.04417  [pdf, other]

    cs.CL

    nBIIG: A Neural BI Insights Generation System for Table Reporting

    Authors: Yotam Perlitz, Dafna Sheinwald, Noam Slonim, Michal Shmueli-Scheuer

    Abstract: We present nBIIG, a neural Business Intelligence (BI) Insights Generation system. Given a table, our system applies various analyses to create corresponding RDF representations, and then uses a neural model to generate fluent textual insights out of these representations. The generated insights can be used by an analyst, via a human-in-the-loop paradigm, to enhance the task of creating compelling…

    Submitted 8 November, 2022; originally announced November 2022.

    Comments: Accepted to AAAI-23

  11. arXiv:2205.10938  [pdf, other]

    cs.CL

    Diversity Enhanced Table-to-Text Generation via Type Control

    Authors: Yotam Perlitz, Liat Ein-Dor, Dafna Sheinwald, Noam Slonim, Michal Shmueli-Scheuer

    Abstract: Generating natural language statements to convey logical inferences from tabular data (i.e., Logical NLG) is a process with one input and a variety of valid outputs. This characteristic underscores the need for a method to produce a diverse set of valid outputs, presenting different perspectives of the input data. We propose a simple yet effective diversity-enhancing scheme that builds upon an inh…

    Submitted 30 May, 2023; v1 submitted 22 May, 2022; originally announced May 2022.

    Comments: 4 pages, 4 figures

  12. arXiv:2203.10940  [pdf, other]

    cs.CL

    Quality Controlled Paraphrase Generation

    Authors: Elron Bandel, Ranit Aharonov, Michal Shmueli-Scheuer, Ilya Shnayderman, Noam Slonim, Liat Ein-Dor

    Abstract: Paraphrase generation has been widely used in various downstream tasks. Most tasks benefit mainly from high quality paraphrases, namely those that are semantically similar to, yet linguistically diverse from, the original sentence. Generating high-quality paraphrases is challenging as it becomes increasingly hard to preserve meaning as linguistic diversity increases. Recent works achieve nice resu…

    Submitted 1 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Accepted as a long paper at ACL 2022

  13. arXiv:2110.03179  [pdf, other]

    cs.CL

    HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles

    Authors: Odellia Boni, Guy Feigenblat, Guy Lev, Michal Shmueli-Scheuer, Benjamin Sznajder, David Konopnicki

    Abstract: We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS), which targets the use-case of generating actionable instructions from a set of sources. This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios. We employed automatic method…

    Submitted 8 October, 2021; v1 submitted 7 October, 2021; originally announced October 2021.

    Comments: 8 pages, 4 figures, 5 tables. HowSumm dataset is publicly available at https://ibm.biz/BdfhzH

  14. arXiv:2009.01460  [pdf, ps, other]

    cs.CL

    orgFAQ: A New Dataset and Analysis on Organizational FAQs and User Questions

    Authors: Guy Lev, Michal Shmueli-Scheuer, Achiya Jerbi, David Konopnicki

    Abstract: Frequently Asked Questions (FAQ) webpages are created by organizations for their users. FAQs are used in several scenarios, e.g., to answer user questions. On the other hand, the content of FAQs is affected by user questions by definition. In order to promote research in this field, several FAQ datasets exist. However, we claim that being collected from community websites, they do not correctly re…

    Submitted 3 September, 2020; originally announced September 2020.

  15. arXiv:1908.11152  [pdf, other]

    cs.CL

    A Summarization System for Scientific Documents

    Authors: Shai Erera, Michal Shmueli-Scheuer, Guy Feigenblat, Ora Peled Nakash, Odellia Boni, Haggai Roitman, Doron Cohen, Bar Weiner, Yosi Mass, Or Rivlin, Guy Lev, Achiya Jerbi, Jonathan Herzig, Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, David Konopnicki

    Abstract: We present a novel system providing summaries for Computer Science publications. Through a qualitative user study, we identified the most valuable scenarios for discovery, exploration and understanding of scientific documents. Based on these findings, we built a system that retrieves and summarizes scientific documents for a given information need, either in the form of a free-text query or by choosin…

    Submitted 29 August, 2019; originally announced August 2019.

    Comments: Accepted to EMNLP 2019

  16. arXiv:1906.01351  [pdf, ps, other]

    cs.CL

    TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks

    Authors: Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, David Konopnicki

    Abstract: Currently, no large-scale training data is available for the task of scientific paper summarization. In this paper, we propose a novel method that automatically generates summaries for scientific papers, by utilizing videos of talks at scientific conferences. We hypothesize that such talks constitute a coherent and concise description of the papers' content, and can form the basis for good summari…

    Submitted 13 June, 2019; v1 submitted 4 June, 2019; originally announced June 2019.

    Comments: Accepted to ACL 2019

  17. arXiv:1711.05780  [pdf, ps, other]

    cs.CL

    Detecting Egregious Conversations between Customers and Virtual Agents

    Authors: Tommy Sandbank, Michal Shmueli-Scheuer, Jonathan Herzig, David Konopnicki, John Richards, David Piorkowski

    Abstract: Virtual agents are becoming a prominent channel of interaction in customer service. Not all customer interactions are smooth, however, and some can become almost comically bad. In such instances, a human agent might need to step in and salvage the conversation. Detecting bad conversations is important since disappointing customer service may threaten customer loyalty and impact revenue. In this pa…

    Submitted 16 April, 2018; v1 submitted 15 November, 2017; originally announced November 2017.

    Comments: NAACL 2018