Showing 1–19 of 19 results for author: Gera, A

  1. arXiv:2506.05062  [pdf, ps, other]

    cs.CL

    Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation

    Authors: Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim

    Abstract: We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have…

    Submitted 5 June, 2025; originally announced June 2025.

    Comments: Code: https://github.com/noy-sternlicht/Debatable-Intelligence

  2. arXiv:2505.03452  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation

    Authors: Matan Orbach, Ohad Eytan, Benjamin Sznajder, Ariel Gera, Odellia Boni, Yoav Kantor, Gal Bloch, Omri Levy, Hadas Abraham, Nitzan Barzilay, Eyal Shnarch, Michael E. Factor, Shila Ofek-Koifman, Paula Ta-Shma, Assaf Toledo

    Abstract: Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse d…

    Submitted 10 June, 2025; v1 submitted 6 May, 2025; originally announced May 2025.

  3. arXiv:2503.06573  [pdf, other]

    cs.CL cs.AI

    WildIFEval: Instruction Following in the Wild

    Authors: Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor

    Abstract: Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, in natura…

    Submitted 9 March, 2025; originally announced March 2025.

  4. arXiv:2502.19412  [pdf, other]

    cs.CL

    The Mighty ToRR: A Benchmark for Table Reasoning and Robustness

    Authors: Shir Ashury-Tahan, Yifan Mai, Rajmohan C, Ariel Gera, Yotam Perlitz, Asaf Yehudai, Elron Bandel, Leshem Choshen, Eyal Shnarch, Percy Liang, Michal Shmueli-Scheuer

    Abstract: Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of…

    Submitted 2 March, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  5. arXiv:2412.09569  [pdf, ps, other]

    cs.CL cs.AI cs.LG

    JuStRank: Benchmarking LLM Judges for System Ranking

    Authors: Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai

    Abstract: Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused o…

    Submitted 10 June, 2025; v1 submitted 12 December, 2024; originally announced December 2024.

    Comments: ACL 2025

  6. arXiv:2407.13696  [pdf, other]

    cs.CL

    Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

    Authors: Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank c…

    Submitted 12 September, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: Under Review

  7. arXiv:2402.07891  [pdf, other]

    cs.CL cs.LG

    Label-Efficient Model Selection for Text Generation

    Authors: Shir Ashury-Tahan, Ariel Gera, Benjamin Sznajder, Leshem Choshen, Liat Ein-Dor, Eyal Shnarch

    Abstract: Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluatio…

    Submitted 6 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: Accepted to ACL (main conference)

  8. arXiv:2401.14019  [pdf, other]

    cs.CL cs.AI

    Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

    Authors: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz

    Abstract: In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we p…

    Submitted 25 January, 2024; originally announced January 2024.

    Comments: Submitted to NAACL demo track

  9. arXiv:2308.11696  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Efficient Benchmarking of Language Models

    Authors: Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, Leshem Choshen

    Abstract: The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts had raised little discussion in the literature. In this work, we present t…

    Submitted 1 April, 2024; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: Accepted to NAACL main track

  10. arXiv:2305.15040  [pdf, other]

    cs.CL

    Active Learning for Natural Language Generation

    Authors: Yotam Perlitz, Ariel Gera, Michal Shmueli-Scheuer, Dafna Sheinwald, Noam Slonim, Liat Ein-Dor

    Abstract: The field of Natural Language Generation (NLG) suffers from a severe shortage of labeled data due to the extremely expensive and time-consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. Howe…

    Submitted 17 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Accepted to EMNLP2023 as a long paper

  11. arXiv:2305.01628  [pdf, other]

    cs.CL cs.LG

    The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

    Authors: Ariel Gera, Roni Friedman, Ofir Arviv, Chulaka Gunasekara, Benjamin Sznajder, Noam Slonim, Eyal Shnarch

    Abstract: Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Spec…

    Submitted 2 May, 2023; originally announced May 2023.

    Comments: 9 pages, 8 figures; To be published in ACL 2023

  12. arXiv:2210.17541  [pdf, other]

    cs.CL cs.LG

    Zero-Shot Text Classification with Self-Training

    Authors: Ariel Gera, Alon Halfon, Eyal Shnarch, Yotam Perlitz, Liat Ein-Dor, Noam Slonim

    Abstract: Recent advances in large pretrained language models have increased attention to zero-shot text classification. In particular, models finetuned on natural language inference datasets have been widely adopted as zero-shot classifiers due to their promising results and off-the-shelf availability. However, the fact that such models are unfamiliar with the target task can lead to instability and perfor…

    Submitted 31 October, 2022; originally announced October 2022.

    Comments: 9 pages, 5 figures; To be published in EMNLP 2022

  13. arXiv:2208.01483  [pdf, other]

    cs.CL cs.HC

    Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours

    Authors: Eyal Shnarch, Alon Halfon, Ariel Gera, Marina Danilevsky, Yannis Katsis, Leshem Choshen, Martin Santillan Cooper, Dina Epelboim, Zheng Zhang, Dakuo Wang, Lucy Yip, Liat Ein-Dor, Lena Dankin, Ilya Shnayderman, Ranit Aharonov, Yunyao Li, Naftali Liberman, Philip Levin Slesarev, Gwilym Newton, Shila Ofek-Koifman, Noam Slonim, Yoav Katz

    Abstract: Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) be…

    Submitted 31 October, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

    Comments: 7 pages, 2 figures; To be published at EMNLP 2022

  14. arXiv:2203.10581  [pdf, other]

    cs.CL cs.LG

    Cluster & Tune: Boost Cold Start Performance in Text Classification

    Authors: Eyal Shnarch, Ariel Gera, Alon Halfon, Lena Dankin, Leshem Choshen, Ranit Aharonov, Noam Slonim

    Abstract: In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the…

    Submitted 20 March, 2022; originally announced March 2022.

    Comments: 9 pages, 6 figures; To be published in ACL 2022

  15. arXiv:1911.10783  [pdf, other]

    cs.CL

    Financial Event Extraction Using Wikipedia-Based Weak Supervision

    Authors: Liat Ein-Dor, Ariel Gera, Orith Toledo-Ronen, Alon Halfon, Benjamin Sznajder, Lena Dankin, Yonatan Bilu, Yoav Katz, Noam Slonim

    Abstract: Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of…

    Submitted 28 November, 2022; v1 submitted 25 November, 2019; originally announced November 2019.

  16. arXiv:1911.10763  [pdf, other]

    cs.CL cs.AI cs.IR

    Corpus Wide Argument Mining -- a Working Solution

    Authors: Liat Ein-Dor, Eyal Shnarch, Lena Dankin, Alon Halfon, Benjamin Sznajder, Ariel Gera, Carlos Alzate, Martin Gleize, Leshem Choshen, Yufang Hou, Yonatan Bilu, Ranit Aharonov, Noam Slonim

    Abstract: One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehen…

    Submitted 25 November, 2019; originally announced November 2019.

    Journal ref: AAAI 2020

  17. arXiv:1909.00393  [pdf, other]

    cs.CL cs.AI cs.LG

    A Dataset of General-Purpose Rebuttal

    Authors: Matan Orbach, Yonatan Bilu, Ariel Gera, Yoav Kantor, Lena Dankin, Tamar Lavee, Lili Kotlerman, Shachar Mirkin, Michal Jacovi, Ranit Aharonov, Noam Slonim

    Abstract: In Natural Language Understanding, the task of response generation is usually focused on responses to short texts, such as tweets or a turn in a dialog. Here we present a novel task of producing a critical response to a long argumentative text, and suggest a method based on general rebuttal arguments to address it. We do this in the context of the recently-suggested task of listening comprehension…

    Submitted 1 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019

  18. arXiv:1908.08336  [pdf, other]

    cs.CL

    Argument Invention from First Principles

    Authors: Yonatan Bilu, Ariel Gera, Daniel Hershcovich, Benjamin Sznajder, Dan Lahav, Guy Moshkowich, Anael Malet, Assaf Gavron, Noam Slonim

    Abstract: Competitive debaters often find themselves facing a challenging task -- how to debate a topic they know very little about, with only minutes to prepare, and without access to books or the Internet? What they often do is rely on "first principles", commonplace arguments which are relevant to many topics, and which they have refined in past debates. In this work we aim to explicitly define a taxon…

    Submitted 22 August, 2019; originally announced August 2019.

    Comments: Presented at ACL 2019

  19. arXiv:1908.07491  [pdf, ps, other]

    cs.CL

    Controversy in Context

    Authors: Benjamin Sznajder, Ariel Gera, Yonatan Bilu, Dafna Sheinwald, Ella Rabinovich, Ranit Aharonov, David Konopnicki, Noam Slonim

    Abstract: With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and,…

    Submitted 20 August, 2019; originally announced August 2019.

    Comments: 5 pages