-
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Authors:
Noy Sternlicht,
Ariel Gera,
Roy Bar-Haim,
Tom Hope,
Noam Slonim
Abstract:
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
Submitted 5 June, 2025;
originally announced June 2025.
-
An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation
Authors:
Matan Orbach,
Ohad Eytan,
Benjamin Sznajder,
Ariel Gera,
Odellia Boni,
Yoav Kantor,
Gal Bloch,
Omri Levy,
Hadas Abraham,
Nitzan Barzilay,
Eyal Shnarch,
Michael E. Factor,
Shila Ofek-Koifman,
Paula Ta-Shma,
Assaf Toledo
Abstract:
Finding the optimal Retrieval-Augmented Generation (RAG) configuration for a given use case can be complex and expensive. Motivated by this challenge, frameworks for RAG hyper-parameter optimization (HPO) have recently emerged, yet their effectiveness has not been rigorously benchmarked. To address this gap, we present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains, including a new one collected for this work on real-world product documentation. Our study explores the largest HPO search space considered to date, with three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets. For greedy HPO approaches, we show that optimizing model selection first is preferable to the prevalent practice of optimizing according to RAG pipeline order.
Submitted 10 June, 2025; v1 submitted 6 May, 2025;
originally announced May 2025.
-
WildIFEval: Instruction Following in the Wild
Authors:
Gili Lior,
Asaf Yehudai,
Ariel Gera,
Liat Ein-Dor
Abstract:
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints. Thus, we show that all models have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
Submitted 9 March, 2025;
originally announced March 2025.
-
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness
Authors:
Shir Ashury-Tahan,
Yifan Mai,
Rajmohan C,
Ariel Gera,
Yotam Perlitz,
Asaf Yehudai,
Elron Bandel,
Leshem Choshen,
Eyal Shnarch,
Percy Liang,
Michal Shmueli-Scheuer
Abstract:
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. Although no specific table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that table understanding and reasoning tasks remain a significant challenge.
Submitted 2 March, 2025; v1 submitted 26 February, 2025;
originally announced February 2025.
-
JuStRank: Benchmarking LLM Judges for System Ranking
Authors:
Ariel Gera,
Odellia Boni,
Yotam Perlitz,
Roy Bar-Haim,
Lilach Eden,
Asaf Yehudai
Abstract:
Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach first requires validating the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge's positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge's quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.
Submitted 10 June, 2025; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Authors:
Yotam Perlitz,
Ariel Gera,
Ofir Arviv,
Asaf Yehudai,
Elron Bandel,
Eyal Shnarch,
Michal Shmueli-Scheuer,
Leshem Choshen
Abstract:
Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is most commonly done via Benchmark Agreement Testing (BAT), where new benchmarks are validated against established ones using some agreement metric (e.g., rank correlation). Despite the crucial role of BAT for benchmark builders and consumers, there are no standardized procedures for such agreement testing. This deficiency can lead to invalid conclusions, fostering mistrust in benchmarks and upending the ability to properly choose the appropriate benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how some overlooked methodological choices can significantly influence BAT results, potentially undermining the validity of conclusions. To address these inconsistencies, we propose a set of best practices for BAT and demonstrate how utilizing these methodologies greatly improves BAT robustness and validity. To foster adoption and facilitate future research, we introduce BenchBench, a Python package for BAT, and release the BenchBench-leaderboard, a meta-benchmark designed to evaluate benchmarks using their peers. Our findings underscore the necessity for standardized BAT, ensuring the robustness and validity of benchmark evaluations in the evolving landscape of language model research.
BenchBench Package: github.com/IBM/BenchBench
Leaderboard: hf.co/spaces/IBM/BenchBench
Submitted 12 September, 2024; v1 submitted 18 July, 2024;
originally announced July 2024.
-
Label-Efficient Model Selection for Text Generation
Authors:
Shir Ashury-Tahan,
Ariel Gera,
Benjamin Sznajder,
Leshem Choshen,
Liat Ein-Dor,
Eyal Shnarch
Abstract:
Model selection for a given target task can be costly, as it may entail extensive annotation of the quality of outputs of different models. We introduce DiffUse, an efficient method to make an informed decision between candidate text generation models based on preference annotations. DiffUse reduces the required amount of annotations, thus saving valuable time and resources in performing evaluation. DiffUse intelligently selects instances by clustering embeddings that represent the semantic differences between model outputs. Thus, it is able to identify a subset of examples that are more informative for preference decisions. Our method is model-agnostic, and can be applied to any text generation model for selecting between models, prompts and configurations. Moreover, we propose a practical iterative approach for dynamically determining how many instances to annotate. In a series of experiments over hundreds of model pairs, we demonstrate that DiffUse can dramatically reduce the required number of annotations -- by up to 75% -- while maintaining high evaluation reliability.
Submitted 6 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI
Authors:
Elron Bandel,
Yotam Perlitz,
Elad Venezian,
Roni Friedman-Melamed,
Ofir Arviv,
Matan Orbach,
Shachar Don-Yehyia,
Dafna Sheinwald,
Ariel Gera,
Leshem Choshen,
Michal Shmueli-Scheuer,
Yoav Katz
Abstract:
In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!
Submitted 25 January, 2024;
originally announced January 2024.
-
Efficient Benchmarking of Language Models
Authors:
Yotam Perlitz,
Elron Bandel,
Ariel Gera,
Ofir Arviv,
Liat Ein-Dor,
Eyal Shnarch,
Noam Slonim,
Michal Shmueli-Scheuer,
Leshem Choshen
Abstract:
The increasing versatility of language models (LMs) has given rise to a new class of benchmarks that comprehensively assess a broad range of capabilities. Such benchmarks are associated with massive computational costs, extending to thousands of GPU hours per model. However, the efficiency aspect of these evaluation efforts has received little discussion in the literature. In this work, we present the problem of Efficient Benchmarking, namely, intelligently reducing the computation costs of LM evaluation without compromising reliability. Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off. We propose to evaluate the reliability of such decisions by using a new measure -- Decision Impact on Reliability, DIoR for short. We find, for example, that a benchmark leader may change by merely removing a low-ranked model from the benchmark, and observe that a correct benchmark ranking can be obtained by considering only a fraction of the evaluation examples. Based on our findings, we outline a set of concrete recommendations for efficient benchmark design and utilization practices. To take a step further, we use our findings to propose an evaluation algorithm that, when applied to the HELM benchmark, leads to dramatic cost savings with minimal loss of benchmark reliability, often reducing computation by 100x or more.
Submitted 1 April, 2024; v1 submitted 22 August, 2023;
originally announced August 2023.
-
Active Learning for Natural Language Generation
Authors:
Yotam Perlitz,
Ariel Gera,
Michal Shmueli-Scheuer,
Dafna Sheinwald,
Noam Slonim,
Liat Ein-Dor
Abstract:
The field of Natural Language Generation (NLG) suffers from a severe shortage of labeled data due to the extremely expensive and time-consuming process involved in manual annotation. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well-researched in the context of text classification, its application to NLG remains largely unexplored. In this paper, we present a first systematic study of active learning for NLG, considering a diverse set of tasks and multiple leading selection strategies, and harnessing a strong instruction-tuned model. Our results indicate that the performance of existing AL strategies is inconsistent, surpassing the baseline of random example selection in some cases but not in others. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to generation tasks.
Submitted 17 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers
Authors:
Ariel Gera,
Roni Friedman,
Ofir Arviv,
Chulaka Gunasekara,
Benjamin Sznajder,
Noam Slonim,
Eyal Shnarch
Abstract:
Applying language models to natural language processing tasks typically relies on the representations in the final model layer, as intermediate hidden layer representations are presumed to be less informative. In this work, we argue that due to the gradual improvement across model layers, additional information can be gleaned from the contrast between higher and lower layers during inference. Specifically, in choosing between the probable next token predictions of a generative model, the predictions of lower layers can be used to highlight which candidates are best avoided. We propose a novel approach that utilizes the contrast between layers to improve text generation outputs, and show that it mitigates degenerative behaviors of the model in open-ended generation, significantly improving the quality of generated texts. Furthermore, our results indicate that contrasting between model layers at inference time can yield substantial benefits to certain aspects of general language model capabilities, more effectively extracting knowledge during inference from a given set of model parameters.
Submitted 2 May, 2023;
originally announced May 2023.
-
Zero-Shot Text Classification with Self-Training
Authors:
Ariel Gera,
Alon Halfon,
Eyal Shnarch,
Yotam Perlitz,
Liat Ein-Dor,
Noam Slonim
Abstract:
Recent advances in large pretrained language models have increased attention to zero-shot text classification. In particular, models finetuned on natural language inference datasets have been widely adopted as zero-shot classifiers due to their promising results and off-the-shelf availability. However, the fact that such models are unfamiliar with the target task can lead to instability and performance issues. We propose a plug-and-play method to bridge this gap using a simple self-training approach, requiring only the class names along with an unlabeled dataset, and without the need for domain expertise or trial and error. We show that fine-tuning the zero-shot classifier on its most confident predictions leads to significant performance gains across a wide range of text classification tasks, presumably since self-training adapts the zero-shot model to the task at hand.
Submitted 31 October, 2022;
originally announced October 2022.
-
Label Sleuth: From Unlabeled Text to a Classifier in a Few Hours
Authors:
Eyal Shnarch,
Alon Halfon,
Ariel Gera,
Marina Danilevsky,
Yannis Katsis,
Leshem Choshen,
Martin Santillan Cooper,
Dina Epelboim,
Zheng Zhang,
Dakuo Wang,
Lucy Yip,
Liat Ein-Dor,
Lena Dankin,
Ilya Shnayderman,
Ranit Aharonov,
Yunyao Li,
Naftali Liberman,
Philip Levin Slesarev,
Gwilym Newton,
Shila Ofek-Koifman,
Noam Slonim,
Yoav Katz
Abstract:
Text classification can be useful in many real-world scenarios, saving a lot of time for end users. However, building a custom classifier typically requires coding skills and ML knowledge, which poses a significant barrier for many potential users. To lift this barrier, we introduce Label Sleuth, a free open source system for labeling and creating text classifiers. This system is unique for (a) being a no-code system, making NLP accessible to non-experts, (b) guiding users through the entire labeling process until they obtain a custom classifier, making the process efficient -- from cold start to classifier in a few hours, and (c) being open for configuration and extension by developers. By open sourcing Label Sleuth we hope to build a community of users and developers that will broaden the utilization of NLP models.
Submitted 31 October, 2022; v1 submitted 2 August, 2022;
originally announced August 2022.
-
Cluster & Tune: Boost Cold Start Performance in Text Classification
Authors:
Eyal Shnarch,
Ariel Gera,
Alon Halfon,
Lena Dankin,
Leshem Choshen,
Ranit Aharonov,
Noam Slonim
Abstract:
In real-world scenarios, a text classification task often begins with a cold start, when labeled data is scarce. In such cases, the common practice of fine-tuning pre-trained models, such as BERT, for a target classification task, is prone to produce poor performance. We suggest a method to boost the performance of such models by adding an intermediate unsupervised classification task, between the pre-training and fine-tuning phases. As such an intermediate task, we perform clustering and train the pre-trained model on predicting the cluster labels. We test this hypothesis on various data sets, and show that this additional classification phase can significantly improve performance, mainly for topical classification tasks, when the number of labeled instances available for fine-tuning is only a couple of dozen to a few hundred.
Submitted 20 March, 2022;
originally announced March 2022.
-
Financial Event Extraction Using Wikipedia-Based Weak Supervision
Authors:
Liat Ein-Dor,
Ariel Gera,
Orith Toledo-Ronen,
Alon Halfon,
Benjamin Sznajder,
Lena Dankin,
Yonatan Bilu,
Yoav Katz,
Noam Slonim
Abstract:
Extraction of financial and economic events from text has previously been done mostly using rule-based methods, with more recent works employing machine learning techniques. This work is in line with this latter approach, leveraging relevant Wikipedia sections to extract weak labels for sentences describing economic events. Whereas previous weakly supervised approaches required a knowledge-base of such events, or corresponding financial figures, our approach requires no such additional data, and can be employed to extract economic events related to companies which are not even mentioned in the training data.
Submitted 28 November, 2022; v1 submitted 25 November, 2019;
originally announced November 2019.
-
Corpus Wide Argument Mining -- a Working Solution
Authors:
Liat Ein-Dor,
Eyal Shnarch,
Lena Dankin,
Alon Halfon,
Benjamin Sznajder,
Ariel Gera,
Carlos Alzate,
Martin Gleize,
Leshem Choshen,
Yufang Hou,
Yonatan Bilu,
Ranit Aharonov,
Noam Slonim
Abstract:
One of the main tasks in argument mining is the retrieval of argumentative content pertaining to a given topic. Most previous work addressed this task by retrieving a relatively small number of relevant documents as the initial source for such content. This line of research yielded moderate success, which is of limited use in a real-world system. Furthermore, for such a system to yield a comprehensive set of relevant arguments, over a wide range of topics, it requires leveraging a large and diverse corpus in an appropriate manner. Here we present a first end-to-end high-precision, corpus-wide argument mining system. This is made possible by combining sentence-level queries over an appropriate indexing of a very large corpus of newspaper articles, with an iterative annotation scheme. This scheme addresses the inherent label bias in the data and pinpoints the regions of the sample space whose manual labeling is required to obtain high-precision among top-ranked candidates.
Submitted 25 November, 2019;
originally announced November 2019.
-
A Dataset of General-Purpose Rebuttal
Authors:
Matan Orbach,
Yonatan Bilu,
Ariel Gera,
Yoav Kantor,
Lena Dankin,
Tamar Lavee,
Lili Kotlerman,
Shachar Mirkin,
Michal Jacovi,
Ranit Aharonov,
Noam Slonim
Abstract:
In Natural Language Understanding, the task of response generation is usually focused on responses to short texts, such as tweets or a turn in a dialog. Here we present a novel task of producing a critical response to a long argumentative text, and suggest a method based on general rebuttal arguments to address it. We do this in the context of the recently-suggested task of listening comprehension over argumentative content: given a speech on some specified topic, and a list of relevant arguments, the goal is to determine which of the arguments appear in the speech. The general rebuttals we describe here (written in English) overcome the need for topic-specific arguments to be provided, by proving to be applicable for a large set of topics. This allows creating responses beyond the scope of topics for which specific arguments are available. All data collected during this work is freely available for research.
Submitted 1 September, 2019;
originally announced September 2019.
-
Argument Invention from First Principles
Authors:
Yonatan Bilu,
Ariel Gera,
Daniel Hershcovich,
Benjamin Sznajder,
Dan Lahav,
Guy Moshkowich,
Anael Malet,
Assaf Gavron,
Noam Slonim
Abstract:
Competitive debaters often find themselves facing a challenging task -- how to debate a topic they know very little about, with only minutes to prepare, and without access to books or the Internet? What they often do is rely on "first principles", commonplace arguments which are relevant to many topics, and which they have refined in past debates.
In this work we aim to explicitly define a taxonomy of such principled recurring arguments, and, given a controversial topic, to automatically identify which of these arguments are relevant to the topic.
As far as we know, this is the first time that this approach to argument invention is formalized and made explicit in the context of NLP.
The main goal of this work is to show that it is possible to define such a taxonomy. While the taxonomy suggested here should be thought of as a "first attempt", it is nonetheless coherent, covers the relevant topics well, coincides with what professional debaters actually argue in their speeches, and facilitates automatic argument invention for new topics.
Submitted 22 August, 2019;
originally announced August 2019.
-
Controversy in Context
Authors:
Benjamin Sznajder,
Ariel Gera,
Yonatan Bilu,
Dafna Sheinwald,
Ella Rabinovich,
Ranit Aharonov,
David Konopnicki,
Noam Slonim
Abstract:
With the growing interest in social applications of Natural Language Processing and Computational Argumentation, a natural question is how controversial a given concept is. Prior works relied on Wikipedia's metadata and on content analysis of the articles pertaining to a concept in question. Here we show that the immediate textual context of a concept is strongly indicative of this property, and, using simple and language-independent machine-learning tools, we leverage this observation to achieve state-of-the-art results in controversiality prediction. In addition, we analyze and make available a new dataset of concepts labeled for controversiality. It is significantly larger than existing datasets, and grades concepts on a 0-10 scale, rather than treating controversiality as a binary label.
Submitted 20 August, 2019;
originally announced August 2019.