Search | arXiv e-print repository

DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search-Augmented LLMs

Authors: Arie Cattan, Alon Jacovi, Ori Ram, Jonathan Herzig, Roee Aharoni, Sasha Goldshtein, Eran Ofek, Idan Szpektor, Avi Caciularu

Abstract: Retrieval Augmented Generation (RAG) is a commonly used approach for enhancing large language models (LLMs) with relevant and up-to-date information. However, the retrieved sources can often contain conflicting information and it remains unclear how models should address such discrepancies. In this work, we first propose a novel taxonomy of knowledge conflict types in RAG, along with the desired model behavior for each type. We then introduce CONFLICTS, a high-quality benchmark with expert annotations of conflict types in a realistic RAG setting. CONFLICTS is the first benchmark that enables tracking progress on how models address a wide range of knowledge conflicts. We conduct extensive experiments on this benchmark, showing that LLMs often struggle to appropriately resolve conflicts between sources. While prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses, substantial room for improvement in future research remains.

Submitted 10 June, 2025; originally announced June 2025.

arXiv:2410.15466 [pdf, other]

Keep Guessing? When Considering Inference Scaling, Mind the Baselines

Authors: Gal Yona, Or Honovich, Omer Levy, Roee Aharoni

Abstract: Scaling inference compute in large language models (LLMs) through repeated sampling consistently increases the coverage (fraction of problems solved) as the number of samples increases. We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers. To test this conjecture, we define a baseline that enumerates answers according to their prevalence in the training set. Experiments spanning two domains -- mathematical reasoning and factual knowledge -- reveal that this baseline outperforms repeated model sampling for some LLMs, while the coverage for others is on par with that of a mixture strategy that obtains $k$ answers by using only $10$ model samples and similarly guessing the remaining $k-10$ attempts via enumeration. Our baseline enables a more accurate measurement of how much repeated sampling improves coverage in such settings beyond prompt-agnostic guessing.

Submitted 20 October, 2024; originally announced October 2024.
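
The prevalence baseline described in the abstract above can be illustrated with a minimal Python sketch. The function names (`enumeration_baseline`, `coverage_at_k`) and the toy data are hypothetical and not taken from the paper; the sketch assumes exact-match scoring and that the baseline guesses the same k most frequent training answers for every problem, regardless of the prompt.

```python
from collections import Counter


def enumeration_baseline(train_answers, k):
    """Return the k most frequent training answers (prompt-agnostic guesses)."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


def coverage_at_k(guesses_per_problem, gold_answers):
    """Fraction of problems whose gold answer appears among that problem's guesses."""
    solved = sum(
        gold in guesses
        for guesses, gold in zip(guesses_per_problem, gold_answers)
    )
    return solved / len(gold_answers)


# Toy data (hypothetical): answers seen in training and gold answers at test time.
train_answers = ["2", "0", "1", "2", "1", "2", "4"]
test_gold = ["2", "1", "7", "0"]

# The baseline ignores the prompt, so every problem receives the same guesses.
guesses = enumeration_baseline(train_answers, k=2)
coverage = coverage_at_k([guesses] * len(test_gold), test_gold)
```

Comparing this number against the coverage obtained from k model samples gives a rough sense of how much of the gain from repeated sampling exceeds prompt-agnostic guessing.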

arXiv:2410.07473 [pdf, other]

Localizing Factual Inconsistencies in Attributable Text Generation

Authors: Arie Cattan, Paul Roit, Shiyue Zhang, David Wan, Roee Aharoni, Idan Szpektor, Mohit Bansal, Ido Dagan

Abstract: There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement ($\kappa > 0.7$). Then, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and open-source LLMs.

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2408.10646 [pdf, other]

Beneath the Surface of Consistency: Exploring Cross-lingual Knowledge Representation Sharing in LLMs

Authors: Maxim Ifergan, Leshem Choshen, Roee Aharoni, Idan Szpektor, Omri Abend

Abstract: The veracity of a factoid is largely independent of the language it is written in. However, language models are inconsistent in their ability to answer the same factual question across languages. This raises questions about how LLMs represent a given fact across languages. We explore multilingual factual knowledge through two aspects: the model's ability to answer a query consistently across languages, and the ability to "store" answers in a shared representation for several languages. We propose a methodology to measure the extent of representation sharing across languages by repurposing knowledge editing methods. We examine LLMs with various multilingual configurations using a new multilingual dataset. We reveal that high consistency does not necessarily imply shared representation, particularly for languages with different scripts. Moreover, we find that script similarity is a dominant factor in representation sharing. Finally, we observe that if LLMs could fully share knowledge across languages, their accuracy in their best-performing language could increase by up to 150% on average. These findings highlight the need for improved multilingual knowledge representation in LLMs and suggest a path for the development of more robust and consistent multilingual LLMs.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2407.08789 [pdf, ps, other]

Coloring, list coloring, and fractional coloring in intersections of matroids

Authors: Ron Aharoni, Eli Berger, He Guo, Dani Kotlar

Abstract: It is known that in matroids the difference between the chromatic number and the fractional chromatic number is smaller than 1, and that the list chromatic number is equal to the chromatic number. We investigate the gap within these pairs of parameters for hypergraphs that are the intersection of a given number k of matroids. We prove that in such hypergraphs the list chromatic number is at most k times the chromatic number and at most 2k-1 times the maximum chromatic number among the k matroids. We study the relationship between three polytopes associated with k-sets of matroids, and connect them to bounds on the fractional chromatic number of the intersection of the members of the k-set. This also connects to bounds on the matroidal matching and covering number of the intersection of the members of the k-set. The tools used are in part topological.

Submitted 23 April, 2025; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: 29 pages; revised version

MSC Class: 05B35; 05C15; 05C10; 57M15; 52B40; 05C70; 90C27; 05C72

arXiv:2406.13632 [pdf, other]

Can Few-shot Work in Long-Context? Recycling the Context to Generate Demonstrations

Authors: Arie Cattan, Alon Jacovi, Alex Fabrikant, Jonathan Herzig, Roee Aharoni, Hannah Rashkin, Dror Marcus, Avinatan Hassidim, Yossi Matias, Idan Szpektor, Avi Caciularu

Abstract: Despite recent advancements in Large Language Models (LLMs), their performance on tasks involving long contexts remains sub-optimal. In-Context Learning (ICL) with few-shot examples may be an appealing solution to enhance LLM performance in this scenario; however, naïvely adding ICL examples with long context introduces challenges, including substantial token overhead added for each few-shot example and context mismatch between the demonstrations and the target query. In this work, we propose to automatically generate few-shot examples for long context QA tasks by recycling contexts. Specifically, given a long input context (1-3k tokens) and a query, we generate additional query-output pairs from the given context as few-shot examples, while introducing the context only once. This ensures that the demonstrations are leveraging the same context as the target query while only adding a small number of tokens to the prompt. We further enhance each demonstration by instructing the model to explicitly identify the relevant paragraphs before the answer, which improves performance while providing fine-grained attribution to the answer source. We apply our method on multiple LLMs and obtain substantial improvements (+16 absolute points on average across models) on various QA datasets with long context, especially when the answer lies within the middle of the context. Surprisingly, despite introducing only single-hop ICL examples, LLMs also successfully generalize to multi-hop long-context QA using our approach.

Submitted 18 October, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

arXiv:2405.16908 [pdf, other]

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?

Authors: Gal Yona, Roee Aharoni, Mor Geva

Abstract: We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if the LLM is equally likely to output two contradicting answers to the same question, then its generated response should reflect this uncertainty by hedging its answer (e.g., "I'm not sure, but I think..."). We formalize faithful response uncertainty based on the gap between the model's intrinsic confidence in the assertions it makes and the decisiveness by which they are conveyed. This example-level metric reliably indicates whether the model reflects its uncertainty, as it penalizes both excessive and insufficient hedging. We evaluate a variety of aligned LLMs at faithfully communicating uncertainty on several knowledge-intensive question answering tasks. Our results provide strong evidence that modern LLMs are poor at faithfully conveying their uncertainty, and that better alignment is necessary to improve their trustworthiness.

Submitted 26 September, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Comments: To appear in EMNLP 2024 (main conference)

arXiv:2405.05904 [pdf, other]

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Authors: Zorik Gekhman, Gal Yona, Roee Aharoni, Matan Eyal, Amir Feder, Roi Reichart, Jonathan Herzig

Abstract: When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.

Submitted 1 October, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: Accepted as a long paper at EMNLP 2024

arXiv:2402.09631 [pdf, ps, other]

Representation Surgery: Theory and Practice of Affine Steering

Authors: Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru

Abstract: Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.

Submitted 4 June, 2025; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted in ICML 2024

arXiv:2402.00559 [pdf, other]

A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains

Authors: Alon Jacovi, Yonatan Bitton, Bernd Bohnet, Jonathan Herzig, Or Honovich, Michael Tseng, Michael Collins, Roee Aharoni, Mor Geva

Abstract: Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains - in particular, verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/.

Submitted 21 May, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

Comments: Accepted to ACL 2024

arXiv:2401.04695 [pdf, other]

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers

Authors: Gal Yona, Roee Aharoni, Mor Geva

Abstract: Factual questions typically can be answered correctly at different levels of granularity. For example, both "August 4, 1961" and "1961" are correct answers to the question "When was Barack Obama born?". Standard question answering (QA) evaluation protocols, however, do not explicitly take this into account and compare a predicted answer against answers of a single granularity level. In this work, we propose GRANOLA QA, a novel evaluation setting where a predicted answer is evaluated in terms of accuracy and informativeness against a set of multi-granularity answers. We present a simple methodology for enriching existing datasets with multi-granularity answers, and create GRANOLA-EQ, a multi-granularity version of the EntityQuestions dataset. We evaluate a range of decoding methods on GRANOLA-EQ, including a new algorithm, called Decoding with Response Aggregation (DRAG), that is geared towards aligning the response granularity with the model's uncertainty. Our experiments show that large language models with standard decoding tend to generate specific answers, which are often incorrect. In contrast, when evaluated on multi-granularity answers, DRAG yields a nearly 20 point increase in accuracy on average, which further increases for rare entities. Overall, this reveals that standard evaluation and decoding schemes may significantly underestimate the knowledge encapsulated in LMs.

Submitted 1 August, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

Comments: To appear in ACL 2024 Main Conference

arXiv:2401.01854 [pdf, other]

Multilingual Instruction Tuning With Just a Pinch of Multilinguality

Authors: Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, Matan Eyal

Abstract: As instruction-tuned large language models (LLMs) gain global adoption, their ability to follow instructions in multiple languages becomes increasingly crucial. In this work, we investigate how multilinguality during instruction tuning of a multilingual LLM affects instruction-following across languages from the pre-training corpus. We first show that many languages transfer some instruction-following capabilities to other languages from even monolingual tuning. Furthermore, we find that only 40 multilingual examples integrated in an English tuning set substantially improve multilingual instruction-following, both in seen and unseen languages during tuning. In general, we observe that models tuned on multilingual mixtures exhibit comparable or superior performance in multiple languages compared to monolingually tuned models, despite training on 10x fewer examples in those languages. Finally, we find that diversifying the instruction tuning set with even just 2-4 languages significantly improves cross-lingual generalization. Our results suggest that building massively multilingual instruction-tuned models can be done with only a very small set of multilingual instruction-responses.

Submitted 21 May, 2024; v1 submitted 3 January, 2024; originally announced January 2024.

Comments: Findings of ACL 2024

arXiv:2311.17670 [pdf, ps, other]

2-covers of wide Young diagrams

Authors: Ron Aharoni, Eli Berger, He Guo, Daniel Kotlar

Abstract: A Young diagram $Y$ is called wide if every sub-diagram $Z$ formed by a subset of the rows of $Y$ dominates $Z'$, the conjugate of $Z$. A Young diagram $Y$ is called Latin if its squares can be assigned numbers so that for each $i$, the $i$th row is filled injectively with the numbers $1, \ldots ,a_i$, where $a_i$ is the length of $i$th row of $Y$, and every column is also filled injectively. A conjecture of Chow and Taylor, publicized by Chow, Fan, Goemans, and Vondrak is that a wide Young diagram is Latin. We prove a dual version of the conjecture.

Submitted 11 December, 2023; v1 submitted 29 November, 2023; originally announced November 2023.

Comments: 17 pages; Added a few more questions and a reference

MSC Class: 05A17; 05C65; 05C70; 05D15
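
The definition of a wide Young diagram given in the abstract above can be checked by brute force on small examples. The following Python sketch is illustrative only: the function names are hypothetical, a diagram is represented by its list of row lengths, and "dominates" is assumed to mean the standard dominance order on partitions (every prefix sum of one is at least the corresponding prefix sum of the other).

```python
from itertools import accumulate, combinations


def conjugate(rows):
    """Column lengths of the diagram whose row lengths are `rows` (non-increasing)."""
    width = rows[0] if rows else 0
    return [sum(1 for r in rows if r >= j) for j in range(1, width + 1)]


def dominates(a, b):
    """True if partition a dominates b: each prefix sum of a is >= that of b."""
    n = max(len(a), len(b))
    a = list(a) + [0] * (n - len(a))
    b = list(b) + [0] * (n - len(b))
    return all(x >= y for x, y in zip(accumulate(a), accumulate(b)))


def is_wide(rows):
    """Check every non-empty subset of rows; exponential, for small diagrams only."""
    rows = sorted(rows, reverse=True)
    return all(
        dominates(sub, conjugate(sub))
        for size in range(1, len(rows) + 1)
        for sub in combinations(rows, size)
    )
```

Under these assumptions, `is_wide([3, 1])` holds, while `is_wide([2, 2, 2])` fails because the whole diagram has conjugate (3, 3), whose first part exceeds 2.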

arXiv:2310.10062 [pdf, other]

A Comprehensive Evaluation of Tool-Assisted Generation Strategies

Authors: Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva

Abstract: A growing area of research investigates augmenting language models with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive to tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that *refine* incorrect outputs with tools outperform strategies that retrieve relevant information *ahead of* or *during generation*; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their *benefits* and *costs*.

Submitted 28 December, 2023; v1 submitted 16 October, 2023; originally announced October 2023.

Comments: Accepted to EMNLP 2023 Findings

arXiv:2309.03735 [pdf, ps, other]

Looms

Authors: Ron Aharoni, Eli Berger, Joseph Briggs, He Guo, Shira Zerbib

Abstract: A pair $(A,B)$ of hypergraphs is called orthogonal if $|a \cap b|=1$ for every pair of edges $a \in A$ and $b \in B$. An orthogonal pair of hypergraphs is called a loom if each of its two members is the set of minimum covers of the other. Looms appear naturally in the context of a conjecture of Gyárfás and Lehel on the covering number of cross-intersecting hypergraphs. We study their properties and ways of construction, and prove special cases of a conjecture that if true would imply the Gyárfás--Lehel conjecture.

Submitted 14 July, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: 20 pages; Minor revisions; Added a coauthor; To appear in Discrete Mathematics

MSC Class: 05C65; 05C35; 05C72; 05C76; 05D15
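
The two definitions in the abstract above (orthogonal pair, loom) can be tested directly on small hypergraphs. This Python sketch is a brute-force illustration with hypothetical function names; it assumes a hypergraph is given as a collection of vertex sets and that a "minimum cover" is a smallest-size vertex set meeting every edge.

```python
from itertools import combinations


def is_orthogonal(A, B):
    """True if every edge of A meets every edge of B in exactly one vertex."""
    return all(len(set(a) & set(b)) == 1 for a in A for b in B)


def minimum_covers(H):
    """All smallest vertex sets that intersect every edge of H (brute force)."""
    edges = [frozenset(e) for e in H]
    vertices = sorted(set().union(*edges))
    for size in range(1, len(vertices) + 1):
        covers = {
            frozenset(c)
            for c in combinations(vertices, size)
            if all(set(c) & e for e in edges)
        }
        if covers:
            return covers
    return set()


def is_loom(A, B):
    """Orthogonal pair in which each member is the set of minimum covers of the other."""
    A = {frozenset(a) for a in A}
    B = {frozenset(b) for b in B}
    return is_orthogonal(A, B) and minimum_covers(A) == B and minimum_covers(B) == A


# Rows and columns of a 2x2 grid: orthogonal, but the columns are only two of
# the four minimum covers of the rows, so the pair is not a loom.
rows = [{(0, 0), (0, 1)}, {(1, 0), (1, 1)}]
cols = [{(0, 0), (1, 0)}, {(0, 1), (1, 1)}]
```

For instance, a single edge {1, 2} paired with the two singletons {1} and {2} satisfies `is_loom`, whereas the grid rows and columns above satisfy only `is_orthogonal`.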

arXiv:2306.00186 [pdf, other]

Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback

Authors: Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor

Abstract: Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work, we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.

Submitted 31 May, 2023; originally announced June 2023.

Comments: ACL 2023

arXiv:2305.14332 [pdf, other]

Evaluating and Modeling Attribution for Cross-Lingual Question Answering

Authors: Benjamin Muller, John Wieting, Jonathan H. Clark, Tom Kwiatkowski, Sebastian Ruder, Livio Baldini Soares, Roee Aharoni, Jonathan Herzig, Xinyi Wang

Abstract: Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those that do not speak these languages. The leap forward in cross-lingual modeling quality offered by generative language models offers much promise, yet their raw generations often fall short in factuality. To improve trustworthiness in these systems, a promising direction is to attribute the answer to a retrieved source, possibly in a content-rich language different from the query. Our work is the first to study attribution for cross-lingual question answering. First, we collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. To our surprise, we find that a substantial portion of the answers is not attributable to any retrieved passages (up to 50% of answers exactly matching a gold reference) despite the system being able to attend directly to the retrieved text. Second, to address this poor attribution level, we experiment with a wide range of attribution detection techniques. We find that Natural Language Inference models and PaLM 2 fine-tuned on a very small amount of attribution data can accurately detect attribution. Based on these models, we improve the attribution level of a cross-lingual question-answering system. Overall, we show that current academic generative cross-lingual QA systems have substantial shortcomings in attribution and we build tooling to mitigate these issues.

Submitted 15 November, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Published as a long paper at EMNLP 2023

arXiv:2305.13194 [pdf, other]

SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation

Authors: Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, Vitaly Nikolaev, Thibault Sellam, Aditya Siddhant, Dipanjan Das, Ankur P. Parikh

Abstract: Reliable automatic evaluation of summarization systems is challenging due to the multifaceted and subjective nature of the task. This is especially the case for languages other than English, where human evaluations are scarce. In this work, we introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation. SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets. As a result of its size and scope, SEAHORSE can serve both as a benchmark to evaluate learnt metrics, as well as a large-scale resource for training such metrics. We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE (Honovich et al., 2022) and mFACE (Aharoni et al., 2022). We make the SEAHORSE dataset and metrics publicly available for future research on multilingual and multifaceted summarization evaluation.

Submitted 1 November, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

arXiv:2305.11171 [pdf, other]

TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models

Authors: Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor

Abstract: Factual consistency evaluation is often conducted using Natural Language Inference (NLI) models, yet these models exhibit limited success in evaluating summaries. Previous work improved such models with synthetic training data. However, the data is typically based on perturbed human-written summaries, which often differ in their characteristics from real model-generated summaries and have limited coverage of possible factual errors. Alternatively, large language models (LLMs) have recently shown promising results in directly evaluating generative tasks, but are too computationally expensive for practical use. Motivated by these limitations, we introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries using an LLM. Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature. Experiments on the TRUE benchmark show that a student model trained using our data substantially outperforms both the state-of-the-art model with similar capacity, and the LLM teacher. In a systematic study, we compare TrueTeacher to existing synthetic data generation methods and demonstrate its superiority and robustness to domain-shift. We also show that our method generalizes to multilingual scenarios. Lastly, we release our large-scale synthetic dataset (1.4M examples), generated using TrueTeacher, and a checkpoint trained on this data.

Submitted 18 October, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

Comments: Accepted as a long paper in EMNLP 2023

arXiv:2305.10400 [pdf, other]

What You See is What You Read? Improving Text-Image Alignment Evaluation

Authors: Michal Yarom, Yonatan Bitton, Soravit Changpinyo, Roee Aharoni, Jonathan Herzig, Oran Lang, Eran Ofek, Idan Szpektor

Abstract: Automatically determining whether a text and a corresponding image are semantically aligned is a significant challenge for vision-language models, with applications in generative text-to-image and image-to-text tasks. In this work, we study methods for automatic text-image alignment evaluation. We first introduce SeeTRUE: a comprehensive evaluation set, spanning multiple datasets from both text-to-image and image-to-text generation tasks, with human judgements for whether a given text-image pair is semantically aligned. We then describe two automatic methods to determine alignment: the first involving a pipeline based on question generation and visual question answering models, and the second employing an end-to-end classification approach by finetuning multimodal pretrained models. Both methods surpass prior approaches in various text-image alignment tasks, with significant improvements in challenging cases that involve complex composition or unnatural images. Finally, we demonstrate how our approaches can localize specific misalignments between an image and a given text, and how they can be used to automatically re-rank candidates in text-to-image generation.

Submitted 26 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted to NeurIPS 2023. Website: https://wysiwyr-itm.github.io/

arXiv:2305.07378 [pdf, other]

Surfacing Biases in Large Language Models using Contrastive Input Decoding

Authors: Gal Yona, Or Honovich, Itay Laish, Roee Aharoni

Abstract: Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when introducing a model with an input text and a perturbed, "contrastive" version of it, meaningful differences in the next-token predictions may not be revealed with standard decoding strategies. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm to generate text given two inputs, where the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in how the LM output differs for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and quantify the effect of different input perturbations.

Submitted 12 May, 2023; originally announced May 2023.

arXiv:2304.14318 [pdf, other]

q2d: Turning Questions into Dialogs to Teach Models How to Search

Authors: Yonatan Bitton, Shlomi Cohen-Ganor, Ido Hakimi, Yoad Lewenberg, Roee Aharoni, Enav Weinreb

Abstract: One of the exciting capabilities of recent language models for dialog is their ability to independently search for relevant information to ground a given dialog response. However, obtaining training data to teach models how to issue search queries is time and resource consuming. In this work, we propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions. We prompt a large language model (PaLM) to create conversational versions of question answering datasets, and use it to improve query generation models that communicate with external search APIs to ground dialog responses. Unlike previous approaches which relied on human-written dialogs with search queries, our method allows automatic generation of query-based grounded dialogs with better control and scale. Our experiments demonstrate that: (1) For query generation on the QReCC dataset, models trained on our synthetically-generated data achieve 90%--97% of the performance of models trained on the human-generated data; (2) We can successfully generate data for training dialog models in new domains without any existing dialog data, as demonstrated on the multi-hop MuSiQue and Bamboogle QA datasets; (3) We perform a thorough analysis of the generated dialogs showing that humans find them of high quality and struggle to distinguish them from human-written dialogs.

Submitted 26 December, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: Accepted to EMNLP 2023. Website: https://question2dialog.github.io/

arXiv:2301.10312 [pdf, ps, other]

Tight infinite matrices

Authors: Ron Aharoni, He Guo

Abstract: We give a simple proof of a recent result of Gollin and Joó: if a possibly infinite system of homogeneous linear equations $A\vec{x} = \vec{0}$, where $A = (a_{i, j})$ is an $I \times J$ matrix, has only the trivial solution, then there exists an injection $φ: J \to I$, such that $a_{φ(j), j} \neq 0$ for all $j \in J$.

Submitted 24 January, 2023; originally announced January 2023.

Comments: 7 pages

MSC Class: 15A06; 05C50; 05C63

arXiv:2212.10622 [pdf, other]

mFACE: Multilingual Summarization with Factual Consistency Evaluation

Authors: Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata

Abstract: Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets. Despite promising results, current models still suffer from generating factually inconsistent summaries, reducing their utility for real-world application. Several recent efforts attempt to address this by devising models that automatically detect factual inconsistencies in machine generated summaries. However, they focus exclusively on English, a language with abundant resources. In this work, we leverage factual consistency evaluation models to improve multilingual summarization. We explore two intuitive approaches to mitigate hallucinations based on the signal provided by a multilingual NLI model, namely data filtering and controlled generation. Experimental results in the 45 languages from the XLSum dataset show gains over strong baselines in both automatic and human evaluation.

Submitted 5 January, 2024; v1 submitted 20 December, 2022; originally announced December 2022.

Comments: 28 pages with links to released data

arXiv:2212.09682 [pdf, other]

Multilingual Sequence-to-Sequence Models for Hebrew NLP

Authors: Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty

Abstract: Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it does not cater well for sub-word prediction tasks, such as Named Entity Recognition, when considering the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for LLMs in the case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a specialized, morpheme-based, separately fine-tuned decoder. Using this approach, our experiments show substantial improvements over previously published results on existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.

Submitted 19 December, 2022; originally announced December 2022.

arXiv:2212.08037 [pdf, other]

Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Authors: Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Lierni Sestorain Saralegui, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan Das, Donald Metzler, Slav Petrov, Kellie Webster

Abstract: Large language models (LLMs) have shown impressive results while requiring little or no direct supervision. Further, there is mounting evidence that LLMs may have potential in information-seeking scenarios. We believe the ability of an LLM to attribute the text that it generates is likely to be crucial in this setting. We formulate and study Attributed QA as a key first step in the development of attributed LLMs. We propose a reproducible evaluation framework for the task and benchmark a broad set of architectures. We take human annotations as a gold standard and show that a correlated automatic metric is suitable for development. Our experimental work gives concrete answers to two key questions (How to measure attribution?, and How well do current state-of-the-art methods perform on attribution?), and gives some hints as to how to address a third (How to build LLMs with attribution?).

Submitted 10 February, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

arXiv:2211.05655 [pdf, other]

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

Authors: Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend

Abstract: Question answering models commonly have access to two sources of "knowledge" during inference time: (1) parametric knowledge - the factual knowledge encoded in the model weights, and (2) contextual knowledge - external knowledge (e.g., a Wikipedia passage) given to the model to generate a grounded answer. Having these two sources of knowledge entangled together is a core issue for generative QA models as it is unclear whether the answer stems from the given non-parametric knowledge or not. This unclarity has implications on issues of trust, interpretability and factuality. In this work, we propose a new paradigm in which QA models are trained to disentangle the two sources of knowledge. Using counterfactual data augmentation, we introduce a model that predicts two answers for a given question: one based on given contextual knowledge and one based on parametric knowledge. Our experiments on the Natural Questions dataset show that this approach improves the performance of QA models by making them more robust to knowledge conflicts between the two knowledge sources, while generating useful disentangled answers.

Submitted 10 November, 2022; originally announced November 2022.

Comments: 12 pages, 2 figures

arXiv:2206.02576 [pdf, ps, other]

Strongly maximal matchings and strongly minimal covers

Authors: Ron Aharoni

Abstract: This is a not-to-be-journal-published paper, aimed to serve as reference. It is a summary of the main ideas on the topic appearing in the title, and an opportunity to state correctly the main conjecture in the field.

Submitted 3 June, 2022; originally announced June 2022.

arXiv:2204.04991 [pdf, other]

TRUE: Re-evaluating Factual Consistency Evaluation

Authors: Or Honovich, Roee Aharoni, Jonathan Herzig, Hagai Taitelbaum, Doron Kukliansy, Vered Cohen, Thomas Scialom, Idan Szpektor, Avinatan Hassidim, Yossi Matias

Abstract: Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatic factual consistency evaluation may help alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and augmenting training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silo for a single task or dataset, slowing their adoption. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive survey and assessment of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results. We recommend those methods as a starting point for model and metric developers, and hope TRUE will foster progress towards even better evaluation methods.

Submitted 3 May, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: Accepted as a long paper to NAACL 2022 main conference

arXiv:2110.14332 [pdf, other]

doi 10.1007/s11856-023-2502-z

Rainbow cycles for families of matchings

Authors: Ron Aharoni, He Guo

Abstract: Given a graph $G$ and a coloring of its edges, a subgraph of $G$ is called rainbow if its edges have distinct colors. The rainbow girth of an edge coloring of $G$ is the minimum length of a rainbow cycle in $G$. A generalization of the famous Caccetta-Häggkvist conjecture, proposed by the first author, is that in any coloring of the edge set of an $n$-vertex graph by $n$ colors in which each color class is of size $k$, the rainbow girth is at most $\lceil \frac{n}{k} \rceil$. In the known examples for sharpness of this conjecture the color classes are stars, suggesting that when the color classes are matchings, the result may be improved. We show that the rainbow girth of $n$ matchings of size at least 2 is $O(\log n)$.

Submitted 24 September, 2024; v1 submitted 27 October, 2021; originally announced October 2021.

Comments: 5 pages; minor edits; to appear in Israel Journal of Mathematics

MSC Class: 05C35; 05D40

Journal ref: Israel Journal of Mathematics 256 (2023), 1--8

arXiv:2110.11183 [pdf, ps, other]

doi 10.1137/22M1529658

Non-uniform degrees and rainbow versions of the Caccetta-Häggkvist conjecture

Authors: Ron Aharoni, Eli Berger, Maria Chudnovsky, He Guo, Shira Zerbib

Abstract: The Caccetta-Häggkvist conjecture (denoted below CHC) states that the directed girth (the smallest length of a directed cycle) $dgirth(D)$ of a directed graph $D$ on $n$ vertices is at most $\lceil \frac{n}{δ^+(D)}\rceil$, where $δ^+(D)$ is the minimum out-degree of $D$. We consider a version involving all out-degrees, not merely the minimum one, and prove that if $D$ does not contain a sink, then $dgirth(D) \le 2 \sum_{v\in V(D)} \frac{1}{deg^+(v)+1}$. In the spirit of a generalization of the CHC to rainbow cycles in \cite{ADH2019}, this suggests the conjecture that given non-empty sets $F_1, \ldots,F_n$ of edges of $K_n$, there exists a rainbow cycle of length at most $2\sum_{1\le i \le n}\frac{1}{|F_i|+1}$. We prove a slightly stronger result when $1\le |F_i|\le 2$, thereby strengthening a result of DeVos et al. \cite{DDFGGHMM2021}. We prove a logarithmic bound on the rainbow girth in the case that the sets $F_i$ are triangles.

Submitted 7 October, 2022; v1 submitted 21 October, 2021; originally announced October 2021.

Journal ref: SIAM Journal on Discrete Mathematics 37 (2023), 1704--1714

arXiv:2107.12881 [pdf, ps, other]

Choice Functions

Authors: Ron Aharoni, Joseph Briggs

Abstract: This is a survey paper on rainbow sets (another name for "choice functions"). The main theme is the distinction between two types of choice functions: those having a large (in the sense of belonging to some specified filter, namely closed up set of sets) image, and those that have a large domain and small image, where "smallness" means belonging to some specified complex (a closed-down set). The paper contains some new results: (1) theorems on scrambled versions, in which the sets are re-shuffled before choosing the rainbow set, and (2) results on weighted and cooperative versions - to be defined below.

Submitted 27 July, 2021; originally announced July 2021.

Comments: 23 pages, survey paper

MSC Class: 05D15; 05C70; 05C69; 05B35; 05E45

arXiv:2104.08202 [pdf, other]

$Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering

Authors: Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, Omri Abend

Abstract: Neural knowledge-grounded generative models for dialogue often produce content that is factually inconsistent with the knowledge they rely on, making them unreliable and limiting their applicability. Inspired by recent work on evaluating factual consistency in abstractive summarization, we propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue using automatic question generation and question answering. Our metric, denoted $Q^2$, compares answer spans using natural language inference (NLI), instead of token-based matching as done in previous work. To foster proper evaluation, we curate a novel dataset of dialogue system outputs for the Wizard-of-Wikipedia dataset, manually annotated for factual consistency. We perform a thorough meta-evaluation of $Q^2$ against other metrics using this dataset and two others, where it consistently shows higher correlation with human judgements.

Submitted 9 September, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

Comments: Accepted to EMNLP 2021

arXiv:2012.14992 [pdf, ps, other]

Rainbow paths and large rainbow matchings

Authors: Ron Aharoni, Eli Berger, Maria Chudnovsky, Shira Zerbib

Abstract: A conjecture of the first two authors is that $n$ matchings of size $n$ in any graph have a rainbow matching of size $n-1$. We prove a lower bound of $\frac{2}{3}n-1$, improving on the trivial $\frac{1}{2}n$, and an analogous result for hypergraphs. For $\{C_3,C_5\}$-free graphs and for disjoint matchings we obtain a lower bound of $\frac{3n}{4}-O(1)$. We also discuss a conjecture on rainbow alternating paths, that if true would yield a lower bound of $n-\sqrt{2n}$. We prove the non-alternating (ordinary paths) version of this conjecture.

Submitted 7 October, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

arXiv:2011.01053 [pdf, ps, other]

Fractionally balanced hypergraphs and rainbow KKM theorems

Authors: Ron Aharoni, Eli Berger, Joseph Briggs, Erel Segal-Halevi, Shira Zerbib

Abstract: A d-partite hypergraph is called *fractionally balanced* if there exists a non-negative, not identically zero, function on its edge set that has constant degrees in each vertex side. Using a topological version of Hall's theorem we prove lower bounds on the matching number of such hypergraphs. These bounds yield rainbow versions of the KKM theorem for products of simplices, which in turn are used to obtain some results on multiple-cake division, and on rainbow matchings in families of d-intervals.

Submitted 14 August, 2022; v1 submitted 2 November, 2020; originally announced November 2020.

arXiv:2009.11027 [pdf, other]

KoBE: Knowledge-Based Machine Translation Evaluation

Authors: Zorik Gekhman, Roee Aharoni, Genady Beryozkin, Markus Freitag, Wolfgang Macherey

Abstract: We propose a simple and effective method for machine translation evaluation which does not require reference translations. Our approach is based on (1) grounding the entity mentions found in each source sentence and candidate translation against a large-scale multilingual knowledge base, and (2) measuring the recall of the grounded entities found in the candidate vs. those found in the source. Our approach achieves the highest correlation with human judgements on 9 out of the 18 language pairs from the WMT19 benchmark for evaluation without references, which is the largest number of wins for a single evaluation method on this task. On 4 language pairs, we also achieve higher correlation with human judgements than BLEU. To foster further research, we release a dataset containing 1.8 million grounded entity mentions across 18 language pairs from the WMT19 metrics track data.

Submitted 23 September, 2020; originally announced September 2020.

Comments: Accepted as a short paper in Findings of EMNLP 2020

arXiv:2008.04637 [pdf, other]

Real-Time Sign Language Detection using Human Pose Estimation

Authors: Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, Srini Narayanan

Abstract: We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still working under 4ms. We describe a demo application to sign language detection in the browser in order to demonstrate its usage possibility in videoconferencing applications.

Submitted 13 September, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

Comments: 10 pages

arXiv:2007.09719 [pdf, ps, other]

doi 10.1137/20M1380557

Rainbow odd cycles

Authors: Ron Aharoni, Joseph Briggs, Ron Holzman, Zilin Jiang

Abstract: We prove that every family of (not necessarily distinct) odd cycles $O_1, \dots, O_{2\lceil n/2 \rceil-1}$ in the complete graph $K_n$ on $n$ vertices has a rainbow odd cycle (that is, a set of edges from distinct $O_i$'s, forming an odd cycle). As part of the proof, we characterize those families of $n$ odd cycles in $K_{n+1}$ that do not have any rainbow odd cycle. We also characterize those families of $n$ cycles in $K_{n+1}$, as well as those of $n$ edge-disjoint nonempty subgraphs of $K_{n+1}$, without any rainbow cycle.

Submitted 20 September, 2021; v1 submitted 19 July, 2020; originally announced July 2020.

Comments: 14 pages, 2 figures, accepted to SIAM Journal on Discrete Mathematics (SIDMA)

MSC Class: 05C38 (Primary) 05C70; 05B35 (Secondary)

Journal ref: SIAM Journal on Discrete Mathematics, Volume 35, Issue 4, pp 2293-2303, October 2021

arXiv:2004.07590 [pdf, other]

Badges and rainbow matchings

Authors: Ron Aharoni, Joseph Briggs, Jinha Kim, Minki Kim

Abstract: Drisko proved that $2n-1$ matchings of size $n$ in a bipartite graph have a rainbow matching of size $n$. For general graphs it is conjectured that $2n$ matchings suffice for this purpose (and that $2n-1$ matchings suffice when $n$ is even). The known graphs showing sharpness of this conjecture for $n$ even are called badges. We improve the previously best known bound from $3n-2$ to $3n-3$, using a new line of proof that involves analysis of the appearance of badges. We also prove a "cooperative" generalization: for $t>0$ and $n \geq 3$, any $3n-4+t$ sets of edges, the union of every $t$ of which contains a matching of size $n$, have a rainbow matching of size $n$.

Submitted 15 February, 2021; v1 submitted 16 April, 2020; originally announced April 2020.

Comments: Accepted for publication in Discrete Mathematics. 19 pages, 2 figures

arXiv:2004.02105 [pdf, other]

Unsupervised Domain Clusters in Pretrained Language Models

Authors: Roee Aharoni, Yoav Goldberg

Abstract: The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are many times unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision -- suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.

Submitted 1 May, 2020; v1 submitted 5 April, 2020; originally announced April 2020.

Comments: Accepted as a long paper in ACL 2020

arXiv:2003.08247 [pdf, ps, other]

Cooperative conditions for the existence of rainbow matchings

Authors: Ron Aharoni, Joseph Briggs, Minho Cho, Jinha Kim

Abstract: Let $k>1$, and let $\mathcal{F}$ be a family of $2n+k-3$ non-empty sets of edges in a bipartite graph. If the union of every $k$ members of $\mathcal{F}$ contains a matching of size $n$, then there exists an $\mathcal{F}$-rainbow matching of size $n$. Replacing $2n+k-3$ by $2n+k-2$, the result is true also for $k=1$, and it can be proved (for all $k$) both topologically and by a relatively simple combinatorial argument. The main effort is in gaining the last $1$, which makes the result sharp.

Submitted 28 December, 2021; v1 submitted 18 March, 2020; originally announced March 2020.

arXiv:1910.09302 [pdf, other]

Diversify Your Datasets: Analyzing Generalization via Controlled Variance in Adversarial Datasets

Authors: Ohad Rozen, Vered Shwartz, Roee Aharoni, Ido Dagan

Abstract: Phenomenon-specific "adversarial" datasets have been recently designed to perform targeted stress-tests for particular inference types. Recent work (Liu et al., 2019a) proposed that such datasets can be utilized for training NLI and other types of models, often allowing to learn the phenomenon in focus and improve on the challenge dataset, indicating a "blind spot" in the original training data. Yet, although a model can improve in such a training process, it might still be vulnerable to other challenge datasets targeting the same phenomenon but drawn from a different distribution, such as having a different syntactic complexity level. In this work, we extend this method to drive conclusions about a model's ability to learn and generalize a target phenomenon rather than to "learn" a dataset, by controlling additional aspects in the adversarial datasets. We demonstrate our approach on two inference phenomena - dative alternation and numerical reasoning, elaborating, and in some cases contradicting, the results of Liu et al. Our methodology enables building better challenge datasets for creating more robust models, and may yield better model understanding and subsequent overarching improvements.

Submitted 21 October, 2019; originally announced October 2019.

Comments: CoNLL 2019

arXiv:1909.13143 [pdf, ps, other]

Rainbow independent sets in certain classes of graphs

Authors: Ron Aharoni, Joseph Briggs, Jinha Kim, Minki Kim

Abstract: For a given class $\mathcal{C}$ of graphs and given integers $m \leq n$, let $f_\mathcal{C}(n,m)$ be the minimal number $k$ such that every $k$ independent $n$-sets in any graph belonging to $\mathcal{C}$ have a (possibly partial) rainbow independent $m$-set. Motivated by known results on the finiteness and actual value of $f_\mathcal{C}(n,m)$ when $\mathcal{C}$ is the class of line graphs of graphs, we study this function for various other classes.

Submitted 28 September, 2019; originally announced September 2019.

arXiv:1903.07091 [pdf, other]

The Missing Ingredient in Zero-Shot Neural Machine Translation

Authors: Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, Wolfgang Macherey

Abstract: Multilingual Neural Machine Translation (NMT) models are capable of translating between multiple source and target languages. Despite various approaches to train such models, they have difficulty with zero-shot translation: translating between language pairs that were not together seen during training. In this paper we first diagnose why state-of-the-art multilingual NMT models that rely purely on parameter sharing fail to generalize to unseen language pairs. We then propose auxiliary losses on the NMT encoder that impose representational invariance across languages. Our simple approach vastly improves zero-shot translation quality without regressing on supervised directions. For the first time, on WMT14 English-French/German, we achieve zero-shot performance that is on par with pivoting. We also demonstrate the easy scalability of our approach to multiple languages on the IWSLT 2017 shared task.

Submitted 17 March, 2019; originally announced March 2019.

arXiv:1903.03467 [pdf, other]

Filling Gender & Number Gaps in Neural Machine Translation with Black-box Context Injection

Authors: Amit Moryossef, Roee Aharoni, Yoav Goldberg

Abstract: When translating from a language that does not morphologically mark information such as gender and number into a language that does, translation systems must "guess" this missing information, often leading to incorrect translations in the given context. We propose a black-box approach for injecting the missing information to a pre-trained neural machine translation system, allowing to control the morphological variations in the generated translations without changing the underlying model or training data. We evaluate our method on an English to Hebrew translation task, and show that it is effective in injecting the gender and number information and that supplying the correct information improves the translation accuracy in up to 2.3 BLEU on a female-speaker test set for a state-of-the-art online black-box system. Finally, we perform a fine-grained syntactic analysis of the generated translations that shows the effectiveness of our method.

Submitted 8 March, 2019; originally announced March 2019.

Comments: 6 pages

arXiv:1903.00089 [pdf, other]

Massively Multilingual Neural Machine Translation

Authors: Roee Aharoni, Melvin Johnson, Orhan Firat

Abstract: Multilingual neural machine translation (NMT) enables training a single model that supports translation from multiple source languages into multiple target languages. In this paper, we push the limits of multilingual NMT in terms of number of languages being used. We perform extensive experiments in training massively multilingual NMT models, translating up to 102 languages to and from English within a single model. We explore different setups for training such models and analyze the trade-offs between translation quality and various modeling decisions. We report results on the publicly available TED talks multilingual corpus where we show that massively multilingual many-to-many models are effective in low resource settings, outperforming the previous state-of-the-art while supporting up to 59 languages. Our experiments on a large-scale dataset with 102 languages to and from English and up to one million examples per direction also show promising results, surpassing strong bilingual baselines and encouraging future work on massively multilingual NMT.

Submitted 2 July, 2019; v1 submitted 28 February, 2019; originally announced March 2019.

Comments: Accepted as a long paper in NAACL 2019

arXiv:1812.11872 [pdf, other]

A rainbow version of Mantel's Theorem

Authors: Ron Aharoni, Matt DeVos, Sebastián González Hermosillo de la Maza, Amanda Montejano, Robert Šámal

Abstract: Mantel's Theorem asserts that a simple $n$ vertex graph with more than $\frac{1}{4}n^2$ edges has a triangle (three mutually adjacent vertices). Here we consider a rainbow variant of this problem. We prove that whenever $G_1, G_2, G_3$ are simple graphs on a common set of $n$ vertices and $|E(G_i)| > ( \frac{ 26 - 2 \sqrt{7} }{81})n^2 \approx 0.2557 n^2$ for $1 \le i \le 3$, then there exist distinct vertices $v_1,v_2,v_3$ so that (working with the indices modulo 3) we have $v_i v_{i+1} \in E(G_i)$ for $1 \le i \le 3$. We provide an example to show this bound is best possible. This also answers a question of Diwan and Mubayi. We include a new short proof of Mantel's Theorem we obtained as a byproduct.

Submitted 25 February, 2020; v1 submitted 31 December, 2018; originally announced December 2018.

Comments: 12 pages, 3 figures

MSC Class: 05C35

arXiv:1806.06267 [pdf, ps, other]

doi 10.37236/8111

Cooperative colorings of trees and of bipartite graphs

Authors: Ron Aharoni, Eli Berger, Maria Chudnovsky, Frédéric Havet, Zilin Jiang

Abstract: Given a system $(G_1, \ldots ,G_m)$ of graphs on the same vertex set $V$, a cooperative coloring is a choice of vertex sets $I_1, \ldots ,I_m$, such that $I_j$ is independent in $G_j$ and $\bigcup_{j=1}^{m}I_j = V$. For a class $\mathcal{G}$ of graphs, let $m_{\mathcal{G}}(d)$ be the minimal $m$ such that every $m$ graphs from $\mathcal{G}$ with maximum degree $d$ have a cooperative coloring. We prove that $\Omega(\log\log d) \le m_\mathcal{T}(d) \le O(\log d)$ and $\Omega(\log d)\le m_\mathcal{B}(d) \le O(d/\log d)$, where $\mathcal{T}$ is the class of trees and $\mathcal{B}$ is the class of bipartite graphs.

Submitted 23 January, 2020; v1 submitted 16 June, 2018; originally announced June 2018.

Comments: 8 pages, 2 figures, accepted to the Electronic Journal of Combinatorics, corrections suggested by the referees have been incorporated

MSC Class: 05C15; 05C69

Journal ref: The Electronic Journal of Combinatorics, volume 27, issue 1, #P1.41, February 2020

arXiv:1805.09732 [pdf, ps, other]

doi 10.1007/s00493-019-4019-y

Rainbow fractional matchings

Authors: Ron Aharoni, Ron Holzman, Zilin Jiang

Abstract: We prove that any family $E_1, \ldots , E_{\lceil rn \rceil}$ of (not necessarily distinct) sets of edges in an $r$-uniform hypergraph, each having a fractional matching of size $n$, has a rainbow fractional matching of size $n$ (that is, a set of edges from distinct $E_i$'s which supports such a fractional matching). When the hypergraph is $r$-partite and $n$ is an integer, the number of sets needed goes down from $rn$ to $rn-r+1$. The problem solved here is a fractional version of the corresponding problem about rainbow matchings, which was solved by Drisko and by Aharoni and Berger in the case of bipartite graphs, but is open for general graphs as well as for $r$-partite hypergraphs with $r>2$. Our topological proof is based on a result of Kalai and Meshulam about a simplicial complex and a matroid on the same vertex set.

Submitted 6 May, 2019; v1 submitted 24 May, 2018; originally announced May 2018.

Comments: 10 pages, accepted to Combinatorica, corrections suggested by the referees have been incorporated

MSC Class: 05D15; 55U10

Journal ref: Combinatorica, Volume 39, Issue 6, pp 1191-1202, December 2019

arXiv:1805.01035 [pdf, other]

Split and Rephrase: Better Evaluation and a Stronger Baseline

Authors: Roee Aharoni, Yoav Goldberg

Abstract: Splitting and rephrasing a complex sentence into several shorter sentences that convey the same meaning is a challenging problem in NLP. We show that while vanilla seq2seq models can reach high scores on the proposed benchmark (Narayan et al., 2017), they suffer from memorization of the training set which contains more than 89% of the unique simple sentences from the validation and test sets. To aid this, we present a new train-development-test data split and neural models augmented with a copy-mechanism, outperforming the best reported baseline by 8.68 BLEU and fostering further progress on the task.

Submitted 2 May, 2018; originally announced May 2018.

Comments: Accepted as a short paper in ACL 2018

Showing 1–50 of 82 results for author: Aharoni, R