-
EuroGEST: Investigating gender stereotypes in multilingual language models
Authors:
Jacqueline Rowe,
Mateusz Klimaszewski,
Liane Guillou,
Shannon Vallor,
Alexandra Birch
Abstract:
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality…
▽ More
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric. We introduce EuroGEST, a dataset designed to measure gender-stereotypical reasoning in LLMs across English and 29 European languages. EuroGEST builds on an existing expert-informed benchmark covering 16 gender stereotypes, expanded in this work using translation tools, quality estimation metrics, and morphological heuristics. Human evaluations confirm that our data generation method results in high accuracy of both translations and gender labels across languages. We use EuroGEST to evaluate 24 multilingual language models from six model families, demonstrating that the strongest stereotypes in all models across all languages are that women are 'beautiful', 'empathetic' and 'neat' and men are 'leaders', 'strong, tough' and 'professional'. We also show that larger models encode gendered stereotypes more strongly and that instruction finetuning does not consistently reduce gendered stereotypes. Our work highlights the need for more multilingual studies of fairness in LLMs and offers scalable methods and resources to audit gender bias across languages.
△ Less
Submitted 4 June, 2025;
originally announced June 2025.
-
Distributed Discrete Morse Sandwich: Efficient Computation of Persistence Diagrams for Massive Scalar Data
Authors:
Eve Le Guillou,
Pierre Fortin,
Julien Tierny
Abstract:
The persistence diagram, which describes the topological features of a dataset, is a key descriptor in Topological Data Analysis. The "Discrete Morse Sandwich" (DMS) method has been reported to be the most efficient algorithm for computing persistence diagrams of 3D scalar fields on a single node, using shared-memory parallelism. In this work, we extend DMS to distributed-memory parallelism for th…
▽ More
The persistence diagram, which describes the topological features of a dataset, is a key descriptor in Topological Data Analysis. The "Discrete Morse Sandwich" (DMS) method has been reported to be the most efficient algorithm for computing persistence diagrams of 3D scalar fields on a single node, using shared-memory parallelism. In this work, we extend DMS to distributed-memory parallelism for the efficient and scalable computation of persistence diagrams for massive datasets across multiple compute nodes. On the one hand, we can leverage the embarrassingly parallel procedure of the first and most time-consuming step of DMS (namely the discrete gradient computation). On the other hand, the efficient distributed computations of the subsequent DMS steps are much more challenging. To address this, we have extensively revised the DMS routines by contributing a new self-correcting distributed pairing algorithm, redesigning key data structures and introducing computation tokens to coordinate distributed computations. We have also introduced a dedicated communication thread to overlap communication and computation. Detailed performance analyses show the scalability of our hybrid MPI+thread approach for strong and weak scaling using up to 16 nodes of 32 cores (512 cores total). Our algorithm outperforms DIPHA, a reference method for the distributed computation of persistence diagrams, with an average speedup of x8 on 512 cores. We show the practical capabilities of our approach by computing the persistence diagram of a public 3D scalar field of 6 billion vertices in 174 seconds on 512 cores. Finally, we provide a usage example of our open-source implementation at https://github.com/eve-le-guillou/DDMS-example.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
Authors:
Evangelia Gogoulou,
Shorouq Zahra,
Liane Guillou,
Luise Dürlich,
Joakim Nivre
Abstract:
A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks:…
▽ More
A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as hallucination. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and language and we investigate the impact of model size, instruction tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes
Authors:
Raúl Vázquez,
Timothee Mickus,
Elaine Zosa,
Teemu Vahtola,
Jörg Tiedemann,
Aman Sinha,
Vincent Segonne,
Fernando Sánchez-Vega,
Alessandro Raganato,
Jindřich Libovický,
Jussi Karlgren,
Shaoxiong Ji,
Jindřich Helcl,
Liane Guillou,
Ona de Gibert,
Jaione Bengoetxea,
Joseph Attieh,
Marianna Apidianaki
Abstract:
We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies…
▽ More
We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
△ Less
Submitted 28 April, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Authors:
Laurie Burchell,
Ona de Gibert,
Nikolay Arefyev,
Mikko Aulamo,
Marta Bañón,
Pinzhen Chen,
Mariia Fedorova,
Liane Guillou,
Barry Haddow,
Jan Hajič,
Jindřich Helcl,
Erik Henriksson,
Mateusz Klimaszewski,
Ville Komulainen,
Andrey Kutuzov,
Joona Kytöniemi,
Veronika Laippala,
Petter Mæhlum,
Bhavitvya Malik,
Farrokh Mehryary,
Vladislav Mikhailov,
Nikita Moghe,
Amanda Myntti,
Dayyán O'Brien,
Stephan Oepen
, et al. (10 additional authors not shown)
Abstract:
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 langu…
▽ More
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
△ Less
Submitted 4 June, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Evaluating the Meta- and Object-Level Reasoning of Large Language Models for Question Answering
Authors:
Nick Ferguson,
Liane Guillou,
Alan Bundy,
Kwabena Nuamah
Abstract:
Large Language Models (LLMs) excel in natural language tasks but still face challenges in Question Answering (QA) tasks requiring complex, multi-step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta-level reasoning (akin to high-level strategic reasoning or planning) and object-level reasoning (embodied in lower-level tasks such as mat…
▽ More
Large Language Models (LLMs) excel in natural language tasks but still face challenges in Question Answering (QA) tasks requiring complex, multi-step reasoning. We outline the types of reasoning required in some of these tasks, and reframe them in terms of meta-level reasoning (akin to high-level strategic reasoning or planning) and object-level reasoning (embodied in lower-level tasks such as mathematical reasoning). Franklin, a novel dataset with requirements of meta- and object-level reasoning, is introduced and used along with three other datasets to evaluate four LLMs at question answering tasks requiring multiple steps of reasoning. Results from human annotation studies suggest LLMs demonstrate meta-level reasoning with high frequency, but struggle with object-level reasoning tasks in some of the datasets used. Additionally, evidence suggests that LLMs find the object-level reasoning required for the questions in the Franklin dataset challenging, yet they do exhibit strong performance with respect to the meta-level reasoning requirements.
△ Less
Submitted 14 February, 2025;
originally announced February 2025.
-
Phase-Bounded Broadcast Networks over Topologies of Communication
Authors:
Lucie Guillou,
Arnaud Sangnier,
Nathalie Sznajder
Abstract:
We study networks of processes that all execute the same finite state protocol and that communicate through broadcasts. The processes are organized in a graph (a topology) and only the neighbors of a process in this graph can receive its broadcasts. The coverability problem asks, given a protocol and a state of the protocol, whether there is a topology for the processes such that one of them (at l…
▽ More
We study networks of processes that all execute the same finite state protocol and that communicate through broadcasts. The processes are organized in a graph (a topology) and only the neighbors of a process in this graph can receive its broadcasts. The coverability problem asks, given a protocol and a state of the protocol, whether there is a topology for the processes such that one of them (at least) reaches the given state. This problem is undecidable. We study here an under-approximation of the problem where processes alternate a bounded number of times $k$ between phases of broadcasting and phases of receiving messages. We show that, if the problem remains undecidable when $k$ is greater than 6, it becomes decidable for $k=2$, and EXPSPACE-complete for $k=1$. Furthermore, we show that if we restrict ourselves to line topologies, the problem is in $P$ for $k=1$ and $k=2$.
△ Less
Submitted 4 July, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
Safety Verification of Wait-Only Non-Blocking Broadcast Protocols
Authors:
Lucie Guillou,
Arnaud Sangnier,
Nathalie Sznajder
Abstract:
We study networks of processes that all execute the same finite protocol and communicate synchronously in two different ways: a process can broadcast one message to all other processes or send it to at most one other process. In both cases, if no process can receive the message, it will still be sent. We establish a precise complexity class for two coverability problems with a parameterised number…
▽ More
We study networks of processes that all execute the same finite protocol and communicate synchronously in two different ways: a process can broadcast one message to all other processes or send it to at most one other process. In both cases, if no process can receive the message, it will still be sent. We establish a precise complexity class for two coverability problems with a parameterised number of processes: the state coverability problem and the configuration coverability problem. It is already known that these problems are Ackermann-hard (but decidable) in the general case. We show that when the protocol is Wait-Only, i.e., it has no state from which a process can send and receive messages, the complexity drops to P and PSPACE, respectively.
△ Less
Submitted 27 March, 2024;
originally announced March 2024.
-
Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets
Authors:
Nikita Moghe,
Arnisa Fazla,
Chantal Amrhein,
Tom Kocmi,
Mark Steedman,
Alexandra Birch,
Rico Sennrich,
Liane Guillou
Abstract:
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce…
▽ More
Recent machine translation (MT) metrics calibrate their effectiveness by correlating with human judgement but without any insights about their behaviour across different error types. Challenge sets are used to probe specific dimensions of metric behaviour but there are very few such datasets and they either focus on a limited number of phenomena or a limited number of language pairs. We introduce ACES, a contrastive challenge set spanning 146 language pairs, aimed at discovering whether metrics can identify 68 translation accuracy errors. These phenomena range from simple alterations at the word/character level to more complex errors based on discourse and real-world knowledge. We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks. We benchmark metric performance, assess their incremental performance over successive campaigns, and measure their sensitivity to a range of linguistic phenomena. We also investigate claims that Large Language Models (LLMs) are effective as MT evaluators by evaluating on ACES. Our results demonstrate that different metric families struggle with different phenomena and that LLM-based methods fail to demonstrate reliable performance. Our analyses indicate that most metrics ignore the source sentence, tend to prefer surface-level overlap and end up incorporating properties of base models which are not always beneficial. We expand ACES to include error span annotations, denoted as SPAN-ACES and we use this dataset to evaluate span-based error metrics showing these metrics also need considerable improvement. Finally, we provide a set of recommendations for building better MT metrics, including focusing on error labels instead of scores, ensembling, designing strategies to explicitly focus on the source sentence, focusing on semantic content and choosing the right base model for representations.
△ Less
Submitted 29 January, 2024;
originally announced January 2024.
-
Process-Commutative Distributed Objects: From Cryptocurrencies to Byzantine-Fault-Tolerant CRDTs
Authors:
Davide Frey,
Lucie Guillou,
Michel Raynal,
François Taïani
Abstract:
This paper explores the territory that lies between best-effort Byzantine-Fault-Tolerant Conflict-free Replicated Data Types (BFT CRDTs) and totally ordered distributed ledgers, such as those implemented by Blockchains. It formally characterizes a novel class of distributed objects that only requires a First In First Out (FIFO) order on the object operations from each process (taken individually).…
▽ More
This paper explores the territory that lies between best-effort Byzantine-Fault-Tolerant Conflict-free Replicated Data Types (BFT CRDTs) and totally ordered distributed ledgers, such as those implemented by Blockchains. It formally characterizes a novel class of distributed objects that only requires a First In First Out (FIFO) order on the object operations from each process (taken individually). The formalization leverages Mazurkiewicz traces to define legal sequences of operations and ensure both Strong Eventual Consistency (SEC) and Pipleline Consistency (PC). The paper presents a generic algorithm that implements this novel class of distributed objects both in a crash- and Byzantine setting. It also illustrates the practical interest of the proposed approach using four instances of this class of objects, namely money transfer, Petri nets, multi-sets, and concurrent work stealing dequeues.
△ Less
Submitted 8 March, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
ACES: Translation Accuracy Challenge Sets at WMT 2023
Authors:
Chantal Amrhein,
Nikita Moghe,
Liane Guillou
Abstract:
We benchmark the performance of segmentlevel metrics submitted to WMT 2023 using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs. The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. For each met…
▽ More
We benchmark the performance of segmentlevel metrics submitted to WMT 2023 using the ACES Challenge Set (Amrhein et al., 2022). The challenge set consists of 36K examples representing challenges from 68 phenomena and covering 146 language pairs. The phenomena range from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. For each metric, we provide a detailed profile of performance over a range of error categories as well as an overall ACES-Score for quick comparison. We also measure the incremental performance of the metrics submitted to both WMT 2023 and 2022. We find that 1) there is no clear winner among the metrics submitted to WMT 2023, and 2) performance change between the 2023 and 2022 versions of the metrics is highly variable. Our recommendations are similar to those from WMT 2022. Metric developers should focus on: building ensembles of metrics from different design families, developing metrics that pay more attention to the source and rely less on surface-level overlap, and carefully determining the influence of multilingual embeddings on MT evaluation.
△ Less
Submitted 2 November, 2023;
originally announced November 2023.
-
TTK is Getting MPI-Ready
Authors:
Eve Le Guillou,
Michael Will,
Pierre Guillou,
Jonas Lukasczyk,
Pierre Fortin,
Christoph Garth,
Julien Tierny
Abstract:
This system paper documents the technical foundations for the extension of the Topology ToolKit (TTK) to distributed-memory parallelism with the Message Passing Interface (MPI). While several recent papers introduced topology-based approaches for distributed-memory environments, these were reporting experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in thi…
▽ More
This system paper documents the technical foundations for the extension of the Topology ToolKit (TTK) to distributed-memory parallelism with the Message Passing Interface (MPI). While several recent papers introduced topology-based approaches for distributed-memory environments, these were reporting experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in this paper a versatile approach (supporting both triangulated domains and regular grids) for the support of topological analysis pipelines, i.e. a sequence of topological algorithms interacting together. While developing this extension, we faced several algorithmic and software engineering challenges, which we document in this paper. We describe an MPI extension of TTK's data structure for triangulation representation and traversal, a central component to the global performance and generality of TTK's topological implementations. We also introduce an intermediate interface between TTK and MPI, both at the global pipeline level, and at the fine-grain algorithmic level. We provide a taxonomy for the distributed-memory topological algorithms supported by TTK, depending on their communication needs and provide examples of hybrid MPI+thread parallelizations. Performance analyses show that parallel efficiencies range from 20% to 80% (depending on the algorithms), and that the MPI-specific preconditioning introduced by our framework induces a negligible computation time overhead. We illustrate the new distributed-memory capabilities of TTK with an example of advanced analysis pipeline, combining multiple algorithms, run on the largest publicly available dataset we have found (120 billion vertices) on a cluster with 64 nodes (for a total of 1536 cores). Finally, we provide a roadmap for the completion of TTK's MPI extension, along with generic recommendations for each algorithm communication category.
△ Less
Submitted 15 April, 2024; v1 submitted 12 October, 2023;
originally announced October 2023.
-
Safety Analysis of Parameterised Networks with Non-Blocking Rendez-Vous
Authors:
Lucie Guillou,
Arnaud Sangnier,
Nathalie Sznajder
Abstract:
We consider networks of processes that all execute the same finite-state protocol and communicate via a rendez-vous mechanism. When a process requests a rendez-vous, another process can respond to it and they both change their control states accordingly. We focus here on a specific semantics, called non-blocking, where the process requesting a rendez-vous can change its state even if no process ca…
▽ More
We consider networks of processes that all execute the same finite-state protocol and communicate via a rendez-vous mechanism. When a process requests a rendez-vous, another process can respond to it and they both change their control states accordingly. We focus here on a specific semantics, called non-blocking, where the process requesting a rendez-vous can change its state even if no process can respond to it. In this context, we study the parameterised coverability problem of a configuration, which consists in determining whether there is an initialnumber of processes and an execution allowing to reach a configuration bigger than a given one. We show that this problem is EXPSPACE-complete and can be solved in polynomial time if the protocol is partitioned into two sets of states, the states from which a process can request a rendez-vous and the ones from which it can answer one. We also prove that the problem of the existence of an execution bringing all the processes in a final state is undecidable in our context. These two problems can be solved in polynomial time with the classical rendez-vous semantics.
△ Less
Submitted 10 July, 2023;
originally announced July 2023.
-
Parameterized Broadcast Networks with Registers: from NP to the Frontiers of Decidability
Authors:
Lucie Guillou,
Corto Mascle,
Nicolas Waldburger
Abstract:
We consider the parameterized verification of arbitrarily large networks of agents which communicate by broadcasting and receiving messages. In our model, the broadcast topology is reconfigurable so that a sent message can be received by any set of agents. In addition, agents have local registers which are initially distinct and may therefore be thought of as identifiers. When an agent broadcasts…
▽ More
We consider the parameterized verification of arbitrarily large networks of agents which communicate by broadcasting and receiving messages. In our model, the broadcast topology is reconfigurable so that a sent message can be received by any set of agents. In addition, agents have local registers which are initially distinct and may therefore be thought of as identifiers. When an agent broadcasts a message, it appends to the message the value stored in one of its registers. Upon reception, an agent can store the received value or test this value for equality with one of its own registers. We consider the coverability problem, where one asks whether a given state of the system may be reached by at least one agent. We establish that this problem is decidable; however, it is as hard as coverability in lossy channel systems, which is non-primitive recursive. This model lies at the frontier of decidability as other classical problems on this model are undecidable; this is in particular true for the target problem where all processes must synchronize on a given state. By contrast, we show that the coverability problem is NP-complete when each agent has only one register.
△ Less
Submitted 4 March, 2024; v1 submitted 2 June, 2023;
originally announced June 2023.
-
MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue
Authors:
Nikita Moghe,
Evgeniia Razumovskaia,
Liane Guillou,
Ivan Vulić,
Anna Korhonen,
Alexandra Birch
Abstract:
Task-oriented dialogue (TOD) systems have been widely deployed in many industries as they deliver more efficient customer support. These systems are typically constructed for a single domain or language and do not generalise well beyond this. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingu…
▽ More
Task-oriented dialogue (TOD) systems have been widely deployed in many industries as they deliver more efficient customer support. These systems are typically constructed for a single domain or language and do not generalise well beyond this. To support work on Natural Language Understanding (NLU) in TOD across multiple languages and domains simultaneously, we constructed MULTI3NLU++, a multilingual, multi-intent, multi-domain dataset. MULTI3NLU++ extends the English only NLU++ dataset to include manual translations into a range of high, medium, and low resource languages (Spanish, Marathi, Turkish and Amharic), in two domains (BANKING and HOTELS). Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals, and therefore allows us to measure the realistic performance of TOD systems in a varied set of the world's languages. We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the NLU tasks of intent detection and slot labelling for TOD systems in the multilingual setting. The results demonstrate the challenging nature of the dataset, particularly in the low-resource language setting, offering ample room for future experimentation in multi-domain multilingual TOD setups.
△ Less
Submitted 19 June, 2023; v1 submitted 20 December, 2022;
originally announced December 2022.
-
ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics
Authors:
Chantal Amrhein,
Nikita Moghe,
Liane Guillou
Abstract:
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accurac…
▽ More
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
△ Less
Submitted 6 December, 2022; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Investigating the use of Paraphrase Generation for Question Reformulation in the FRANK QA system
Authors:
Nick Ferguson,
Liane Guillou,
Kwabena Nuamah,
Alan Bundy
Abstract:
We present a study into the ability of paraphrase generation methods to increase the variety of natural language questions that the FRANK Question Answering system can answer. We first evaluate paraphrase generation methods on the LC-QuAD 2.0 dataset using both automatic metrics and human judgement, and discuss their correlation. Error analysis on the dataset is also performed using both automatic…
▽ More
We present a study into the ability of paraphrase generation methods to increase the variety of natural language questions that the FRANK Question Answering system can answer. We first evaluate paraphrase generation methods on the LC-QuAD 2.0 dataset using both automatic metrics and human judgement, and discuss their correlation. Error analysis on the dataset is also performed using both automatic and manual approaches, and we discuss how paraphrase generation and evaluation is affected by data points which contain error. We then simulate an implementation of the best performing paraphrase generation method (an English-French backtranslation) into FRANK in order to test our original hypothesis, using a small challenge dataset. Our two main conclusions are that cleaning of LC-QuAD 2.0 is required as the errors present can affect evaluation; and that, due to limitations of FRANK's parser, paraphrase generation is not a method which we can rely on to improve the variety of natural language questions that FRANK can answer.
△ Less
Submitted 6 June, 2022;
originally announced June 2022.
-
Cross-lingual Inference with A Chinese Entailment Graph
Authors:
Tianyi Li,
Sabine Weber,
Mohammad Javad Hosseini,
Liane Guillou,
Mark Steedman
Abstract:
Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing…
▽ More
Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset under the FIGER type ontology. Through experiments on the Levy-Holt dataset, we verify the strength of our Chinese entailment graph, and reveal the cross-lingual complementarity: on the parallel Levy-Holt dataset, an ensemble of Chinese and English entailment graphs outperforms both monolingual graphs, and raises unsupervised SOTA by 4.7 AUC points.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
Parameterized Analysis of Reconfigurable Broadcast Networks (Long Version)
Authors:
A. R. Balasubramanian,
Lucie Guillou,
Chana Weil-Kennedy
Abstract:
Reconfigurable broadcast networks (RBN) are a model of distributed computation in which agents can broadcast messages to other agents using some underlying communication topology which can change arbitrarily over the course of executions. In this paper, we conduct parameterized analysis of RBN. We consider cubes,(infinite) sets of configurations in the form of lower and upper bounds on the number…
▽ More
Reconfigurable broadcast networks (RBN) are a model of distributed computation in which agents can broadcast messages to other agents using some underlying communication topology which can change arbitrarily over the course of executions. In this paper, we conduct parameterized analysis of RBN. We consider cubes,(infinite) sets of configurations in the form of lower and upper bounds on the number of agents in each state, and we show that we can evaluate boolean combinations over cubes and reachability sets of cubes in PSPACE. In particular, reachability from a cube to another cube is a PSPACE-complete problem.
To prove the upper bound for this parameterized analysis, we prove some structural properties about the reachability sets and the symbolic graph abstraction of RBN, which might be of independent interest. We justify this claim by providing two applications of these results. First, we show that the almost-sure coverability problem is PSPACE-complete for RBN, thereby closing a complexity gap from a previous paper. Second, we define a computation model using RBN, à la population protocols, called RBN protocols. We characterize precisely the set of predicates that can be computed by such protocols.
△ Less
Submitted 11 July, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
Blindness to Modality Helps Entailment Graph Mining
Authors:
Liane Guillou,
Sander Bijl de Vroe,
Mark Johnson,
Mark Steedman
Abstract:
Understanding linguistic modality is widely seen as important for downstream tasks such as Question Answering and Knowledge Graph Population. Entailment Graph learning might also be expected to benefit from attention to modality. We build Entailment Graphs using a news corpus filtered with a modality parser, and show that stripping modal modifiers from predicates in fact increases performance. Thi…
▽ More
Understanding linguistic modality is widely seen as important for downstream tasks such as Question Answering and Knowledge Graph Population. Entailment Graph learning might also be expected to benefit from attention to modality. We build Entailment Graphs using a news corpus filtered with a modality parser, and show that stripping modal modifiers from predicates in fact increases performance. This suggests that for some tasks, the pragmatics of modal modification of predicates allows them to contribute as evidence of entailment.
△ Less
Submitted 21 September, 2021;
originally announced September 2021.
-
Incorporating Temporal Information in Entailment Graph Mining
Authors:
Liane Guillou,
Sander Bijl de Vroe,
Mohammad Javad Hosseini,
Mark Johnson,
Mark Steedman
Abstract:
We present a novel method for injecting temporality into entailment graphs to address the problem of spurious entailments, which may arise from similar but temporally distinct events involving the same pair of entities. We focus on the sports domain in which the same pairs of teams play on different occasions, with different outcomes. We present an unsupervised model that aims to learn entailments…
▽ More
We present a novel method for injecting temporality into entailment graphs to address the problem of spurious entailments, which may arise from similar but temporally distinct events involving the same pair of entities. We focus on the sports domain in which the same pairs of teams play on different occasions, with different outcomes. We present an unsupervised model that aims to learn entailments such as win/lose $\rightarrow$ play, while avoiding the pitfall of learning non-entailments such as win $\not\rightarrow$ lose. We evaluate our model on a manually constructed dataset, showing that incorporating time intervals and applying a temporal window around them, are effective strategies.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Modality and Negation in Event Extraction
Authors:
Sander Bijl de Vroe,
Liane Guillou,
Miloš Stanojević,
Nick McKenna,
Mark Steedman
Abstract:
Language provides speakers with a rich system of modality for expressing thoughts about events, without being committed to their actual occurrence. Modality is commonly used in the political news domain, where both actual and possible courses of events are discussed. NLP systems struggle with these semantic phenomena, often incorrectly extracting events which did not happen, which can lead to issu…
▽ More
Language provides speakers with a rich system of modality for expressing thoughts about events, without being committed to their actual occurrence. Modality is commonly used in the political news domain, where both actual and possible courses of events are discussed. NLP systems struggle with these semantic phenomena, often incorrectly extracting events which did not happen, which can lead to issues in downstream applications. We present an open-domain, lexicon-based event extraction system that captures various types of modality. This information is valuable for Question Answering, Knowledge Graph construction and Fact-checking tasks, and our evaluation shows that the system is sufficiently strong to be used in downstream applications.
△ Less
Submitted 20 September, 2021;
originally announced September 2021.
-
Multivalent Entailment Graphs for Question Answering
Authors:
Nick McKenna,
Liane Guillou,
Mohammad Javad Hosseini,
Sander Bijl de Vroe,
Mark Johnson,
Mark Steedman
Abstract:
Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) entails WIN(Bid…
▽ More
Drawing inferences between open-domain natural language predicates is a necessity for true language understanding. There has been much progress in unsupervised learning of entailment graphs for this purpose. We make three contributions: (1) we reinterpret the Distributional Inclusion Hypothesis to model entailment between predicates of different valencies, like DEFEAT(Biden, Trump) entails WIN(Biden); (2) we actualize this theory by learning unsupervised Multivalent Entailment Graphs of open-domain predicates; and (3) we demonstrate the capabilities of these graphs on a novel question answering task. We show that directional entailment is more helpful for inference than bidirectional similarity on questions of fine-grained semantics. We also show that drawing on evidence across valencies answers more questions than by using only the same valency evidence.
△ Less
Submitted 19 September, 2021; v1 submitted 15 April, 2021;
originally announced April 2021.
-
Findings of the 2016 WMT Shared Task on Cross-lingual Pronoun Prediction
Authors:
Liane Guillou,
Christian Hardmeier,
Preslav Nakov,
Sara Stymne,
Jörg Tiedemann,
Yannick Versley,
Mauro Cettolo,
Bonnie Webber,
Andrei Popescu-Belis
Abstract:
We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction. This is a classification task in which participants are asked to provide predictions on what pronoun class label should replace a placeholder value in the target-language text, provided in lemmatised and PoS-tagged form. We provided four subtasks, for the English-French an…
▽ More
We describe the design, the evaluation setup, and the results of the 2016 WMT shared task on cross-lingual pronoun prediction. This is a classification task in which participants are asked to provide predictions on what pronoun class label should replace a placeholder value in the target-language text, provided in lemmatised and PoS-tagged form. We provided four subtasks, for the English-French and English-German language pairs, in both directions. Eleven teams participated in the shared task; nine for the English-French subtask, five for French-English, nine for English-German, and six for German-English. Most of the submissions outperformed two strong language-model based baseline systems, with systems using deep recurrent neural networks outperforming those using other architectures for most language pairs.
△ Less
Submitted 27 November, 2019;
originally announced November 2019.
-
Pronoun Translation in English-French Machine Translation: An Analysis of Error Types
Authors:
Christian Hardmeier,
Liane Guillou
Abstract:
Pronouns are a long-standing challenge in machine translation. We present a study of the performance of a range of rule-based, statistical and neural MT systems on pronoun translation based on an extensive manual evaluation using the PROTEST test suite, which enables a fine-grained analysis of different pronoun types and sheds light on the difficulties of the task. We find that the rule-based appr…
▽ More
Pronouns are a long-standing challenge in machine translation. We present a study of the performance of a range of rule-based, statistical and neural MT systems on pronoun translation based on an extensive manual evaluation using the PROTEST test suite, which enables a fine-grained analysis of different pronoun types and sheds light on the difficulties of the task. We find that the rule-based approaches in our corpus perform poorly as a result of oversimplification, whereas SMT and early NMT systems exhibit significant shortcomings due to a lack of awareness of the functional and referential properties of pronouns. A recent Transformer-based NMT system with cross-sentence context shows very promising results on non-anaphoric pronouns and intra-sentential anaphora, but there is still considerable room for improvement in examples with cross-sentence dependencies.
△ Less
Submitted 30 August, 2018;
originally announced August 2018.
-
Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point
Authors:
Liane Guillou,
Christian Hardmeier
Abstract:
We compare the performance of the APT and AutoPRF metrics for pronoun translation against a manually annotated dataset comprising human judgements as to the correctness of translations of the PROTEST test suite. Although there is some correlation with the human judgements, a range of issues limit the performance of the automated metrics. Instead, we recommend the use of semi-automatic metrics and…
▽ More
We compare the performance of the APT and AutoPRF metrics for pronoun translation against a manually annotated dataset comprising human judgements as to the correctness of translations of the PROTEST test suite. Although there is some correlation with the human judgements, a range of issues limit the performance of the automated metrics. Instead, we recommend the use of semi-automatic metrics and test suites in place of fully automatic metrics.
△ Less
Submitted 13 August, 2018;
originally announced August 2018.