-
Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Authors:
Marek Kadlčík,
Michal Štefánik,
Timothee Mickus,
Michal Spiegel,
Josef Kuchař
Abstract:
Pretrained language models (LMs) are prone to arithmetic errors. Existing work showed limited success in probing numeric values from models' representations, indicating that these errors can be attributed to the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings with sinusoidal patterns.
In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This proves that after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the embeddings' precision, as judged by our probe's accuracy, explains a large portion of LMs' errors in elementary arithmetic, and show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
Submitted 10 June, 2025;
originally announced June 2025.
-
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes
Authors:
Raúl Vázquez,
Timothee Mickus,
Elaine Zosa,
Teemu Vahtola,
Jörg Tiedemann,
Aman Sinha,
Vincent Segonne,
Fernando Sánchez-Vega,
Alessandro Raganato,
Jindřich Libovický,
Jussi Karlgren,
Shaoxiong Ji,
Jindřich Helcl,
Liane Guillou,
Ona de Gibert,
Jaione Bengoetxea,
Joseph Attieh,
Marianna Apidianaki
Abstract:
We present the Mu-SHROOM shared task which is focused on detecting hallucinations and other overgeneration mistakes in the output of instruction-tuned large language models (LLMs). Mu-SHROOM addresses general-purpose LLMs in 14 languages, and frames the hallucination detection problem as a span-labeling task. We received 2,618 submissions from 43 participating teams employing diverse methodologies. The large number of submissions underscores the interest of the community in hallucination detection. We present the results of the participating systems and conduct an empirical analysis to identify key factors contributing to strong performance in this task. We also emphasize relevant current challenges, notably the varying degree of hallucinations across languages and the high annotator disagreement when labeling hallucination spans.
Submitted 28 April, 2025; v1 submitted 16 April, 2025;
originally announced April 2025.
-
Your Model is Overconfident, and Other Lies We Tell Ourselves
Authors:
Timothee Mickus,
Aman Sinha,
Raúl Vázquez
Abstract:
The difficulty intrinsic to a given example, rooted in its inherent ambiguity, is a key yet often overlooked factor in evaluating neural NLP models. We investigate the interplay and divergence among various metrics for assessing intrinsic difficulty, including annotator dissensus, training dynamics, and model confidence. Through a comprehensive analysis using 29 models on three datasets, we reveal that while correlations exist among these metrics, their relationships are neither linear nor monotonic. By disentangling these dimensions of uncertainty, we aim to refine our understanding of data complexity and its implications for evaluating and improving NLP models.
Submitted 3 March, 2025;
originally announced March 2025.
-
A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives
Authors:
Zihao Li,
Shaoxiong Ji,
Timothee Mickus,
Vincent Segonne,
Jörg Tiedemann
Abstract:
Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since insights gained from monolingual English models may not necessarily apply to more complex multilingual models. One significant caveat of the current state of the art is that different works are rarely comparable: they often discuss different parameter counts, training data, and evaluation methodology.
This paper proposes a comparison of multilingual pretraining objectives in a controlled methodological environment. We ensure that training data and model architectures are comparable, and discuss the downstream performances across 6 languages that we observe in probing and fine-tuning scenarios. We make two key observations: (1) the architecture dictates which pretraining objective is optimal; (2) multilingual translation is a very effective pretraining objective under the right conditions. We make our code, data, and model weights available at https://github.com/Helsinki-NLP/lm-vs-mt.
Submitted 7 October, 2024; v1 submitted 22 July, 2024;
originally announced July 2024.
-
Domain-specific or Uncertainty-aware models: Does it really make a difference for biomedical text classification?
Authors:
Aman Sinha,
Timothee Mickus,
Marianne Clausel,
Mathieu Constant,
Xavier Coubez
Abstract:
The success of pretrained language models (PLMs) across a spate of use-cases has led to significant investment from the NLP community towards building domain-specific foundational models. On the other hand, in mission-critical settings such as biomedical applications, other aspects also factor in, chief of which is a model's ability to produce reasonable estimates of its own uncertainty. In the present study, we discuss these two desiderata through the lens of how they shape the entropy of a model's output probability distribution. We find that domain specificity and uncertainty awareness can often be successfully combined, but the exact task at hand weighs in much more strongly.
Submitted 17 July, 2024;
originally announced July 2024.
-
AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling
Authors:
Mariia Fedorova,
Timothee Mickus,
Niko Partanen,
Janine Siewert,
Elena Spaziani,
Andrey Kutuzov
Abstract:
This paper describes the organization and findings of AXOLOTL'24, the first multilingual explainable semantic change modeling shared task. We present new sense-annotated diachronic semantic change datasets for Finnish and Russian which were employed in the shared task, along with a surprise test-only German dataset borrowed from an existing source. The setup of AXOLOTL'24 is new to the semantic change modeling field, and involves subtasks of identifying unknown (novel) senses and providing dictionary-like definitions to these senses. The methods of the winning teams are described and compared, thus paving a path towards explainability in computational approaches to historical change of meaning.
Submitted 4 July, 2024;
originally announced July 2024.
-
I Have an Attention Bridge to Sell You: Generalization Capabilities of Modular Translation Architectures
Authors:
Timothee Mickus,
Raúl Vázquez,
Joseph Attieh
Abstract:
Modularity is a paradigm of machine translation with the potential of bringing forth models that are large at training time and small during inference. Within this field of study, modular approaches, and in particular attention bridges, have been argued to improve the generalization capabilities of models by fostering language-independent representations. In the present paper, we study whether modularity affects translation quality; as well as how well modular architectures generalize across different evaluation scenarios. For a given computational budget, we find non-modular architectures to be always comparable or preferable to all modular designs we study.
Submitted 30 April, 2024; v1 submitted 27 April, 2024;
originally announced April 2024.
-
Can Machine Translation Bridge Multilingual Pretraining and Cross-lingual Transfer Learning?
Authors:
Shaoxiong Ji,
Timothee Mickus,
Vincent Segonne,
Jörg Tiedemann
Abstract:
Multilingual pretraining and fine-tuning have remarkably succeeded in various natural language processing tasks. Transferring representations from one language to another is especially crucial for cross-lingual learning. One can expect machine translation objectives to be well suited to fostering such capabilities, as they involve the explicit alignment of semantically equivalent sentences from different languages. This paper investigates the potential benefits of employing machine translation as a continued training objective to enhance language representation learning, bridging multilingual pretraining and cross-lingual applications. We study this question through two lenses: a quantitative evaluation of the performance of existing models and an analysis of their latent representations. Our results show that, contrary to expectations, machine translation as a continued training objective fails to enhance cross-lingual representation learning in multiple cross-lingual natural language understanding tasks. We conclude that explicit sentence-level alignment in the cross-lingual scenario is detrimental to cross-lingual transfer pretraining, which has important implications for future cross-lingual transfer studies. We furthermore provide evidence through similarity measures and investigation of parameters that this lack of positive influence is due to output separability -- which we argue is of use for machine translation but detrimental elsewhere.
Submitted 25 March, 2024;
originally announced March 2024.
-
SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Authors:
Timothee Mickus,
Elaine Zosa,
Raúl Vázquez,
Teemu Vahtola,
Jörg Tiedemann,
Vincent Segonne,
Alessandro Raganato,
Marianna Apidianaki
Abstract:
This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling.
The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 27 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled -- many participants rely on a handful of models, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.
Submitted 29 March, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
MAMMOTH: Massively Multilingual Modular Open Translation @ Helsinki
Authors:
Timothee Mickus,
Stig-Arne Grönroos,
Joseph Attieh,
Michele Boggia,
Ona De Gibert,
Shaoxiong Ji,
Niki Andreas Lopi,
Alessandro Raganato,
Raúl Vázquez,
Jörg Tiedemann
Abstract:
NLP in the age of monolithic large language models is approaching its limits in terms of size and information that can be handled. The trend goes to modularization, a necessary step into the direction of designing smaller sub-networks and components with specialized functionality. In this paper, we present the MAMMOTH toolkit: a framework designed for training massively multilingual modular machine translation systems at scale, initially derived from OpenNMT-py and then adapted to ensure efficient training across computation clusters. We showcase its efficiency across clusters of A100 and V100 NVIDIA GPUs, and discuss our design philosophy and plans for future information. The toolkit is publicly available online.
Submitted 12 March, 2024;
originally announced March 2024.
-
Isotropy, Clusters, and Classifiers
Authors:
Timothee Mickus,
Stig-Arne Grönroos,
Joseph Attieh
Abstract:
Whether embedding spaces use all their dimensions equally, i.e., whether they are isotropic, has been a recent subject of discussion. Evidence has been accrued both for and against enforcing isotropy in embedding spaces. In the present paper, we stress that isotropy imposes requirements on the embedding space that are not compatible with the presence of clusters -- which also negatively impacts linear classification objectives. We demonstrate this fact both mathematically and empirically and use it to shed light on previous results from the literature.
Submitted 27 May, 2024; v1 submitted 5 February, 2024;
originally announced February 2024.
-
Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding
Authors:
Timothee Mickus,
Elaine Zosa,
Denis Paperno
Abstract:
Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. Literature has divided into two camps: While some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges that come with studying grounding and its effects on NLP systems.
In this paper, we establish a methodological framework for studying what the effects are - if any - of providing models with richer input sources than text-only. The crux of it lies in the construction of comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performances. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level as well as for specific word representations, depending on how concrete their semantics is.
Submitted 18 October, 2023;
originally announced October 2023.
-
Why bother with geometry? On the relevance of linear decompositions of Transformer embeddings
Authors:
Timothee Mickus,
Raúl Vázquez
Abstract:
A recent body of work has demonstrated that Transformer embeddings can be linearly decomposed into well-defined sums of factors, which can in turn be related to specific network inputs or components. There is however still a dearth of work studying whether these mathematical reformulations are empirically meaningful. In the present work, we study representations from machine-translation decoders using two such embedding decomposition methods. Our results indicate that, while decomposition-derived indicators effectively correlate with model performance, variation across different runs suggests a more nuanced take on this question. The high variability of our measurements indicates that geometry reflects model-specific characteristics more than it does sentence-specific computations, and that similar training conditions do not guarantee similar vector spaces.
Submitted 10 October, 2023;
originally announced October 2023.
-
"Definition Modeling: To model definitions." Generating Definitions With Little to No Semantics
Authors:
Vincent Segonne,
Timothee Mickus
Abstract:
Definition Modeling, the task of generating definitions, was first proposed as a means to evaluate the semantic quality of word embeddings: a coherent lexical semantic representation of a word in context should contain all the information necessary to generate its definition. The relative novelty of this task entails that we do not know which factors are actually relied upon by a Definition Modeling system. In this paper, we present evidence that the task may not involve as much semantics as one might expect: we show how an earlier model from the literature is both rather insensitive to semantic aspects such as explicit polysemy and reliant on formal similarities between headwords and words occurring in its glosses, casting doubt on the validity of the task as a means to evaluate embeddings.
Submitted 14 June, 2023;
originally announced June 2023.
-
How to Dissect a Muppet: The Structure of Transformer Embedding Spaces
Authors:
Timothee Mickus,
Denis Paperno,
Mathieu Constant
Abstract:
Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
Submitted 7 June, 2022;
originally announced June 2022.
-
Semeval-2022 Task 1: CODWOE -- Comparing Dictionaries and Word Embeddings
Authors:
Timothee Mickus,
Kees van Deemter,
Mathieu Constant,
Denis Paperno
Abstract:
Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.
Submitted 27 May, 2022;
originally announced May 2022.
-
A Game Interface to Study Semantic Grounding in Text-Based Models
Authors:
Timothee Mickus,
Mathieu Constant,
Denis Paperno
Abstract:
Can language models learn grounded representations from text distribution alone? This question is both central and recurrent in natural language processing; authors generally agree that grounding requires more than textual distribution. We propose to experimentally test this claim: if any two words have different meanings and yet cannot be distinguished from distribution alone, then grounding is out of the reach of text-based models. To that end, we present early work on an online game for the collection of human judgments on the distributional similarity of word pairs in five languages. We further report early results of our data collection campaign.
Submitted 17 August, 2021;
originally announced August 2021.
-
What Meaning-Form Correlation Has to Compose With
Authors:
Timothee Mickus,
Timothée Bernard,
Denis Paperno
Abstract:
Compositionality is a widely discussed property of natural languages, although its exact definition has been elusive. We focus on the proposal that compositionality can be assessed by measuring meaning-form correlation. We analyze meaning-form correlation on three sets of languages: (i) artificial toy languages tailored to be compositional, (ii) a set of English dictionary definitions, and (iii) a set of English sentences drawn from literature. We find that linguistic phenomena such as synonymy and ungrounded stop-words weigh on MFC measurements, and that straightforward methods to mitigate their effects have widely varying results depending on the dataset they are applied to. Data and code are made publicly available.
Submitted 7 December, 2020;
originally announced December 2020.
-
IRIS: A Low Duty Cycle Cross-Layer Protocol for Long-Range Wireless Sensor Networks with Low Power Budget
Authors:
Yi Chu,
Paul Mitchell,
David Grace,
Jonathan Roberts,
Dominic White,
Tautvydas Mickus
Abstract:
This paper presents a cross-layer protocol (IRIS) designed for long-range pipeline Wireless Sensor Networks with extremely low power budget, typically seen in a range of monitoring applications. IRIS uses ping packets initiated by a base station to travel through the multi-hop network and carry monitoring information. The protocol is able to operate with less than 1% duty cycle, thereby conforming to ISM band spectrum regulations in the 868 MHz band. The duty cycle can be flexibly configured to meet other regulations/power budgets as well as to improve the route forming performance. Simulation results show guaranteed route formation in different network topologies with various protocol configurations. System robustness against unreliable wireless connections and node failures is also demonstrated by simulations.
Submitted 1 December, 2020;
originally announced December 2020.
-
What do you mean, BERT? Assessing BERT as a Distributional Semantics Model
Authors:
Timothee Mickus,
Denis Paperno,
Mathieu Constant,
Kees van Deemter
Abstract:
Contextualized word embeddings, i.e. vector representations for words in context, are naturally seen as an extension of previous noncontextual distributional semantic models. In this work, we focus on BERT, a deep neural network that produces contextualized embeddings and has set the state-of-the-art in several semantic tasks, and study the semantic coherence of its embedding space. While showing a tendency towards coherence, BERT does not fully live up to the natural expectations for a semantic vector space. In particular, we find that the position of the sentence in which a word occurs, while having no meaning correlates, leaves a noticeable trace on the word embeddings and disturbs similarity relationships.
Submitted 8 May, 2020; v1 submitted 13 November, 2019;
originally announced November 2019.
-
Mark my Word: A Sequence-to-Sequence Approach to Definition Modeling
Authors:
Timothee Mickus,
Denis Paperno,
Mathieu Constant
Abstract:
Defining words in a textual context is a useful task both for practical purposes and for gaining insight into distributed word representations. Building on the distributional hypothesis, we argue here that the most natural formalization of definition modeling is to treat it as a sequence-to-sequence task, rather than a word-to-sequence task: given an input sequence with a highlighted word, generate a contextually appropriate definition for it. We implement this approach in a Transformer-based sequence-to-sequence model. Our proposal allows training contextualization and definition generation in an end-to-end fashion, which is a conceptual improvement over earlier works. We achieve state-of-the-art results both in contextual and non-contextual definition modeling.
Submitted 13 November, 2019;
originally announced November 2019.