Showing 1–50 of 58 results for author: Camacho-Collados, J

  1. arXiv:2412.15375  [pdf, other]

    cs.CL

    Automatic Extraction of Metaphoric Analogies from Literary Texts: Task Formulation, Dataset Construction, and Evaluation

    Authors: Joanne Boisson, Zara Siddique, Hsuvas Borkakoty, Dimosthenis Antypas, Luis Espinosa Anke, Jose Camacho-Collados

    Abstract: Extracting metaphors and analogies from free text requires high-level reasoning abilities such as abstraction and language understanding. Our study focuses on the extraction of the concepts that form metaphoric analogies in literary texts. To this end, we construct a novel dataset in this domain with the help of domain experts. We compare the out-of-the-box ability of recent large language models…

    Submitted 19 December, 2024; originally announced December 2024.

    Comments: Accepted to COLING 2025, long paper

  2. arXiv:2411.19832  [pdf, ps, other]

    cs.CL

    Sensitive Content Classification in Social Media: A Holistic Resource and Evaluation

    Authors: Dimosthenis Antypas, Indira Sen, Carla Perez-Almendros, Jose Camacho-Collados, Francesco Barbieri

    Abstract: The detection of sensitive content in large datasets is crucial for ensuring that shared and analysed data is free from harmful material. However, current moderation tools, such as external APIs, suffer from limitations in customisation, accuracy across diverse sensitive categories, and privacy concerns. Additionally, existing datasets and open-source models focus predominantly on toxic language,…

    Submitted 24 June, 2025; v1 submitted 29 November, 2024; originally announced November 2024.

    Comments: Accepted at the 9th Workshop on Online Abuse and Harms (WOAH)

    ACM Class: I.2.7

  3. arXiv:2411.18260  [pdf, other]

    cs.CL

    MetaphorShare: A Dynamic Collaborative Repository of Open Metaphor Datasets

    Authors: Joanne Boisson, Arif Mehmood, Jose Camacho-Collados

    Abstract: The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among the researchers. Both in human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an ident…

    Submitted 10 March, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

    Comments: Accepted in NAACL 2025 system demonstration track

  4. arXiv:2410.03075  [pdf, other]

    cs.CL

    Multilingual Topic Classification in X: Dataset and Analysis

    Authors: Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Jose Camacho-Collados

    Abstract: In the dynamic realm of social media, diverse topics are discussed daily, transcending linguistic boundaries. However, the complexities of understanding and categorising this content across various languages remain an important challenge, with traditional techniques like topic modelling often struggling to accommodate this multilingual diversity. In this paper, we introduce X-Topic, a multilingual…

    Submitted 3 October, 2024; originally announced October 2024.

    Comments: Accepted at EMNLP 2024

  5. arXiv:2409.20246  [pdf, other]

    cs.CL

    Analysing Zero-Shot Readability-Controlled Sentence Simplification

    Authors: Abdullah Barayan, Jose Camacho-Collados, Fernando Alva-Manchego

    Abstract: Readability-controlled text simplification (RCTS) rewrites texts to lower readability levels while preserving their meaning. RCTS models often depend on parallel corpora with readability annotations on both source and target sides. Such datasets are scarce and difficult to curate, especially at the sentence level. To reduce reliance on parallel data, we explore using instruction-tuned large langua…

    Submitted 16 December, 2024; v1 submitted 30 September, 2024; originally announced September 2024.

    Comments: Accepted at COLING 2025

  6. arXiv:2406.13556  [pdf, other]

    cs.CL

    Evaluating Short-Term Temporal Fluctuations of Social Biases in Social Media Data and Masked Language Models

    Authors: Yi Zhou, Danushka Bollegala, Jose Camacho-Collados

    Abstract: Social biases such as gender or racial biases have been reported in language models (LMs), including Masked Language Models (MLMs). Given that MLMs are continuously trained with increasing amounts of additional data collected over time, an important yet unanswered question is how the social biases encoded with MLMs vary over time. In particular, the number of social media users continues to grow a…

    Submitted 19 June, 2024; originally announced June 2024.

  7. arXiv:2406.09948  [pdf, other]

    cs.CL

    BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages

    Authors: Junho Myung, Nayeon Lee, Yi Zhou, Jiho Jin, Rifki Afina Putri, Dimosthenis Antypas, Hsuvas Borkakoty, Eunsu Kim, Carla Perez-Almendros, Abinew Ali Ayele, Víctor Gutiérrez-Basulto, Yazmín Ibáñez-García, Hwaran Lee, Shamsuddeen Hassan Muhammad, Kiwoong Park, Anar Sabuhi Rzayev, Nina White, Seid Muhie Yimam, Mohammad Taher Pilehvar, Nedjma Ousidhoum, Jose Camacho-Collados, Alice Oh

    Abstract: Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food…

    Submitted 15 January, 2025; v1 submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to NeurIPS 2024 Datasets & Benchmark Track

  8. arXiv:2405.13017  [pdf, other]

    cs.CL cs.LG

    A Systematic Analysis on the Temporal Generalization of Language Models in Social Media

    Authors: Asahi Ushio, Jose Camacho-Collados

    Abstract: In machine learning, temporal shifts occur when there are differences between training and test splits in terms of time. For streaming data such as news or social media, models are commonly trained on a fixed corpus from a certain period of time, and they can become obsolete due to the dynamism and evolving nature of online content. This paper focuses on temporal shifts in social media and, in par…

    Submitted 15 May, 2024; originally announced May 2024.

  9. arXiv:2405.10213  [pdf, ps, other]

    cs.SI cs.CL cs.CY

    Words as Trigger Points in Social Media Discussions: A Large-Scale Case Study about UK Politics on Reddit

    Authors: Dimosthenis Antypas, Christian Arnold, Jose Camacho-Collados, Nedjma Ousidhoum, Carla Perez Almendros

    Abstract: Political debates on social media sometimes flare up. From that moment on, users engage much more with one another; their communication is also more emotional and polarised. While it has been difficult to grasp such moments with computational methods, we suggest that trigger points are a useful concept to understand and ultimately model such behaviour. Established in qualitative focus group interv…

    Submitted 24 June, 2025; v1 submitted 16 May, 2024; originally announced May 2024.

  10. arXiv:2403.17661  [pdf, other]

    cs.CL cs.AI

    Language Models for Text Classification: Is In-Context Learning Enough?

    Authors: Aleksandra Edwards, Jose Camacho-Collados

    Abstract: Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches based on fine-tuning is the ability to understand instructions written in natural language (prompts), which helps them generalise better to different tasks and domains without the need for specific training data. Th…

    Submitted 14 April, 2024; v1 submitted 26 March, 2024; originally announced March 2024.

    Comments: Accepted at LREC-COLING 2024

  11. arXiv:2311.00790  [pdf, other]

    cs.CL

    Construction Artifacts in Metaphor Identification Datasets

    Authors: Joanne Boisson, Luis Espinosa-Anke, Jose Camacho-Collados

    Abstract: Metaphor identification aims at understanding whether a given expression is used figuratively in context. However, in this paper we show how existing metaphor identification datasets can be gamed by fully ignoring the potential metaphorical expression or the context in which it occurs. We test this hypothesis in a variety of datasets and settings, and show that metaphor identification systems base…

    Submitted 15 November, 2023; v1 submitted 1 November, 2023; originally announced November 2023.

    Comments: Short paper accepted to EMNLP 2023 main conference

  12. arXiv:2310.14757  [pdf, other]

    cs.CL

    SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research

    Authors: Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Leonardo Neves, Kiamehr Rezaee, Luis Espinosa-Anke, Jiaxin Pei, Jose Camacho-Collados

    Abstract: Despite its relevance, the maturity of NLP for social media pales in comparison with general-purpose models, metrics and benchmarks. This fragmented landscape makes it hard for the community to know, for instance, given a task, which is the best performing model and how it compares with others. To alleviate this issue, we introduce a unified benchmark for NLP evaluation in social media, SuperTweet…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  13. arXiv:2310.12936  [pdf, other]

    cs.CL

    A Predictive Factor Analysis of Social Biases and Task-Performance in Pretrained Masked Language Models

    Authors: Yi Zhou, Jose Camacho-Collados, Danushka Bollegala

    Abstract: Various types of social biases have been reported with pretrained Masked Language Models (MLMs) in prior work. However, multiple underlying factors are associated with an MLM such as its model size, size of the training data, training objectives, the domain from which pretraining data is sampled, tokenization, and languages present in the pretrained corpora, to name a few. It remains unclear as to…

    Submitted 22 October, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 main conference

  14. arXiv:2310.00299  [pdf, other]

    cs.CL

    RelBERT: Embedding Relations with Language Models

    Authors: Asahi Ushio, Jose Camacho-Collados, Steven Schockaert

    Abstract: Many applications need access to background knowledge about how different concepts and entities are related. Although Knowledge Graphs (KG) and Large Language Models (LLM) can address this need to some extent, KGs are inevitably incomplete and their relational schema is often too coarse-grained, while LLMs are inefficient and difficult to control. As an alternative, we propose to extract relation…

    Submitted 8 October, 2023; v1 submitted 30 September, 2023; originally announced October 2023.

  15. arXiv:2308.16705  [pdf, other]

    cs.CL cs.AI

    Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis

    Authors: Nayeon Lee, Chani Jung, Junho Myung, Jiho Jin, Jose Camacho-Collados, Juho Kim, Alice Oh

    Abstract: Warning: this paper contains content that may be offensive or upsetting. Most hate speech datasets neglect the cultural diversity within a single language, resulting in a critical shortcoming in hate speech detection. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. To construct CREHate, we follow a two-step procedure: 1) cultural post collection and 2) cross-…

    Submitted 3 April, 2024; v1 submitted 31 August, 2023; originally announced August 2023.

    Comments: Accepted to NAACL 2024 Main Conference

  16. arXiv:2308.02142  [pdf, other]

    cs.CL cs.SI

    Tweet Insights: A Visualization Platform to Extract Temporal Insights from Twitter

    Authors: Daniel Loureiro, Kiamehr Rezaee, Talayeh Riahi, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, Jose Camacho-Collados

    Abstract: This paper introduces a large collection of time series data derived from Twitter, postprocessed using word embedding techniques, as well as specialized fine-tuned language models. This data comprises the past five years and captures changes in n-gram frequency, similarity, sentiment and topic distribution. The interface built on top of this data enables temporal analysis for detecting and charact…

    Submitted 4 August, 2023; originally announced August 2023.

    Comments: Demo paper. Visualization platform available at https://tweetnlp.org/insights

  17. arXiv:2307.01680  [pdf, other]

    cs.CL

    Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation

    Authors: Dimosthenis Antypas, Jose Camacho-Collados

    Abstract: The automatic detection of hate speech online is an active research area in NLP. Most of the studies to date are based on social media datasets that contribute to the creation of hate speech detection models trained on them. However, data creation processes contain their own biases, and models inherently learn from these dataset-specific biases. In this paper, we perform a large-scale cross-datase…

    Submitted 4 July, 2023; originally announced July 2023.

    Comments: Accepted in "Workshop on Online Abuse and Harms (WOAH)", 2023

    ACM Class: I.2.7

  18. arXiv:2305.17416  [pdf, other]

    cs.CL

    A Practical Toolkit for Multilingual Question and Answer Generation

    Authors: Asahi Ushio, Fernando Alva-Manchego, Jose Camacho-Collados

    Abstract: Generating questions along with associated answers from a text has applications in several domains, such as creating reading comprehension tests for students, or improving document search by providing auxiliary questions and answers based on the query. Training models for question and answer generation (QAG) is not straightforward due to the expected structured output (i.e. a list of question and…

    Submitted 27 May, 2023; originally announced May 2023.

    Comments: Accepted by ACL 2023 System Demonstration

  19. arXiv:2305.17002  [pdf, other]

    cs.CL

    An Empirical Comparison of LM-based Question and Answer Generation Methods

    Authors: Asahi Ushio, Fernando Alva-Manchego, Jose Camacho-Collados

    Abstract: Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context (e.g. a paragraph). This task has a variety of applications, such as data augmentation for question answering (QA) models, information retrieval and education. In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) f…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by ACL 2023 Findings

  20. arXiv:2305.15020  [pdf, other]

    cs.CL

    An Efficient Multilingual Language Model Compression through Vocabulary Trimming

    Authors: Asahi Ushio, Yi Zhou, Jose Camacho-Collados

    Abstract: Multilingual language models (LMs) have become a powerful tool in NLP, especially for non-English languages. Nevertheless, model parameters of multilingual LMs remain large due to the larger embedding matrix of the vocabulary covering tokens in different languages. On the contrary, monolingual LMs can be trained in a target language with the language-specific vocabulary only, but this requires a larg…

    Submitted 19 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023 findings

  21. arXiv:2210.03992  [pdf, other]

    cs.CL

    Generative Language Models for Paragraph-Level Question Generation

    Authors: Asahi Ushio, Fernando Alva-Manchego, Jose Camacho-Collados

    Abstract: Powerful generative models have led to recent progress in question generation (QG). However, it is difficult to measure advances in QG research since there are no standardized resources that allow a uniform comparison among approaches. In this paper, we introduce QG-Bench, a multilingual and multidomain benchmark for QG that unifies existing question answering datasets by converting them to a stan…

    Submitted 2 January, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

    Comments: EMNLP 2022 main conference

  22. arXiv:2210.03797  [pdf, other]

    cs.CL

    Named Entity Recognition in Twitter: A Dataset and Analysis on Short-Term Temporal Shifts

    Authors: Asahi Ushio, Leonardo Neves, Vitor Silva, Francesco Barbieri, Jose Camacho-Collados

    Abstract: Recent progress in language model pre-training has led to important improvements in Named Entity Recognition (NER). Nonetheless, this progress has been mainly tested in well-formatted documents such as news, Wikipedia, or scientific articles. In social media the landscape is different, adding another layer of complexity due to its noisy and dynamic nature. In this paper, we focus on NER…

    Submitted 15 November, 2022; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: AACL 2022 main conference

  23. T-NER: An All-Round Python Library for Transformer-based Named Entity Recognition

    Authors: Asahi Ushio, Jose Camacho-Collados

    Abstract: Language model (LM) pretraining has led to consistent improvements in many NLP downstream tasks, including named entity recognition (NER). In this paper, we present T-NER (Transformer-based Named Entity Recognition), a Python library for NER LM finetuning. In addition to its practical utility, T-NER facilitates the study and investigation of the cross-domain and cross-lingual generalization abilit…

    Submitted 9 September, 2022; originally announced September 2022.

    Comments: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021): System Demonstrations

  24. arXiv:2209.09824  [pdf, other]

    cs.CL

    Twitter Topic Classification

    Authors: Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Leonardo Neves, Vítor Silva, Francesco Barbieri

    Abstract: Social media platforms host discussions about a wide variety of topics that arise every day. Making sense of all the content and organising it into categories is an arduous task. A common way to deal with this issue is relying on topic modeling, but topics discovered using this technique are difficult to interpret and can differ from corpus to corpus. In this paper, we present a new task based on t…

    Submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted at COLING 2022

  25. arXiv:2209.07216  [pdf, other]

    cs.CL

    TempoWiC: An Evaluation Benchmark for Detecting Meaning Shift in Social Media

    Authors: Daniel Loureiro, Aminette D'Souza, Areej Nasser Muhajab, Isabella A. White, Gabriel Wong, Luis Espinosa Anke, Leonardo Neves, Francesco Barbieri, Jose Camacho-Collados

    Abstract: Language evolves over time, and word meaning changes accordingly. This is especially true in social media, since its dynamic nature leads to faster semantic shifts, making it challenging for NLP models to deal with new content and trends. However, the number of datasets and models that specifically address the dynamic nature of these social platforms is scarce. To bridge this gap, we present Tempo…

    Submitted 16 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

    Comments: Accepted to COLING 2022. Used to create the TempoWiC Shared Task for EvoNLP

  26. arXiv:2206.14774  [pdf, other]

    cs.CL

    TweetNLP: Cutting-Edge Natural Language Processing for Social Media

    Authors: Jose Camacho-Collados, Kiamehr Rezaee, Talayeh Riahi, Asahi Ushio, Daniel Loureiro, Dimosthenis Antypas, Joanne Boisson, Luis Espinosa-Anke, Fangyu Liu, Eugenio Martínez-Cámara, Gonzalo Medina, Thomas Buhrmann, Leonardo Neves, Francesco Barbieri

    Abstract: In this paper we present TweetNLP, an integrated platform for Natural Language Processing (NLP) in social media. TweetNLP supports a diverse set of NLP tasks, including generic focus areas such as sentiment analysis and named entity recognition, as well as social media-specific tasks such as emoji prediction and offensive language identification. Task-specific systems are powered by reasonably-siz…

    Submitted 25 October, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: EMNLP 2022 Demo paper. TweetNLP: https://tweetnlp.org/

  27. arXiv:2205.07603  [pdf, other]

    cs.CL cs.AI

    Assessing the Limits of the Distributional Hypothesis in Semantic Spaces: Trait-based Relational Knowledge and the Impact of Co-occurrences

    Authors: Mark Anderson, Jose Camacho-Collados

    Abstract: The increase in performance in NLP due to the prevalence of distributional models and deep learning has brought with it a reciprocal decrease in interpretability. This has spurred a focus on what neural networks learn about natural language with less of a focus on how. Some work has focused on the data used to develop data-driven models, but typically this line of work aims to highlight issues wit…

    Submitted 16 May, 2022; originally announced May 2022.

    Comments: Due to appear in the proceedings of *SEM 2022: The 11th Joint Conference on Lexical and Computational Semantics

  28. arXiv:2202.03829  [pdf, other]

    cs.CL cs.AI

    TimeLMs: Diachronic Language Models from Twitter

    Authors: Daniel Loureiro, Francesco Barbieri, Leonardo Neves, Luis Espinosa Anke, Jose Camacho-Collados

    Abstract: Despite its importance, the time variable has been largely neglected in the NLP and language model literature. In this paper, we present TimeLMs, a set of language models specialized on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models' capacity to deal with future and out-of-distribution tweets, while making them competitive…

    Submitted 1 April, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: Accepted to ACL 2022 (Demo Track) - https://github.com/cardiffnlp/timelms

  29. Negativity Spreads Faster: A Large-Scale Multilingual Twitter Analysis on the Role of Sentiment in Political Communication

    Authors: Dimosthenis Antypas, Alun Preece, Jose Camacho-Collados

    Abstract: Social media has become extremely influential when it comes to policy making in modern societies, especially in the western world, where platforms such as Twitter allow users to follow politicians, thus making citizens more involved in political discussion. In the same vein, politicians use Twitter to express their opinions, debate among others on current topics and promote their political agendas…

    Submitted 3 April, 2023; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: Accepted at "Online Social Networks and Media, Volume 33"; for code and data used see https://github.com/cardiffnlp/politics-and-virality-twitter

    ACM Class: I.2.7

    Journal ref: Online Social Networks and Media, 2023, Volume 33

  30. arXiv:2111.09064  [pdf, other]

    cs.CL

    Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification

    Authors: Aleksandra Edwards, Asahi Ushio, Jose Camacho-Collados, Hélène de Ribaupierre, Alun Preece

    Abstract: Data augmentation techniques are widely used for enhancing the performance of machine learning models by tackling class imbalance issues and data sparsity. State-of-the-art generative language models have been shown to provide significant gains across different NLP tasks. However, their applicability to data augmentation for text classification tasks in few-shot settings has not been fully explor…

    Submitted 9 January, 2023; v1 submitted 17 November, 2021; originally announced November 2021.

    Comments: Paper has been accepted and presented at DASH workshop, EMNLP 2022 conference

  31. Distilling Relation Embeddings from Pre-trained Language Models

    Authors: Asahi Ushio, Jose Camacho-Collados, Steven Schockaert

    Abstract: Pre-trained language models have been found to capture a surprisingly rich amount of lexical knowledge, ranging from commonsense properties of everyday concepts to detailed factual knowledge about named entities. Among others, this makes it possible to distill high-quality word vectors from pre-trained language models. However, it is currently unclear to what extent it is possible to distill relat…

    Submitted 21 September, 2021; originally announced October 2021.

    Comments: EMNLP 2021 main conference

  32. arXiv:2108.03067  [pdf, other]

    cs.CL cs.LG

    Deriving Disinformation Insights from Geolocalized Twitter Callouts

    Authors: David Tuxworth, Dimosthenis Antypas, Luis Espinosa-Anke, Jose Camacho-Collados, Alun Preece, David Rogers

    Abstract: This paper demonstrates a two-stage method for deriving insights from social media data relating to disinformation by applying a combination of geospatial classification and embedding-based language modelling across multiple languages. In particular, the analysis is centered on Twitter and disinformation for three European languages: English, French and Spanish. Firstly, Twitter data is classified…

    Submitted 6 August, 2021; originally announced August 2021.

    Comments: Accepted for presentation at KDD 2021 - Workshop On Deriving Insights From User-Generated Text

  33. LMMS Reloaded: Transformer-based Sense Embeddings for Disambiguation and Beyond

    Authors: Daniel Loureiro, Alípio Mário Jorge, Jose Camacho-Collados

    Abstract: Distributional semantics based on neural approaches is a cornerstone of Natural Language Processing, with surprising connections to human meaning representation as well. Recent Transformer-based Language Models have proven capable of producing contextual word representations that reliably convey sense-specific information, simply as a product of self-supervision. Prior work has shown that these co…

    Submitted 1 April, 2022; v1 submitted 26 May, 2021; originally announced May 2021.

    Comments: Accepted to Artificial Intelligence Journal (AIJ)

    Journal ref: Artificial Intelligence Volume 305, April 2022, 103661

  34. arXiv:2105.04949  [pdf, other]

    cs.CL cs.LG

    BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?

    Authors: Asahi Ushio, Luis Espinosa-Anke, Steven Schockaert, Jose Camacho-Collados

    Abstract: Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as "eye is to seeing what ear is to hearing", sometimes referred to as analogical proportions, shapes how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we an…

    Submitted 9 September, 2022; v1 submitted 11 May, 2021; originally announced May 2021.

    Comments: Accepted by ACL 2021 main conference

  35. arXiv:2104.12250  [pdf, other]

    cs.CL

    XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond

    Authors: Francesco Barbieri, Luis Espinosa Anke, Jose Camacho-Collados

    Abstract: Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingua…

    Submitted 11 May, 2022; v1 submitted 25 April, 2021; originally announced April 2021.

    Comments: LREC 2022. Code and data available at https://github.com/cardiffnlp/xlm-t

  36. Back to the Basics: A Quantitative Analysis of Statistical and Graph-Based Term Weighting Schemes for Keyword Extraction

    Authors: Asahi Ushio, Federico Liberatore, Jose Camacho-Collados

    Abstract: Term weighting schemes are widely used in Natural Language Processing and Information Retrieval. In particular, term weighting is the basis for keyword extraction. However, there are relatively few evaluation studies that shed light on the strengths and shortcomings of each weighting scheme. In fact, in most cases researchers and practitioners resort to the well-known tf-idf as default, despite…

    Submitted 13 September, 2021; v1 submitted 16 April, 2021; originally announced April 2021.

    Comments: Accepted by EMNLP 2021 main conference

    Report number: 2021.emnlp-main.638

  37. arXiv:2010.14584  [pdf, other]

    cs.CL

    Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports

    Authors: Aleksandra Edwards, David Rogers, Jose Camacho-Collados, Hélène de Ribaupierre, Alun Preece

    Abstract: The task of text and sentence classification is associated with the need for large amounts of labelled training data. The acquisition of high volumes of labelled datasets can be expensive or unfeasible, especially for highly-specialised domains for which documents are hard to obtain. Research on the application of supervised classification based on small amounts of training data is limited. In thi…

    Submitted 4 June, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: 10 pages, 5 figures, workshop

  38. arXiv:2010.12421  [pdf, other]

    cs.CL cs.SI

    TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification

    Authors: Francesco Barbieri, Jose Camacho-Collados, Leonardo Neves, Luis Espinosa-Anke

    Abstract: The experimental landscape in natural language processing for social media is too fragmented. Each year, new shared tasks and datasets are proposed, ranging from classics like sentiment analysis to irony detection or emoji prediction. Therefore, it is unclear what the current state of the art is, as there is no standardized evaluation protocol, nor a strong set of baselines trained on such dom…

    Submitted 26 October, 2020; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: Findings of EMNLP 2020. TweetEval benchmark available at https://github.com/cardiffnlp/tweeteval

  39. arXiv:2010.06478  [pdf, other]

    cs.CL

    XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization

    Authors: Alessandro Raganato, Tommaso Pasini, Jose Camacho-Collados, Mohammad Taher Pilehvar

    Abstract: The ability to correctly model distinct meanings of a word is crucial for the effectiveness of semantic representation techniques. However, most existing evaluation benchmarks for assessing this criterion are tied to sense inventories (usually WordNet), restricting their usage to a small subset of knowledge-based representation techniques. The Word-in-Context dataset (WiC) addresses the dependence…

    Submitted 13 October, 2020; originally announced October 2020.

    Comments: EMNLP2020

  40. arXiv:2008.11608  [pdf, other]

    cs.CL

    Analysis and Evaluation of Language Models for Word Sense Disambiguation

    Authors: Daniel Loureiro, Kiamehr Rezaee, Mohammad Taher Pilehvar, Jose Camacho-Collados

    Abstract: Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability in capturing context-sensitive semantic nuances. However, there is still little knowledge about their capabilities and potential limitations in encoding and recovering wor…

    Submitted 17 March, 2021; v1 submitted 26 August, 2020; originally announced August 2020.

    Comments: 55 pages, accepted to Computational Linguistics

  41. arXiv:2004.15016  [pdf, ps, other]

    cs.CL

    WiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context

    Authors: Anna Breit, Artem Revenko, Kiamehr Rezaee, Mohammad Taher Pilehvar, Jose Camacho-Collados

    Abstract: We present WiC-TSV, a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, we introduce a framework for Target Sense Verification of Words in Context which grounds its uniqueness in the formulation as a binary classification task thus being independent of external sense inventories, and the coverage of various domains. This makes the dataset highly flexible for t…

    Submitted 27 January, 2021; v1 submitted 30 April, 2020; originally announced April 2020.

    Comments: Accepted to EACL 2021. Reference paper of the SemDeep WiC-TSV challenge: https://competitions.codalab.org/competitions/23683

  42. arXiv:2004.14325  [pdf, other]

    cs.CL

    Don't Neglect the Obvious: On the Role of Unambiguous Words in Word Sense Disambiguation

    Authors: Daniel Loureiro, Jose Camacho-Collados

    Abstract: State-of-the-art methods for Word Sense Disambiguation (WSD) combine two different features: the power of pre-trained language models and a propagation method to extend the coverage of such models. This propagation is needed as current sense-annotated corpora lack coverage of many instances in the underlying sense inventory (usually WordNet). At the same time, unambiguous words make for a large po…

    Submitted 23 October, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted to EMNLP 2020. Website: http://danlou.github.io/uwa

  43. arXiv:1912.01220  [pdf, other]

    cs.CL cs.AI

    Modelling Semantic Categories using Conceptual Neighborhood

    Authors: Zied Bouraoui, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: While many methods for learning vector space embeddings have been proposed in the field of Natural Language Processing, these methods typically do not distinguish between categories and individuals. Intuitively, if individuals are represented as vectors, we can think of categories as (soft) regions in the embedding space. Unfortunately, meaningful regions can be difficult to estimate, especially s…

    Submitted 3 December, 2019; originally announced December 2019.

    Comments: Accepted to AAAI 2020

  44. arXiv:1911.12753  [pdf, ps, other]

    cs.CL cs.AI

    Inducing Relational Knowledge from BERT

    Authors: Zied Bouraoui, Jose Camacho-Collados, Steven Schockaert

    Abstract: One of the most remarkable properties of word embeddings is the fact that they capture certain types of semantic and syntactic relationships. Recently, pre-trained language models such as BERT have achieved groundbreaking results across a wide range of Natural Language Processing tasks. However, it is unclear to what extent such models capture relational knowledge beyond what is already captured b…

    Submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted to AAAI 2020

  45. arXiv:1910.07221  [pdf, other]

    cs.CL

    Meemi: A Simple Method for Post-processing and Integrating Cross-lingual Word Embeddings

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Word embeddings have become a standard resource in the toolset of any Natural Language Processing practitioner. While monolingual word embeddings encode information about words in the context of a particular language, cross-lingual embeddings define a multilingual space where word embeddings from two or more languages are integrated together. Current state-of-the-art approaches learn these embeddi…

    Submitted 11 November, 2020; v1 submitted 16 October, 2019; originally announced October 2019.

    Comments: 22 pages, 2 figures, 9 tables. Preprint submitted to Natural Language Engineering

    MSC Class: 68T50

  46. arXiv:1908.07742  [pdf, other]

    cs.CL

    On the Robustness of Unsupervised and Semi-supervised Cross-lingual Word Embedding Learning

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Cross-lingual word embeddings are vector representations of words in different languages where words with similar meaning are represented by similar vectors, regardless of the language. Recent developments which construct these embeddings by aligning monolingual spaces have shown that accurate alignments can be obtained with little or no supervision. However, the focus has been on a particular con…

    Submitted 3 March, 2020; v1 submitted 21 August, 2019; originally announced August 2019.

    Comments: 11 pages, 2 figures, 7 tables. Camera-ready submitted to LREC 2020

  47. arXiv:1906.01373  [pdf, other]

    cs.CL

    Relational Word Embeddings

    Authors: Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: While word embeddings have been shown to implicitly encode various forms of attributional knowledge, the extent to which they capture relational information is far more limited. In previous work, this limitation has been addressed by incorporating relational knowledge from external knowledge bases when learning the word embedding. Such strategies may not be optimal, however, as they are limited by…

    Submitted 4 June, 2019; originally announced June 2019.

    Comments: To appear at ACL 2019. 11 pages

  48. arXiv:1905.07358  [pdf, other]

    cs.CL cs.SI

    Learning Cross-lingual Embeddings from Twitter via Distant Supervision

    Authors: Jose Camacho-Collados, Yerai Doval, Eugenio Martínez-Cámara, Luis Espinosa-Anke, Francesco Barbieri, Steven Schockaert

    Abstract: Cross-lingual embeddings represent the meaning of words from different languages in the same vector space. Recent work has shown that it is possible to construct such representations by aligning independently learned monolingual embedding spaces, and that accurate alignments can be obtained even without external bilingual data. In this paper we explore a research direction that has been surprising…

    Submitted 31 March, 2020; v1 submitted 17 May, 2019; originally announced May 2019.

    Comments: Accepted to ICWSM 2020. 11 pages, 1 appendix. Pre-trained embeddings available at https://github.com/pedrada88/crossembeddings-twitter

  49. arXiv:1808.09121  [pdf, ps, other]

    cs.CL

    WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

    Authors: Mohammad Taher Pilehvar, Jose Camacho-Collados

    Abstract: By design, word embeddings are unable to model the dynamic nature of words' semantics, i.e., the property of words to correspond to potentially different meanings. To address this limitation, dozens of specialized meaning representation techniques such as sense or contextualized embeddings have been proposed. However, despite the popularity of research on this topic, very few evaluation benchmarks…

    Submitted 27 April, 2019; v1 submitted 28 August, 2018; originally announced August 2018.

    Comments: NAACL 2019

  50. arXiv:1808.08780  [pdf, other]

    cs.CL

    Improving Cross-Lingual Word Embeddings by Meeting in the Middle

    Authors: Yerai Doval, Jose Camacho-Collados, Luis Espinosa-Anke, Steven Schockaert

    Abstract: Cross-lingual word embeddings are becoming increasingly important in multilingual NLP. Recently, it has been shown that these embeddings can be effectively learned by aligning two disjoint monolingual vector spaces through linear transformations, using no more than a small bilingual dictionary as supervision. In this work, we propose to apply an additional transformation after the initial alignmen…

    Submitted 27 August, 2018; originally announced August 2018.

    Comments: 11 pages, 4 tables, 1 figure. EMNLP 2018 camera-ready