-
Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism
Authors:
Naba Rizvi,
Harper Strickland,
Saleha Ahmedi,
Aekta Kallepalli,
Isha Khirwadkar,
William Wu,
Imani N. S. Munyaka,
Nedjma Ousidhoum
Abstract:
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ablei…
▽ More
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.
△ Less
Submitted 26 May, 2025;
originally announced May 2025.
-
Social Good or Scientific Curiosity? Uncovering the Research Framing Behind NLP Artefacts
Authors:
Eric Chamoun,
Nedjma Ousidhoum,
Michael Schlichtkrull,
Andreas Vlachos
Abstract:
Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that in…
▽ More
Clarifying the research framing of NLP artefacts (e.g., models, datasets, etc.) is crucial to aligning research with practical applications. Recent studies manually analyzed NLP research across domains, showing that few papers explicitly identify key stakeholders, intended uses, or appropriate contexts. In this work, we propose to automate this analysis, developing a three-component system that infers research framings by first extracting key elements (means, ends, stakeholders), then linking them through interpretable rules and contextual reasoning. We evaluate our approach on two domains: automated fact-checking using an existing dataset, and hate speech detection for which we annotate a new dataset-achieving consistent improvements over strong LLM baselines. Finally, we apply our system to recent automated fact-checking papers and uncover three notable trends: a rise in vague or underspecified research goals, increased emphasis on scientific exploration over application, and a shift toward supporting human fact-checkers rather than pursuing full automation.
△ Less
Submitted 24 May, 2025;
originally announced May 2025.
-
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Authors:
Shamsuddeen Hassan Muhammad,
Nedjma Ousidhoum,
Idris Abdulmumin,
Seid Muhie Yimam,
Jan Philip Wahle,
Terry Ruas,
Meriem Beloucif,
Christine De Kock,
Tadesse Destaw Belay,
Ibrahim Said Ahmad,
Nirmal Surange,
Daniela Teodorescu,
David Ifeoluwa Adelani,
Alham Fikri Aji,
Felermino Ali,
Vladimir Araujo,
Abinew Ali Ayele,
Oana Ignat,
Alexander Panchenko,
Yi Zhou,
Saif M. Mohammad
Abstract:
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and are spoken across various continents. The data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels…
▽ More
We present our shared task on text-based emotion detection, covering more than 30 languages from seven distinct language families. These languages are predominantly low-resource and are spoken across various continents. The data instances are multi-labeled with six emotional classes, with additional datasets in 11 languages annotated for emotion intensity. Participants were asked to predict labels in three tracks: (a) multilabel emotion detection, (b) emotion intensity score detection, and (c) cross-lingual emotion detection.
The task attracted over 700 participants. We received final submissions from more than 200 teams and 93 system description papers. We report baseline results, along with findings on the best-performing systems, the most common approaches, and the most effective methods across different tracks and languages. The datasets for this task are publicly available. The dataset is available at SemEval2025 Task 11 https://brighter-dataset.github.io
△ Less
Submitted 24 April, 2025; v1 submitted 10 March, 2025;
originally announced March 2025.
-
BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages
Authors:
Shamsuddeen Hassan Muhammad,
Nedjma Ousidhoum,
Idris Abdulmumin,
Jan Philip Wahle,
Terry Ruas,
Meriem Beloucif,
Christine de Kock,
Nirmal Surange,
Daniela Teodorescu,
Ibrahim Said Ahmad,
David Ifeoluwa Adelani,
Alham Fikri Aji,
Felermino D. M. A. Ali,
Ilseyar Alimova,
Vladimir Araujo,
Nikolay Babakov,
Naomi Baes,
Ana-Maria Bucur,
Andiswa Bukula,
Guanqun Cao,
Rodrigo Tufino Cardenas,
Rendi Chevi,
Chiamaka Ijeoma Chukwuneke,
Alexandra Ciobotaru,
Daryna Dementieva
, et al. (23 additional authors not shown)
Abstract:
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which oft…
▽ More
People worldwide use language in subtle and complex ways to express emotions. Although emotion recognition--an umbrella term for several NLP tasks--impacts various applications within NLP and beyond, most work in this area has focused on high-resource languages. This has led to significant disparities in research efforts and proposed solutions, particularly for under-resourced languages, which often lack high-quality annotated datasets. In this paper, we present BRIGHTER--a collection of multi-labeled, emotion-annotated datasets in 28 different languages and across several domains. BRIGHTER primarily covers low-resource languages from Africa, Asia, Eastern Europe, and Latin America, with instances labeled by fluent speakers. We highlight the challenges related to the data collection and annotation processes, and then report experimental results for monolingual and crosslingual multi-label emotion identification, as well as emotion intensity recognition. We analyse the variability in performance across languages and text domains, both with and without the use of LLMs, and show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
△ Less
Submitted 29 May, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
Authors:
Shamsuddeen Hassan Muhammad,
Idris Abdulmumin,
Abinew Ali Ayele,
David Ifeoluwa Adelani,
Ibrahim Said Ahmad,
Saminu Mohammad Aliyu,
Nelson Odhiambo Onyango,
Lilian D. A. Wanzare,
Samuel Rutunda,
Lukman Jibril Aliyu,
Esubalew Alemneh,
Oumaima Hourrane,
Hagos Tesfahun Gebremichael,
Elyas Abdi Ismail,
Meriem Beloucif,
Ebrahim Chekol Jibril,
Andiswa Bukula,
Rooweither Mabuya,
Salomey Osei,
Abigail Oppong,
Tadesse Destaw Belay,
Tadesse Kebede Guge,
Tesfa Tegegne Asfaw,
Chiamaka Ijeoma Chukwuneke,
Paul Röttger
, et al. (2 additional authors not shown)
Abstract:
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at…
▽ More
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate
△ Less
Submitted 15 January, 2025; v1 submitted 14 January, 2025;
originally announced January 2025.
-
AUTALIC: A Dataset for Anti-AUTistic Ableist Language In Context
Authors:
Naba Rizvi,
Harper Strickland,
Daniel Gitelman,
Tristan Cooper,
Alexis Morales-Flores,
Michael Golden,
Aekta Kallepalli,
Akshat Alurkar,
Haaset Owens,
Saleha Ahmedi,
Isha Khirwadkar,
Imani Munyaka,
Nedjma Ousidhoum
Abstract:
As our understanding of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present A…
▽ More
As our understanding of autism and ableism continues to increase, so does our understanding of ableist language towards autistic people. Such language poses a significant challenge in NLP research due to its subtle and context-dependent nature. Yet, detecting anti-autistic ableist language remains underexplored, with existing NLP tools often failing to capture its nuanced expressions. We present AUTALIC, the first benchmark dataset dedicated to the detection of anti-autistic ableist language in context, addressing a significant gap in the field. The dataset comprises 2,400 autism-related sentences collected from Reddit, accompanied by surrounding context, and is annotated by trained experts with backgrounds in neurodiversity. Our comprehensive evaluation reveals that current language models, including state-of-the-art LLMs, struggle to reliably identify anti-autistic ableism and align with human judgments, underscoring their limitations in this domain. We publicly release AUTALIC along with the individual annotations which serve as a valuable resource to researchers working on ableism, neurodiversity, and also studying disagreements in annotation tasks. This dataset serves as a crucial step towards developing more inclusive and context-aware NLP systems that better reflect diverse perspectives.
△ Less
Submitted 8 April, 2025; v1 submitted 21 October, 2024;
originally announced October 2024.
-
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Authors:
Genta Indra Winata,
Frederikus Hudi,
Patrick Amadeus Irawan,
David Anugraha,
Rifki Afina Putri,
Yutong Wang,
Adam Nohejl,
Ubaidillah Ariq Prathama,
Nedjma Ousidhoum,
Afifa Amriani,
Anar Rzayev,
Anirban Das,
Ashmari Pramodya,
Aulia Adila,
Bryan Wilie,
Candy Olivia Mawalim,
Ching Lam Cheng,
Daud Abolade,
Emmanuele Chersoni,
Enrico Santus,
Fariz Ikhwantri,
Garry Kuwanto,
Hanyang Zhao,
Haryo Akbarianto Wibowo,
Holy Lovenia
, et al. (26 additional authors not shown)
Abstract:
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering…
▽ More
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
△ Less
Submitted 8 May, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
Building Better: Avoiding Pitfalls in Developing Language Resources when Data is Scarce
Authors:
Nedjma Ousidhoum,
Meriem Beloucif,
Saif M. Mohammad
Abstract:
Language is a form of symbolic capital that affects people's lives in many ways (Bourdieu1977,1991). As a powerful means of communication, it reflects identities, cultures, traditions, and societies more broadly. Therefore, data in a given language should be regarded as more than just a collection of tokens. Rigorous data collection and labeling practices are essential for developing more human-ce…
▽ More
Language is a form of symbolic capital that affects people's lives in many ways (Bourdieu1977,1991). As a powerful means of communication, it reflects identities, cultures, traditions, and societies more broadly. Therefore, data in a given language should be regarded as more than just a collection of tokens. Rigorous data collection and labeling practices are essential for developing more human-centered and socially aware technologies. Although there has been growing interest in under-resourced languages within the NLP community, work in this area faces unique challenges, such as data scarcity and limited access to qualified annotators.
In this paper, we collect feedback from individuals directly involved in and impacted by NLP artefacts for medium- and low-resource languages. We conduct both quantitative and qualitative analyses of their responses and highlight key issues related to: (1) data quality, including linguistic and cultural appropriateness; and (2) the ethics of common annotation practices, such as the misuse of participatory research. Based on these findings, we make several recommendations for creating high-quality language artefacts that reflect the cultural milieu of their speakers, while also respecting the dignity and labor of data workers.
△ Less
Submitted 30 May, 2025; v1 submitted 16 October, 2024;
originally announced October 2024.
-
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
Authors:
Junho Myung,
Nayeon Lee,
Yi Zhou,
Jiho Jin,
Rifki Afina Putri,
Dimosthenis Antypas,
Hsuvas Borkakoty,
Eunsu Kim,
Carla Perez-Almendros,
Abinew Ali Ayele,
Víctor Gutiérrez-Basulto,
Yazmín Ibáñez-García,
Hwaran Lee,
Shamsuddeen Hassan Muhammad,
Kiwoong Park,
Anar Sabuhi Rzayev,
Nina White,
Seid Muhie Yimam,
Mohammad Taher Pilehvar,
Nedjma Ousidhoum,
Jose Camacho-Collados,
Alice Oh
Abstract:
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food…
▽ More
Large language models (LLMs) often lack culture-specific knowledge of daily life, especially across diverse regions and non-English languages. Existing benchmarks for evaluating LLMs' cultural sensitivities are limited to a single language or collected from online sources such as Wikipedia, which do not reflect the mundane everyday lifestyles of diverse regions. That is, information about the food people eat for their birthday celebrations, spices they typically use, musical instruments youngsters play, or the sports they practice in school is common cultural knowledge but uncommon in easily collected online sources, especially for underrepresented cultures. To address this issue, we introduce BLEnD, a hand-crafted benchmark designed to evaluate LLMs' everyday knowledge across diverse cultures and languages. BLEnD comprises 52.6k question-answer pairs from 16 countries/regions, in 13 different languages, including low-resource ones such as Amharic, Assamese, Azerbaijani, Hausa, and Sundanese. We construct the benchmark to include two formats of questions: short-answer and multiple-choice. We show that LLMs perform better for cultures that are highly represented online, with a maximum 57.34% difference in GPT-4, the best-performing model, in the short-answer format. For cultures represented by mid-to-high-resource languages, LLMs perform better in their local languages, but for cultures represented by low-resource languages, LLMs perform better in English than the local languages. We make our dataset publicly available at: https://github.com/nlee0212/BLEnD.
△ Less
Submitted 15 January, 2025; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Words as Trigger Points in Social Media Discussions: A Large-Scale Case Study about UK Politics on Reddit
Authors:
Dimosthenis Antypas,
Christian Arnold,
Jose Camacho-Collados,
Nedjma Ousidhoum,
Carla Perez Almendros
Abstract:
Political debates on social media sometimes flare up. From that moment on, users engage much more with one another; their communication is also more emotional and polarised. While it has been difficult to grasp such moments with computational methods, we suggest that trigger points are a useful concept to understand and ultimately model such behaviour. Established in qualitative focus group interv…
▽ More
Political debates on social media sometimes flare up. From that moment on, users engage much more with one another; their communication is also more emotional and polarised. While it has been difficult to grasp such moments with computational methods, we suggest that trigger points are a useful concept to understand and ultimately model such behaviour. Established in qualitative focus group interviews to understand political polarisation (Mau, Lux, and Westheuser 2023), trigger points represent moments when individuals feel that their understanding of what is fair, normal, or appropriate in society is questioned. In the original studies, individuals show strong and negative emotional responses when certain triggering words or topics are mentioned. Our paper finds that these trigger points also exist in online debates. We examine online deliberations on Reddit between 2020 and 2022 and collect >100 million comments from subreddits related to a set of words identified as trigger points in UK politics. Analysing the comments, we find that trigger words increase user engagement and animosity, i.e., more negativity, hate speech, and controversial comments. Introducing trigger points to computational studies of online communication, our findings are relevant to researchers interested in affective computing, online deliberation, and how citizens debate politics and society in light of affective polarisation.
△ Less
Submitted 24 June, 2025; v1 submitted 16 May, 2024;
originally announced May 2024.
-
SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages
Authors:
Nedjma Ousidhoum,
Shamsuddeen Hassan Muhammad,
Mohamed Abdalla,
Idris Abdulmumin,
Ibrahim Said Ahmad,
Sanchit Ahuja,
Alham Fikri Aji,
Vladimir Araujo,
Meriem Beloucif,
Christine De Kock,
Oumaima Hourrane,
Manish Shrivastava,
Thamar Solorio,
Nirmal Surange,
Krishnapriya Vishnubhotla,
Seid Muhie Yimam,
Saif M. Mohammad
Abstract:
We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. The…
▽ More
We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.
△ Less
Submitted 17 April, 2024; v1 submitted 27 March, 2024;
originally announced March 2024.
-
SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Authors:
Nedjma Ousidhoum,
Shamsuddeen Hassan Muhammad,
Mohamed Abdalla,
Idris Abdulmumin,
Ibrahim Said Ahmad,
Sanchit Ahuja,
Alham Fikri Aji,
Vladimir Araujo,
Abinew Ali Ayele,
Pavan Baswani,
Meriem Beloucif,
Chris Biemann,
Sofia Bourhim,
Christine De Kock,
Genet Shanko Dekebo,
Oumaima Hourrane,
Gopichand Kanumolu,
Lokesh Madasu,
Samuel Rutunda,
Manish Shrivastava,
Thamar Solorio,
Nirmal Surange,
Hailegnaw Getaneh Tilaye,
Krishnapriya Vishnubhotla,
Genta Winata
, et al. (2 additional authors not shown)
Abstract:
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present \textit{SemRel}, a new semantic relatedness dat…
▽ More
Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present \textit{SemRel}, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: \textit{Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish,} and \textit{Telugu}. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia -- regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.
△ Less
Submitted 31 May, 2024; v1 submitted 13 February, 2024;
originally announced February 2024.
-
The Intended Uses of Automated Fact-Checking Artefacts: Why, How and Who
Authors:
Michael Schlichtkrull,
Nedjma Ousidhoum,
Andreas Vlachos
Abstract:
Automated fact-checking is often presented as an epistemic tool that fact-checkers, social media consumers, and other stakeholders can use to fight misinformation. Nevertheless, few papers thoroughly discuss how. We document this by analysing 100 highly-cited papers, and annotating epistemic elements related to intended use, i.e., means, ends, and stakeholders. We find that narratives leaving out…
▽ More
Automated fact-checking is often presented as an epistemic tool that fact-checkers, social media consumers, and other stakeholders can use to fight misinformation. Nevertheless, few papers thoroughly discuss how. We document this by analysing 100 highly-cited papers, and annotating epistemic elements related to intended use, i.e., means, ends, and stakeholders. We find that narratives leaving out some of these aspects are common, that many papers propose inconsistent means and ends, and that the feasibility of suggested strategies rarely has empirical backing. We argue that this vagueness actively hinders the technology from reaching its goals, as it encourages overclaiming, limits criticism, and prevents stakeholder feedback. Accordingly, we provide several recommendations for thinking and writing about the use of fact-checking artefacts.
△ Less
Submitted 8 November, 2023; v1 submitted 27 April, 2023;
originally announced April 2023.
-
SemEval-2023 Task 12: Sentiment Analysis for African Languages (AfriSenti-SemEval)
Authors:
Shamsuddeen Hassan Muhammad,
Idris Abdulmumin,
Seid Muhie Yimam,
David Ifeoluwa Adelani,
Ibrahim Sa'id Ahmad,
Nedjma Ousidhoum,
Abinew Ayele,
Saif M. Mohammad,
Meriem Beloucif,
Sebastian Ruder
Abstract:
We present the first Africentric SemEval Shared task, Sentiment Analysis for African Languages (AfriSenti-SemEval) - The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023. AfriSenti-SemEval is a sentiment classification challenge in 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oro…
▽ More
We present the first Africentric SemEval Shared task, Sentiment Analysis for African Languages (AfriSenti-SemEval) - The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023. AfriSenti-SemEval is a sentiment classification challenge in 14 African languages: Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá (Muhammad et al., 2023), using data labeled with 3 sentiment classes. We present three subtasks: (1) Task A: monolingual classification, which received 44 submissions; (2) Task B: multilingual classification, which received 32 submissions; and (3) Task C: zero-shot classification, which received 34 submissions. The best performance for tasks A and B was achieved by NLNDE team with 71.31 and 75.06 weighted F1, respectively. UCAS-IIE-NLP achieved the best average score for task C with 58.15 weighted F1. We describe the various approaches adopted by the top 10 systems and their approaches.
△ Less
Submitted 1 May, 2023; v1 submitted 13 April, 2023;
originally announced April 2023.
-
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Authors:
Shamsuddeen Hassan Muhammad,
Idris Abdulmumin,
Abinew Ali Ayele,
Nedjma Ousidhoum,
David Ifeoluwa Adelani,
Seid Muhie Yimam,
Ibrahim Sa'id Ahmad,
Meriem Beloucif,
Saif M. Mohammad,
Sebastian Ruder,
Oumaima Hourrane,
Pavel Brazdil,
Felermino Dário Mário António Ali,
Davis David,
Salomey Osei,
Bello Shehu Bello,
Falalu Ibrahim,
Tajuddeen Gwadabe,
Samuel Rutunda,
Tadesse Belay,
Wendimu Baye Messelle,
Hailu Beshada Balcha,
Sisay Adugna Chala,
Hagos Tesfahun Gebremichael,
Bernard Opoku
, et al. (1 additional authors not shown)
Abstract:
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti…
▽ More
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task (The AfriSenti Shared Task had over 200 participants. See website at https://afrisenti-semeval.github.io). We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the different datasets and discuss their usefulness.
△ Less
Submitted 4 November, 2023; v1 submitted 17 February, 2023;
originally announced February 2023.
-
Varifocal Question Generation for Fact-checking
Authors:
Nedjma Ousidhoum,
Zhangdie Yuan,
Andreas Vlachos
Abstract:
Fact-checking requires retrieving evidence related to a claim under investigation. The task can be formulated as question generation based on a claim, followed by question answering. However, recent question generation approaches assume that the answer is known and typically contained in a passage given as input, whereas such passages are what is being sought when verifying a claim. In this paper,…
▽ More
Fact-checking requires retrieving evidence related to a claim under investigation. The task can be formulated as question generation based on a claim, followed by question answering. However, recent question generation approaches assume that the answer is known and typically contained in a passage given as input, whereas such passages are what is being sought when verifying a claim. In this paper, we present {\it Varifocal}, a method that generates questions based on different focal points within a given claim, i.e.\ different spans of the claim and its metadata, such as its source and date. Our method outperforms previous work on a fact-checking question generation dataset on a wide range of automatic evaluation metrics. These results are corroborated by our manual evaluation, which indicates that our method generates more relevant and informative questions. We further demonstrate the potential of focal points in generating sets of clarification questions for product descriptions.
△ Less
Submitted 22 October, 2022;
originally announced October 2022.
-
Multilingual and Multi-Aspect Hate Speech Analysis
Authors:
Nedjma Ousidhoum,
Zizheng Lin,
Hongming Zhang,
Yangqiu Song,
Dit-Yan Yeung
Abstract:
Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual multi-aspect hate speech analysis dataset and use it to test the current state-of-the-art multilingual multitask learning approaches. We evaluate our dataset in various classification settings, then we discuss how to leverage our annotatio…
▽ More
Current research on hate speech analysis is typically oriented towards monolingual and single classification tasks. In this paper, we present a new multilingual multi-aspect hate speech analysis dataset and use it to test the current state-of-the-art multilingual multitask learning approaches. We evaluate our dataset in various classification settings, then we discuss how to leverage our annotations in order to improve hate speech detection and classification in general.
△ Less
Submitted 29 August, 2019;
originally announced August 2019.