Search | arXiv e-print repository

Grounding Toxicity in Real-World Events across Languages

Authors: Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Social media conversations frequently suffer from toxicity, creating significant issues for users, moderators, and entire communities. Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online. Our study investigates how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions. We gathered… ▽ More Social media conversations frequently suffer from toxicity, creating significant issues for users, moderators, and entire communities. Events in the real world, like elections or conflicts, can initiate and escalate toxic behavior online. Our study investigates how real-world events influence the origin and spread of toxicity in online discussions across various languages and regions. We gathered Reddit data comprising 4.5 million comments from 31 thousand posts in six different languages (Dutch, English, German, Arabic, Turkish and Spanish). We target fifteen major social and political world events that occurred between 2020 and 2023. We observe significant variations in toxicity, negative sentiment, and emotion expressions across different events and language communities, showing that toxicity is a complex phenomenon in which many different factors interact and still need to be investigated. We will release the data for further research along with our code. △ Less

Submitted 22 May, 2024; originally announced May 2024.

Comments: Paper accepted for at The 29th International Conference on Natural Language & Information Systems (NLDB 2024)

arXiv:2404.18810 [pdf, other]

Unknown Script: Impact of Script on Cross-Lingual Transfer

Authors: Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolin… ▽ More Cross-lingual transfer has become an effective way of transferring knowledge between languages. In this paper, we explore an often overlooked aspect in this domain: the influence of the source language of a language model on language transfer performance. We consider a case where the target language and its script are not part of the pre-trained model. We conduct a series of experiments on monolingual and multilingual models that are pre-trained on different tokenization methods to determine factors that affect cross-lingual transfer to a new language with a unique script. Our findings reveal the importance of the tokenizer as a stronger factor than the shared script, language similarity, and model size. △ Less

Submitted 7 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

Comments: Paper accepted to NAACL Student Research Workshop (SRW) 2024

arXiv:2404.18726 [pdf, other]

The Constant in HATE: Analyzing Toxicity in Reddit across Topics and Languages

Authors: Wondimagegnhue Tsegaye Tufa, Ilia Markov, Piek Vossen

Abstract: Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish,Arabic, and Dutch, covering 80 topics such as Culture, Politic… ▽ More Toxic language remains an ongoing challenge on social media platforms, presenting significant issues for users and communities. This paper provides a cross-topic and cross-lingual analysis of toxicity in Reddit conversations. We collect 1.5 million comment threads from 481 communities in six languages: English, German, Spanish, Turkish,Arabic, and Dutch, covering 80 topics such as Culture, Politics, and News. We thoroughly analyze how toxicity spikes within different communities in relation to specific topics. We observe consistent patterns of increased toxicity across languages for certain topics, while also noting significant variations within specific language communities. △ Less

Submitted 29 April, 2024; originally announced April 2024.

Comments: Accepted to TRAC 2024

arXiv:2306.09642 [pdf, ps, other]

Cross-Domain Toxic Spans Detection

Authors: Stefan F. Schouten, Baran Barbarestani, Wondimagegnhue Tufa, Piek Vossen, Ilia Markov

Abstract: Given the dynamic nature of toxic language use, automated methods for detecting toxic spans are likely to encounter distributional shift. To explore this phenomenon, we evaluate three approaches for detecting toxic spans under cross-domain conditions: lexicon-based, rationale extraction, and fine-tuned language models. Our findings indicate that a simple method using off-the-shelf lexicons perform… ▽ More Given the dynamic nature of toxic language use, automated methods for detecting toxic spans are likely to encounter distributional shift. To explore this phenomenon, we evaluate three approaches for detecting toxic spans under cross-domain conditions: lexicon-based, rationale extraction, and fine-tuned language models. Our findings indicate that a simple method using off-the-shelf lexicons performs best in the cross-domain setup. The cross-domain error analysis suggests that (1) rationale extraction methods are prone to false negatives, while (2) language models, despite performing best for the in-domain case, recall fewer explicitly toxic words than lexicons and are prone to certain types of false positives. Our code is publicly available at: https://github.com/sfschouten/toxic-cross-domain. △ Less

Submitted 16 June, 2023; originally announced June 2023.

Comments: NLDB 2023

Showing 1–4 of 4 results for author: Tufa, W