-
ARC-NLP at Multimodal Hate Speech Event Detection 2023: Multimodal Methods Boosted by Ensemble Learning, Syntactical and Entity Features
Authors:
Umitcan Sahin,
Izzet Emre Kucukkaya,
Oguzhan Ozcelik,
Cagri Toraman
Abstract:
Text-embedded images can serve as a means of spreading hate speech, propaganda, and extremist beliefs. Throughout the Russia-Ukraine war, both opposing factions heavily relied on text-embedded images as a vehicle for spreading propaganda and hate speech. Ensuring the effective detection of hate speech and propaganda is of utmost importance to mitigate the negative effect of hate speech disseminati…
▽ More
Text-embedded images can serve as a means of spreading hate speech, propaganda, and extremist beliefs. Throughout the Russia-Ukraine war, both opposing factions heavily relied on text-embedded images as a vehicle for spreading propaganda and hate speech. Ensuring the effective detection of hate speech and propaganda is of utmost importance to mitigate the negative effect of hate speech dissemination. In this paper, we outline our methodologies for two subtasks of Multimodal Hate Speech Event Detection 2023. For the first subtask, hate speech detection, we utilize multimodal deep learning models boosted by ensemble learning and syntactical text attributes. For the second subtask, target detection, we employ multimodal deep learning models boosted by named entity features. Through experimentation, we demonstrate the superior performance of our models compared to all textual, visual, and text-visual baselines employed in multimodal hate speech detection. Furthermore, our models achieve the first place in both subtasks on the final leaderboard of the shared task.
△ Less
Submitted 25 July, 2023;
originally announced July 2023.
-
Tweets Under the Rubble: Detection of Messages Calling for Help in Earthquake Disaster
Authors:
Cagri Toraman,
Izzet Emre Kucukkaya,
Oguzhan Ozcelik,
Umitcan Sahin
Abstract:
The importance of social media is again exposed in the recent tragedy of the 2023 Turkey and Syria earthquake. Many victims who were trapped under the rubble called for help by posting messages in Twitter. We present an interactive tool to provide situational awareness for missing and trapped people, and disaster relief for rescue and donation efforts. The system (i) collects tweets, (ii) classifi…
▽ More
The importance of social media is again exposed in the recent tragedy of the 2023 Turkey and Syria earthquake. Many victims who were trapped under the rubble called for help by posting messages in Twitter. We present an interactive tool to provide situational awareness for missing and trapped people, and disaster relief for rescue and donation efforts. The system (i) collects tweets, (ii) classifies the ones calling for help, (iii) extracts important entity tags, and (iv) visualizes them in an interactive map screen. Our initial experiments show that the performance in terms of the F1 score is up to 98.30 for tweet classification, and 84.32 for entity extraction. The demonstration, dataset, and other related files can be accessed at https://github.com/avaapm/deprem
△ Less
Submitted 26 February, 2023;
originally announced February 2023.
-
MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection
Authors:
Cagri Toraman,
Oguzhan Ozcelik,
Furkan Şahinuç,
Fazli Can
Abstract:
The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation…
▽ More
The rapid dissemination of misinformation through online social networks poses a pressing issue with harmful consequences jeopardizing human health, public safety, democracy, and the economy; therefore, urgent action is required to address this problem. In this study, we construct a new human-annotated dataset, called MiDe22, having 5,284 English and 5,064 Turkish tweets with their misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.
△ Less
Submitted 11 July, 2024; v1 submitted 11 October, 2022;
originally announced October 2022.
-
Impact of Tokenization on Language Models: An Analysis for Turkish
Authors:
Cagri Toraman,
Eyup Halit Yilmaz,
Furkan Şahinuç,
Oguzhan Ozcelik
Abstract:
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenize…
▽ More
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, where many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, i.e. their outputs vary from smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that Morphological-level tokenizer has challenging performance with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
△ Less
Submitted 19 April, 2022;
originally announced April 2022.