-
Multimodal Misinformation Detection Using Early Fusion of Linguistic, Visual, and Social Features
Authors:
Gautam Kishore Shahi
Abstract:
Amid a tidal wave of misinformation flooding social media during elections and crises, extensive research has been conducted on misinformation detection, primarily focusing on text-based or image-based approaches. However, only a few studies have explored multimodal feature combinations, such as integrating text and images for building a classification model to detect misinformation. This study investigates the effectiveness of different multimodal feature combinations, incorporating text, images, and social features using an early fusion approach for the classification model. This study analyzed 1,529 tweets containing both text and images during the COVID-19 pandemic and election periods collected from Twitter (now X). A data enrichment process was applied to extract additional social features, as well as visual features, through techniques such as object detection and optical character recognition (OCR). The results show that combining unsupervised and supervised machine learning models improves classification performance by 15% compared to unimodal models and by 5% compared to bimodal models. Additionally, the study analyzes the propagation patterns of misinformation based on the characteristics of misinformation tweets and the users who disseminate them.
Submitted 26 June, 2025;
originally announced July 2025.
-
SemCAFE: When Named Entities make the Difference Assessing Web Source Reliability through Entity-level Analytics
Authors:
Gautam Kishore Shahi,
Oshani Seneviratne,
Marc Spaniol
Abstract:
With the shift from traditional to digital media, the online landscape now hosts not only reliable news articles but also a significant amount of unreliable content. Digital media reaches audiences faster, significantly influencing public opinion and advancing political agendas. While newspaper readers may be familiar with their preferred outlets' political leanings or credibility, identifying unreliable news articles is much more challenging. The credibility of many online sources is often opaque, with AI-generated content being easily disseminated at minimal cost. Unreliable news articles, particularly those that followed the Russian invasion of Ukraine in 2022, closely mimic the topics and writing styles of credible sources, making them difficult to distinguish. To address this, we introduce SemCAFE, a system designed to detect news reliability by incorporating entity relatedness into its assessment. SemCAFE employs standard Natural Language Processing techniques, such as boilerplate removal and tokenization, alongside entity-level semantic analysis using the YAGO knowledge base. By creating a semantic fingerprint for each news article, SemCAFE could assess the credibility of 46,020 reliable and 3,407 unreliable articles on the 2022 Russian invasion of Ukraine. Our approach improved the macro F1 score by 12% over state-of-the-art methods. The sample data and code are available on GitHub.
Submitted 3 April, 2025;
originally announced April 2025.
-
A Year of the DSA Transparency Database: What it (Does Not) Reveal About Platform Moderation During the 2024 European Parliament Election
Authors:
Gautam Kishore Shahi,
Benedetta Tessa,
Amaury Trujillo,
Stefano Cresci
Abstract:
Social media platforms face heightened risks during major political events; yet, how platforms adapt their moderation practices in response remains unclear. The Digital Services Act Transparency Database offers an unprecedented opportunity to systematically study content moderation at scale, enabling researchers and policymakers to assess platforms' compliance and effectiveness. Herein, we analyze 1.58 billion self-reported moderation actions taken by eight large social media platforms during an extended period of eight months surrounding the 2024 European Parliament elections. Our findings reveal a lack of adaptation in moderation strategies, as platforms did not exhibit significant changes in their enforcement behaviors surrounding the elections. This raises concerns about whether platforms adapted their moderation practices at all, or if structural limitations of the database concealed possible adjustments. Moreover, we found that noted transparency and accountability issues persist nearly a year after initial concerns were raised. These results highlight the limitations of current self-regulatory approaches and underscore the need for stronger enforcement and data access mechanisms to ensure that online platforms uphold their responsibility in safeguarding democratic processes.
Submitted 9 April, 2025;
originally announced April 2025.
-
Too Little, Too Late: Moderation of Misinformation around the Russo-Ukrainian Conflict
Authors:
Gautam Kishore Shahi,
Yelena Mejova
Abstract:
In this study, we examine the role of Twitter as a first line of defense against misinformation by tracking the public engagement with, and the platform's response to, 500 tweets concerning the Russo-Ukrainian conflict which were identified as misinformation. Using a real-time sample of 543,475 of their retweets, we find that users who geolocate themselves in the U.S. both produce and consume the largest portion of misinformation; however, accounts claiming to be in Ukraine are the second largest source. At the time of writing, 84% of these tweets were still available on the platform, especially those having an anti-Russia narrative. For those that did receive some sanctions, the retweeting rate has already stabilized, pointing to ineffectiveness of the measures to stem their spread. These findings point to the need for a change in the existing anti-misinformation ecosystem. We propose several design and research guidelines for its possible improvement.
Submitted 20 February, 2025;
originally announced February 2025.
-
On the Effectiveness of Large Language Models in Automating Categorization of Scientific Texts
Authors:
Gautam Kishore Shahi,
Oliver Hummel
Abstract:
The rapid advancement of Large Language Models (LLMs) has led to a multitude of application opportunities. One traditional task for Information Retrieval systems is the summarization and classification of texts, both of which are important for supporting humans in navigating large bodies of literature, such as scientific publications. Due to this rapidly growing body of scientific knowledge, recent research has been aiming at building research information systems that not only offer traditional keyword search capabilities, but also novel features such as the automatic detection of research areas that are present at knowledge-intensive organizations in academia and industry. To facilitate this idea, we present the results obtained from evaluating a variety of LLMs in their ability to sort scientific publications into hierarchical classification systems. Using the FORC dataset as ground truth data, we have found that recent LLMs (such as Meta Llama 3.1) are able to reach an accuracy of up to 0.82, which is up to 0.08 better than traditional BERT models.
Submitted 8 February, 2025;
originally announced February 2025.
-
Hate Speech Detection Using Cross-Platform Social Media Data In English and German Language
Authors:
Gautam Kishore Shahi,
Tim A. Majchrzak
Abstract:
Hate speech has grown into a pervasive phenomenon, intensifying during times of crisis, elections, and social unrest. Multiple approaches have been developed to detect hate speech using artificial intelligence, but a generalized model has yet to be achieved. The challenge for hate speech detection as text classification is the cost of obtaining high-quality training data. This study focuses on detecting bilingual hate speech in YouTube comments and measuring the impact of using additional data from other platforms on the performance of the classification model. We examine the value of additional training datasets from other platforms for improving the performance of classification models. We also included factors such as content similarity, definition similarity, and common hate words to measure the impact of datasets on performance. Our findings show that adding more similar datasets based on content similarity, hate words, and definitions improves the performance of classification models. The best performance was obtained by combining datasets from YouTube comments, Twitter, and Gab, with F1-scores of 0.74 and 0.68 for English and German YouTube comments, respectively.
Submitted 2 October, 2024;
originally announced October 2024.
-
Multi-Platform Framing Analysis: A Case Study of Kristiansand Quran Burning
Authors:
Anna-Katharina Jung,
Gautam Kishore Shahi,
Jennifer Fromm,
Kari Anne Røysland,
Kim Henrik Gronert
Abstract:
The framing of events in various media and discourse spaces is crucial in the era of misinformation and polarization. Many studies, however, are limited to specific media or networks, disregarding the importance of cross-platform diffusion. This study overcomes that limitation by conducting a multi-platform framing analysis on Twitter, YouTube, and traditional media analyzing the 2019 Koran burning in Kristiansand, Norway. It examines media and policy frames and uncovers network connections through shared URLs. The findings show that online news emphasizes the incident's legality, while social media focuses on its morality, with harsh hate speech prevalent in YouTube comments. Additionally, YouTube is identified as the most self-contained community, whereas Twitter is the most open to external inputs.
Submitted 17 July, 2024;
originally announced July 2024.
-
Enhancing Research Information Systems with Identification of Domain Experts
Authors:
Gautam Kishore Shahi,
Oliver Hummel
Abstract:
Research organisations and their research outputs have been growing considerably in the past decades. This large body of knowledge attracts various stakeholders, e.g., for knowledge sharing, technology transfer, or potential collaborations. However, due to the large amount of complex knowledge created, traditional methods of manually curating catalogues are often out of date, imprecise, and cumbersome. Finding domain experts and knowledge within any larger organisation, scientific and also industrial, has thus become a serious challenge. Hence, exploring an institution's domain knowledge and finding its experts can only be solved by an automated solution. This work presents the scheme of an automated approach for identifying scholarly experts based on their publications and, prospectively, their teaching materials. Based on a search engine, this approach is currently being implemented for two universities, for which some examples are presented. The proposed system will be helpful for finding peer researchers as well as starting points for knowledge exploitation and technology transfer. As the system is designed in a scalable manner, it can easily include additional institutions and hence provide a broader coverage of research facilities in the future.
Submitted 28 March, 2024;
originally announced April 2024.
-
TweetInfo: An Interactive System to Mitigate Online Harm
Authors:
Gautam Kishore Shahi
Abstract:
The increase in active users on social networking sites (SNSs) has been accompanied by an increase in harmful content on social media sites. Harmful content is described as an inappropriate activity to harm or deceive an individual or a group of users. Alongside existing methods to detect misinformation and hate speech, users still need to be well-informed about the harmfulness of the content on SNSs. This study proposes a user-interactive system, TweetInfo, for mitigating the consumption of harmful content by providing metainformation about the posts. It focuses on two types of harmful content: hate speech and misinformation. TweetInfo provides insights into tweets by doing content analysis. Based on previous research, we have selected a list of metainformation. We offer the option to filter content based on the following metainformation: Bot, Hate Speech, Misinformation, Verified Account, Sentiment, Tweet Category, and Language. The proposed user interface allows customising the user's timeline to mitigate harmful content. This study presents the demo version of the proposed user interface of TweetInfo.
Submitted 3 March, 2024;
originally announced March 2024.
-
FakeClaim: A Multiple Platform-driven Dataset for Identification of Fake News on 2023 Israel-Hamas War
Authors:
Gautam Kishore Shahi,
Amit Kumar Jaiswal,
Thomas Mandl
Abstract:
We contribute the first publicly available dataset of factual claims from different platforms and fake YouTube videos on the 2023 Israel-Hamas war for automatic fake YouTube video classification. The FakeClaim data is collected from 60 fact-checking organizations in 30 languages and enriched with metadata from the fact-checking organizations curated by trained journalists specialized in fact-checking. Further, we classify fake videos within the subset of YouTube videos using textual information and user comments. We used a pre-trained model to classify each video with different feature combinations. Our best-performing fine-tuned language model, Universal Sentence Encoder (USE), achieves a Macro F1 of 87%, which shows that the trained model can be helpful for debunking fake videos using the comments from the user discussion. The dataset is available on GitHub: https://github.com/Gautamshahi/FakeClaim
Submitted 29 January, 2024;
originally announced January 2024.
-
Regret, Delete, (Do Not) Repeat: An Analysis of Self-Cleaning Practices on Twitter After the Outbreak of the COVID-19 Pandemic
Authors:
Nicolás E. Díaz Ferreyra,
Gautam Kishore Shahi,
Catherine Tony,
Stefan Stieglitz,
Riccardo Scandariato
Abstract:
During the outbreak of the COVID-19 pandemic, many people shared their symptoms across Online Social Networks (OSNs) like Twitter, hoping for others' advice or moral support. Prior studies have shown that those who disclose health-related information across OSNs often tend to regret it and delete their publications afterwards. Hence, deleted posts containing sensitive data can be seen as manifestations of online regrets. In this work, we present an analysis of deleted content on Twitter during the outbreak of the COVID-19 pandemic. For this, we collected more than 3.67 million tweets describing COVID-19 symptoms (e.g., fever, cough, and fatigue) posted between January and April 2020. We observed that around 24% of the tweets containing personal pronouns were deleted either by their authors or by the platform after one year. As a practical application of the resulting dataset, we explored its suitability for the automatic classification of regrettable content on Twitter.
Submitted 16 March, 2023;
originally announced March 2023.
-
Towards a Better Understanding of Online Influence: Differences in Twitter Communication Between Companies and Influencers
Authors:
Diana C. Hernandez-Bocanegra,
Angela Borchert,
Felix Brünker,
Gautam Kishore Shahi,
Björn Ross
Abstract:
In the last decade, Social Media platforms such as Twitter have gained importance in the various marketing strategies of companies. This work aims to examine the presence of influential content on a textual level, by investigating characteristics of tweets in the context of social impact theory, and its dimension immediacy. To this end, we analysed influential Twitter communication data during Black Friday 2018 with methods from social media analytics such as sentiment analysis and degree centrality. Results show significant differences in communication style between companies and influencers. Companies published longer textual content and created more tweets with a positive sentiment and more first-person pronouns than influencers. These findings shall serve as a basis for a future experimental study to examine the impact of text presence on consumer cognition and the willingness to purchase.
Submitted 16 February, 2022;
originally announced February 2022.
-
Overview of the HASOC Subtrack at FIRE 2021: Hate Speech and Offensive Content Identification in English and Indo-Aryan Languages
Authors:
Thomas Mandl,
Sandip Modha,
Gautam Kishore Shahi,
Hiren Madhu,
Shrey Satapara,
Prasenjit Majumder,
Johannes Schaefer,
Tharindu Ranasinghe,
Marcos Zampieri,
Durgesh Nandini,
Amit Kumar Jaiswal
Abstract:
The spread of offensive content online, such as hate speech, poses a growing societal problem. AI tools are necessary for supporting the moderation process at online platforms. For the evaluation of these identification tools, continuous experimentation with data sets in different languages is necessary. The HASOC track (Hate Speech and Offensive Content Identification) is dedicated to developing benchmark data for this purpose. This paper presents the HASOC subtrack for English, Hindi, and Marathi. The data set was assembled from Twitter. This subtrack has two sub-tasks. Task A is a binary classification problem (Hate and Not Offensive) offered for all three languages. Task B is a fine-grained classification problem for three classes, Hate speech (HATE), OFFENSIVE, and PROFANITY, offered for English and Hindi. Overall, 652 runs were submitted by 65 teams. The best classification algorithms for task A achieved F1 measures of 0.91, 0.78, and 0.83 for Marathi, Hindi, and English, respectively. This overview presents the tasks and the data development as well as the detailed results. The systems submitted to the competition applied a variety of technologies. The best performing algorithms were mainly variants of transformer architectures.
Submitted 16 December, 2021;
originally announced December 2021.
-
Overview of the CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News
Authors:
Preslav Nakov,
Giovanni Da San Martino,
Tamer Elsayed,
Alberto Barrón-Cedeño,
Rubén Míguez,
Shaden Shaar,
Firoj Alam,
Fatima Haouari,
Maram Hasanain,
Watheq Mansour,
Bayan Hamdan,
Zien Sheikh Ali,
Nikolay Babulkov,
Alex Nikolov,
Gautam Kishore Shahi,
Julia Maria Struß,
Thomas Mandl,
Mucahid Kutlu,
Yavuz Selim Kartal
Abstract:
We describe the fourth edition of the CheckThat! Lab, part of the 2021 Conference and Labs of the Evaluation Forum (CLEF). The lab evaluates technology supporting tasks related to factuality, and covers Arabic, Bulgarian, English, Spanish, and Turkish. Task 1 asks to predict which posts in a Twitter stream are worth fact-checking, focusing on COVID-19 and politics (in all five languages). Task 2 asks to determine whether a claim in a tweet can be verified using a set of previously fact-checked claims (in Arabic and English). Task 3 asks to predict the veracity of a news article and its topical domain (in English). The evaluation is based on mean average precision or precision at rank k for the ranking tasks, and macro-F1 for the classification tasks. This was the most popular CLEF-2021 lab in terms of team registrations: 132 teams. Nearly one-third of them participated: 15, 5, and 25 teams submitted official runs for tasks 1, 2, and 3, respectively.
Submitted 23 September, 2021;
originally announced September 2021.
-
Who shapes crisis communication on Twitter? An analysis of influential German-language accounts during the COVID-19 pandemic
Authors:
Gautam Kishore Shahi,
Sünje Clausen,
Stefan Stieglitz
Abstract:
Twitter is becoming an increasingly important platform for disseminating information during crisis situations, such as the COVID-19 pandemic. Effective crisis communication on Twitter can shape the public perception of the crisis, influence adherence to preventative measures, and thus affect public health. Influential accounts are particularly important as they reach large audiences quickly. This study identifies influential German-language accounts from almost 3 million German tweets collected between January and May 2020 by constructing a retweet network and calculating PageRank centrality values. We capture the volatility of crisis communication by structuring the analysis into seven stages based on key events during the pandemic and profile influential accounts into roles. Our analysis shows that news and journalist accounts were influential throughout all phases, while government accounts were particularly important shortly before and after the lockdown was imposed. We discuss implications for crisis communication during health crises and for analyzing long-term crisis data.
Submitted 12 September, 2021;
originally announced September 2021.
-
Overview of the HASOC track at FIRE 2020: Hate Speech and Offensive Content Identification in Indo-European Languages
Authors:
Thomas Mandl,
Sandip Modha,
Gautam Kishore Shahi,
Amit Kumar Jaiswal,
Durgesh Nandini,
Daksh Patel,
Prasenjit Majumder,
Johannes Schäfer
Abstract:
With the growth of social media, the spread of hate speech is also increasing rapidly. Social media are widely used in many countries, and hate speech is spreading in these countries as well. This brings a need for multilingual hate speech detection algorithms. Much research in this area is dedicated to English at the moment. The HASOC track intends to provide a platform to develop and optimize hate speech detection algorithms for Hindi, German, and English. The dataset is collected from a Twitter archive and pre-classified by a machine learning system. HASOC has two sub-tasks for all three languages: task A is a binary classification problem (Hate and Not Offensive), while task B is a fine-grained classification problem for three classes, Hate speech (HATE), OFFENSIVE, and PROFANITY. Overall, 252 runs were submitted by 40 teams. The best classification algorithms for task A achieved F1 measures of 0.51, 0.53, and 0.52 for English, Hindi, and German, respectively. For task B, the best classification algorithms achieved F1 measures of 0.26, 0.33, and 0.29 for English, Hindi, and German, respectively. This article presents the tasks and the data development as well as the results. The best performing algorithms were mainly variants of the transformer architecture BERT. However, other systems were also applied with good success.
Submitted 12 August, 2021;
originally announced August 2021.
-
Tiplines to Combat Misinformation on Encrypted Platforms: A Case Study of the 2019 Indian Election on WhatsApp
Authors:
Ashkan Kazemi,
Kiran Garimella,
Gautam Kishore Shahi,
Devin Gaffney,
Scott A. Hale
Abstract:
There is currently no easy way to fact-check content on WhatsApp and other end-to-end encrypted platforms at scale. In this paper, we analyze the usefulness of a crowd-sourced "tipline" through which users can submit content ("tips") that they want fact-checked. We compare the tips sent to a WhatsApp tipline run during the 2019 Indian national elections with the messages circulating in large, public groups on WhatsApp and other social media platforms during the same period. We find that tiplines are a very useful lens into WhatsApp conversations: a significant fraction of messages and images sent to the tipline match with the content being shared on public WhatsApp groups and other social media. Our analysis also shows that tiplines cover the most popular content well, and a majority of such content is often shared to the tipline before appearing in large, public WhatsApp groups. Overall, our findings suggest tiplines can be an effective source for discovering content to fact-check.
Submitted 23 July, 2021; v1 submitted 8 June, 2021;
originally announced June 2021.
-
AMUSED: An Annotation Framework of Multi-modal Social Media Data
Authors:
Gautam Kishore Shahi
Abstract:
In this paper, we present a semi-automated framework called AMUSED for gathering multi-modal annotated data from multiple social media platforms. The framework is designed to mitigate the issues of collecting and annotating social media data by cohesively combining machine and human effort in the data collection process. From a given list of articles from professional news media or blogs, AMUSED detects links to social media posts in the news articles and then downloads the contents of each post from the respective social media platform to gather details about that specific post. The framework is capable of fetching annotated data from multiple platforms such as Twitter, YouTube, and Reddit. The framework aims to reduce the workload and problems behind data annotation from social media platforms. AMUSED can be applied in multiple application domains; as a use case, we have implemented the framework for collecting COVID-19 misinformation data from different social media platforms.
Submitted 10 August, 2021; v1 submitted 1 October, 2020;
originally announced October 2020.
-
FakeCovid – A Multilingual Cross-domain Fact Check News Dataset for COVID-19
Authors:
Gautam Kishore Shahi,
Durgesh Nandini
Abstract:
In this paper, we present a first multilingual cross-domain dataset of 5182 fact-checked news articles for COVID-19, collected from 04/01/2020 to 15/05/2020. We have collected the fact-checked articles from 92 different fact-checking websites after obtaining references from Poynter and Snopes. We have manually annotated articles into 11 different categories of fact-checked news according to their content. The dataset is in 40 languages from 105 countries. We have built a classifier to detect fake news and present results for automatic fake news detection and its class. Our model achieves an F1 score of 0.76 in detecting the false class and other fact-check articles. The FakeCovid dataset is available on GitHub.
Submitted 19 June, 2020;
originally announced June 2020.
-
An Exploratory Study of COVID-19 Misinformation on Twitter
Authors:
Gautam Kishore Shahi,
Anne Dirkson,
Tim A. Majchrzak
Abstract:
During the COVID-19 pandemic, social media has become a home ground for misinformation. To tackle this infodemic, scientific oversight, as well as a better understanding by practitioners in crisis management, is needed. We have conducted an exploratory study into the propagation, authors and content of misinformation on Twitter around the topic of COVID-19 in order to gain early insights. We have collected all tweets mentioned in the verdicts of fact-checked claims related to COVID-19 by over 92 professional fact-checking organisations between January and mid-July 2020 and share this corpus with the community. This resulted in 1,500 tweets relating to 1,274 false and 276 partially false claims, respectively. Exploratory analysis of author accounts revealed that verified Twitter handles (including organisations and celebrities) are also involved in either creating (new tweets) or spreading (retweets) the misinformation. Additionally, we found that false claims propagate faster than partially false claims. Compared to a background corpus of COVID-19 tweets, tweets with misinformation are more often concerned with discrediting other information on social media. Authors use less tentative language and appear to be more driven by concerns of potential harm to others. Our results enable us to suggest gaps in the current scientific coverage of the topic as well as propose actions for authorities and social media users to counter misinformation.
Submitted 24 August, 2020; v1 submitted 12 May, 2020;
originally announced May 2020.