-
VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment
Authors:
Shaina Raza,
Ashmal Vayani,
Aditya Jain,
Aravind Narayanan,
Vahid Reza Khazaie,
Syed Raza Bashir,
Elham Dolatabadi,
Gias Uddin,
Christos Emmanouilidis,
Rizwan Qureshi,
Mubarak Shah
Abstract:
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single-modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media.
Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench
Submitted 30 May, 2025; v1 submitted 16 February, 2025;
originally announced February 2025.
-
Developing Safe and Responsible Large Language Model: Can We Balance Bias Reduction and Language Understanding in Large Language Models?
Authors:
Shaina Raza,
Oluwanifemi Bamgbose,
Shardul Ghuge,
Fatemeh Tavakol,
Deepak John Reji,
Syed Raza Bashir
Abstract:
Large Language Models (LLMs) have advanced various Natural Language Processing (NLP) tasks, such as text generation and translation. However, these models often generate texts that can perpetuate biases. Existing approaches to mitigate these biases usually compromise knowledge retention. This study explores whether LLMs can produce safe, unbiased outputs without sacrificing knowledge or comprehension. We introduce the Safe and Responsible Large Language Model (SR$_{\text{LLM}}$), which has been instruction fine-tuned atop a safe fine-tuned auto-regressive decoder-only LLM to reduce biases in generated texts. We developed a specialized dataset with examples of unsafe and corresponding safe variations to train SR$_{\text{LLM}}$ to identify and correct biased text. Experiments on our specialized dataset and out-of-distribution test sets reveal that SR$_{\text{LLM}}$ effectively reduces biases while preserving knowledge integrity. This performance surpasses that of traditional fine-tuning of smaller language models and base LLMs that merely rely on prompting techniques. Our findings demonstrate that instruction fine-tuning on custom datasets tailored for tasks such as debiasing is a highly effective strategy for minimizing bias in LLMs while preserving their inherent knowledge and capabilities. The code and dataset are accessible at https://github.com/shainarazavi/Safe-Responsible-LLM
Submitted 5 January, 2025; v1 submitted 1 April, 2024;
originally announced April 2024.
-
A Narrative Review of Identity, Data, and Location Privacy Techniques in Edge Computing and Mobile Crowdsourcing
Authors:
Syed Raza Bashir,
Shaina Raza,
Vojislav Misic
Abstract:
As digital technology advances, the proliferation of connected devices poses significant challenges and opportunities in mobile crowdsourcing and edge computing. This narrative review focuses on the need for privacy protection in these fields, emphasizing the increasing importance of data security in a data-driven world. Through an analysis of contemporary academic literature, this review provides an understanding of the current trends and privacy concerns in mobile crowdsourcing and edge computing. We present insights and highlight advancements in privacy-preserving techniques, addressing identity, data, and location privacy. This review also discusses the potential directions that can be useful resources for researchers, industry professionals, and policymakers.
Submitted 28 October, 2024; v1 submitted 20 January, 2024;
originally announced January 2024.
-
NBIAS: A Natural Language Processing Framework for Bias Identification in Text
Authors:
Shaina Raza,
Muskan Garg,
Deepak John Reji,
Syed Raza Bashir,
Chen Ding
Abstract:
Bias in textual data can lead to skewed interpretations and outcomes when the data is used. These biases could perpetuate stereotypes, discrimination, or other forms of unfair treatment. An algorithm trained on biased data may end up making decisions that disproportionately impact a certain group of people. Therefore, it is crucial to detect and remove these biases to ensure the fair and ethical use of data. To this end, we develop a comprehensive and robust framework, NBIAS, that consists of four main layers: data, corpus construction, model development, and evaluation. The dataset is constructed by collecting diverse data from various domains, including social media, healthcare, and job hiring portals. We then apply a transformer-based token classification model that is able to identify biased words/phrases through a unique named entity, BIAS. In the evaluation procedure, we incorporate a blend of quantitative and qualitative measures to gauge the effectiveness of our models. We achieve accuracy improvements ranging from 1% to 8% compared to baselines. We are also able to generate a robust understanding of the model functioning. The proposed approach is applicable to a variety of biases and contributes to the fair and ethical use of textual data.
Submitted 29 August, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
-
Fairness in Machine Learning meets with Equity in Healthcare
Authors:
Shaina Raza,
Parisa Osivand Pour,
Syed Raza Bashir
Abstract:
With the growing utilization of machine learning in healthcare, there is increasing potential to enhance healthcare outcomes. However, this also brings the risk of perpetuating biases in data and model design that can harm certain demographic groups based on factors such as age, gender, and race. This study proposes an artificial intelligence framework, grounded in software engineering principles, for identifying and mitigating biases in data and models while ensuring fairness in healthcare settings. A case study is presented to demonstrate how systematic biases in data can lead to amplified biases in model predictions, and machine learning methods are suggested to prevent such biases. Future research aims to test and validate the proposed ML framework in real-world clinical settings to evaluate its impact on promoting health equity.
Submitted 14 August, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Leveraging Foundation Models for Clinical Text Analysis
Authors:
Shaina Raza,
Syed Raza Bashir
Abstract:
Infectious diseases are a significant public health concern globally, and extracting relevant information from scientific literature can facilitate the development of effective prevention and treatment strategies. However, the large amount of clinical data available presents a challenge for information extraction. To address this challenge, this study proposes a natural language processing (NLP) framework that uses a pre-trained transformer model fine-tuned on task-specific data to extract key information related to infectious diseases from free-text clinical data. The proposed framework includes three components: a data layer for preparing datasets from clinical texts, a foundation model layer for entity extraction, and an assessment layer for performance analysis. The results of the evaluation indicate that the proposed method outperforms standard methods, and leveraging prior knowledge through the pre-trained transformer model makes it useful for investigating other infectious diseases in the future.
Submitted 20 March, 2023;
originally announced March 2023.
-
Addressing Biases in the Texts using an End-to-End Pipeline Approach
Authors:
Shaina Raza,
Syed Raza Bashir,
Sneha,
Urooj Qamar
Abstract:
The concept of fairness is gaining popularity in academia and industry. Social media is especially vulnerable to media biases and toxic language and comments. We propose a fair ML pipeline that takes a text as input and determines whether it contains biases and toxic content. Then, based on pre-trained word embeddings, it suggests a set of new words to substitute for the biased words; the idea is to lessen the effects of those biases by replacing them with alternative words. We compare our approach to existing fairness models to determine its effectiveness. The results show that our proposed pipeline can detect, identify, and mitigate biases in social media data.
Submitted 13 March, 2023;
originally announced March 2023.
-
BERT4Loc: BERT for Location -- POI Recommender System
Authors:
Syed Raza Bashir,
Shaina Raza,
Vojislav Misic
Abstract:
Recommending points of interest (POIs) is a challenging task that requires extracting comprehensive location data from location-based social media platforms. To provide effective location-based recommendations, it's important to analyze users' historical behavior and preferences. In this study, we present a sophisticated location-aware recommendation system that uses Bidirectional Encoder Representations from Transformers (BERT) to offer personalized location-based suggestions. Our model combines location information and user preferences to provide more relevant recommendations compared to models that predict the next POI in a sequence. Our experiments on two benchmark datasets show that our BERT-based model outperforms various state-of-the-art sequential models. Moreover, additional experiments demonstrate the quality of the proposed model's recommendations.
Submitted 16 May, 2023; v1 submitted 2 August, 2022;
originally announced August 2022.
-
An Approach to Ensure Fairness in News Articles
Authors:
Shaina Raza,
Deepak John Reji,
Dora D. Liu,
Syed Raza Bashir,
Usman Naseem
Abstract:
Recommender systems, information retrieval, and other information access systems present unique challenges for examining and applying concepts of fairness and bias mitigation in unstructured text. This paper introduces Dbias, which is a Python package to ensure fairness in news articles. Dbias is a trained Machine Learning (ML) pipeline that can take a text (e.g., a paragraph or news story) and detect whether the text is biased. Then, it detects the biased words in the text, masks them, and recommends a set of sentences with new words that are bias-free or at least less biased. We incorporate the elements of data science best practices to ensure that this pipeline is reproducible and usable. We show in experiments that this pipeline can be effective for mitigating biases and outperforms the common neural network architectures in ensuring fairness in the news articles.
Submitted 8 July, 2022;
originally announced July 2022.
-
Improving Rating and Relevance with Point-of-Interest Recommender System
Authors:
Syed Raza Bashir,
Vojislav Misic
Abstract:
The recommendation of points of interest (POIs) is essential in location-based social networks. It makes it easier for users and locations to share information. Recently, researchers have tended to recommend POIs by treating them as large-scale retrieval systems that require a large amount of training data representing query-item relevance. However, gathering user feedback in retrieval systems is an expensive task. Existing POI recommender systems make recommendations based solely on user and item (location) interactions. However, there are numerous sources of feedback to consider, for example, when the user visits a POI, what the POI is about, and so on. Integrating all these different types of feedback is essential when developing a POI recommender. In this paper, we propose using user and item information and auxiliary information to improve the recommendation modelling in a retrieval system. We develop a deep neural network architecture to model query-item relevance in the presence of both collaborative and content information. We also improve the quality of the learned representations of queries and items by including the contextual information from the user feedback data. The application of these learned representations to a large-scale dataset resulted in significant improvements.
Submitted 17 February, 2022;
originally announced February 2022.
-
A Summary of COVID-19 Datasets
Authors:
Syed Raza Bashir,
Shaina Raza,
Vidhi Thakkar,
Usman Naseem
Abstract:
This research presents a review of the main datasets that have been developed for COVID-19 research. We hope this collection will continue to bring together members of the computing community, biomedical experts, and policymakers in the pursuit of effective COVID-19 treatments and management policies. Many organizations around the world, such as the World Health Organization (WHO), Johns Hopkins, the National Institutes of Health (NIH), and the COVID-19 open science table, have made numerous datasets available to the public. However, these datasets originate from a variety of different sources and initiatives. The purpose of this research is to summarize the open COVID-19 datasets to make them more accessible to the research community for health systems design and analysis.
Submitted 27 July, 2022; v1 submitted 6 February, 2022;
originally announced February 2022.
-
Detecting Fake Points of Interest from Location Data
Authors:
Syed Raza Bashir,
Vojislav Misic
Abstract:
The pervasiveness of GPS-enabled mobile devices and the widespread use of location-based services have resulted in the generation of massive amounts of geo-tagged data. In recent times, data analysis has gained access to more sources, including reviews, news, and images, which also raises questions about the reliability of Point-of-Interest (POI) data sources. While previous research attempted to detect fake POI data through various security mechanisms, the current work attempts to capture the fake POI data in a much simpler way. The proposed work is focused on supervised learning methods and their capability to find hidden patterns in location-based data. The ground truth labels are obtained through real-world data, and the fake data is generated using an API, so we get a dataset with both the real and fake labels on the location data. The objective is to predict the truth about a POI using the Multi-Layer Perceptron (MLP) method. In the proposed work, an MLP-based data classification technique is used to classify location data accurately. The proposed method is compared with traditional classification methods and with recent, robust deep neural methods. The results show that the proposed method is better than the baseline methods.
Submitted 10 November, 2021;
originally announced November 2021.