-
Gender Stereotypes in Professional Roles Among Saudis: An Analytical Study of AI-Generated Images Using Language Models
Authors:
Khaloud S. AlKhalifah,
Malak Mashaabi,
Hend Al-Khalifa
Abstract:
This study investigates the extent to which contemporary Text-to-Image artificial intelligence (AI) models perpetuate gender stereotypes and cultural inaccuracies when generating depictions of professionals in Saudi Arabia. We analyzed 1,006 images produced by ImageFX, DALL-E V3, and Grok for 56 diverse Saudi professions using neutral prompts. Two trained Saudi annotators evaluated each image on f…
▽ More
This study investigates the extent to which contemporary Text-to-Image artificial intelligence (AI) models perpetuate gender stereotypes and cultural inaccuracies when generating depictions of professionals in Saudi Arabia. We analyzed 1,006 images produced by ImageFX, DALL-E V3, and Grok for 56 diverse Saudi professions using neutral prompts. Two trained Saudi annotators evaluated each image on five dimensions: perceived gender, clothing and appearance, background and setting, activities and interactions, and age. A third senior researcher adjudicated whenever the two primary raters disagreed, yielding 10,100 individual judgements. The results reveal a strong gender imbalance, with ImageFX outputs being 85\% male, Grok 86.6\% male, and DALL-E V3 96\% male, indicating that DALL-E V3 exhibited the strongest overall gender stereotyping. This imbalance was most evident in leadership and technical roles. Moreover, cultural inaccuracies in clothing, settings, and depicted activities were frequently observed across all three models. Counter-stereotypical images often arise from cultural misinterpretations rather than genuinely progressive portrayals. We conclude that current models mirror societal biases embedded in their training data, generated by humans, offering only a limited reflection of the Saudi labour market's gender dynamics and cultural nuances. These findings underscore the urgent need for more diverse training data, fairer algorithms, and culturally sensitive evaluation frameworks to ensure equitable and authentic visual outputs.
△ Less
Submitted 25 September, 2025;
originally announced September 2025.
-
DAIQ: Auditing Demographic Attribute Inference from Question in LLMs
Authors:
Srikant Panda,
Hitesh Laxmichand Patel,
Shahad Al-Khalifa,
Amit Agarwal,
Hend Al-Khalifa,
Sharefah Al-Ghamdi
Abstract:
Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demogr…
▽ More
Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education.
We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing.
Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.
△ Less
Submitted 18 August, 2025;
originally announced August 2025.
-
The Prompting Brain: Neurocognitive Markers of Expertise in Guiding Large Language Models
Authors:
Hend Al-Khalifa,
Raneem Almansour,
Layan Abdulrahman Alhuasini,
Alanood Alsaleh,
Mohamad-Hani Temsah,
Mohamad-Hani_Temsah,
Ashwag Rafea S Alruwaili
Abstract:
Prompt engineering has rapidly emerged as a critical skill for effective interaction with large language models (LLMs). However, the cognitive and neural underpinnings of this expertise remain largely unexplored. This paper presents findings from a cross-sectional pilot fMRI study investigating differences in brain functional connectivity and network activity between experts and intermediate promp…
▽ More
Prompt engineering has rapidly emerged as a critical skill for effective interaction with large language models (LLMs). However, the cognitive and neural underpinnings of this expertise remain largely unexplored. This paper presents findings from a cross-sectional pilot fMRI study investigating differences in brain functional connectivity and network activity between experts and intermediate prompt engineers. Our results reveal distinct neural signatures associated with higher prompt engineering literacy, including increased functional connectivity in brain regions such as the left middle temporal gyrus and the left frontal pole, as well as altered power-frequency dynamics in key cognitive networks. These findings offer initial insights into the neurobiological basis of prompt engineering proficiency. We discuss the implications of these neurocognitive markers in Natural Language Processing (NLP). Understanding the neural basis of human expertise in interacting with LLMs can inform the design of more intuitive human-AI interfaces, contribute to cognitive models of LLM interaction, and potentially guide the development of AI systems that better align with human cognitive workflows. This interdisciplinary approach aims to bridge the gap between human cognition and machine intelligence, fostering a deeper understanding of how humans learn and adapt to complex AI systems.
△ Less
Submitted 20 August, 2025;
originally announced August 2025.
-
The Landscape of Arabic Large Language Models (ALLMs): A New Era for Arabic Language Technology
Authors:
Shahad Al-Khalifa,
Nadir Durrani,
Hend Al-Khalifa,
Firoj Alam
Abstract:
The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speakin…
▽ More
The emergence of ChatGPT marked a transformative milestone for Artificial Intelligence (AI), showcasing the remarkable potential of Large Language Models (LLMs) to generate human-like text. This wave of innovation has revolutionized how we interact with technology, seamlessly integrating LLMs into everyday tasks such as vacation planning, email drafting, and content creation. While English-speaking users have significantly benefited from these advancements, the Arabic world faces distinct challenges in developing Arabic-specific LLMs. Arabic, one of the languages spoken most widely around the world, serves more than 422 million native speakers in 27 countries and is deeply rooted in a rich linguistic and cultural heritage. Developing Arabic LLMs (ALLMs) presents an unparalleled opportunity to bridge technological gaps and empower communities. The journey of ALLMs has been both fascinating and complex, evolving from rudimentary text processing systems to sophisticated AI-driven models. This article explores the trajectory of ALLMs, from their inception to the present day, highlighting the efforts to evaluate these models through benchmarks and public leaderboards. We also discuss the challenges and opportunities that ALLMs present for the Arab world.
△ Less
Submitted 2 June, 2025;
originally announced June 2025.
-
MultiProSE: A Multi-label Arabic Dataset for Propaganda, Sentiment, and Emotion Detection
Authors:
Lubna Al-Henaki,
Hend Al-Khalifa,
Abdulmalik Al-Salman,
Hajar Alqubayshi,
Hind Al-Twailay,
Gheeda Alghamdi,
Hawra Aljasim
Abstract:
Propaganda is a form of persuasion that has been used throughout history with the intention goal of influencing people's opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranked as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To a…
▽ More
Propaganda is a form of persuasion that has been used throughout history with the intention goal of influencing people's opinions through rhetorical and psychological persuasion techniques for determined ends. Although Arabic ranked as the fourth most-used language on the internet, resources for propaganda detection in languages other than English, especially Arabic, remain extremely limited. To address this gap, the first Arabic dataset for Multi-label Propaganda, Sentiment, and Emotion (MultiProSE) has been introduced. MultiProSE is an open-source extension of the existing Arabic propaganda dataset, ArPro, with the addition of sentiment and emotion annotations for each text. This dataset comprises 8,000 annotated news articles, which is the largest propaganda dataset to date. For each task, several baselines have been developed using large language models (LLMs), such as GPT-4o-mini, and pre-trained language models (PLMs), including three BERT-based models. The dataset, annotation guidelines, and source code are all publicly released to facilitate future research and development in Arabic language models and contribute to a deeper understanding of how various opinion dimensions interact in news media1.
△ Less
Submitted 12 February, 2025;
originally announced February 2025.
-
GLARE: Google Apps Arabic Reviews Dataset
Authors:
Fatima AlGhamdi,
Reem Mohammed,
Hend Al-Khalifa,
Areeb Alowisheq
Abstract:
This paper introduces GLARE an Arabic Apps Reviews dataset collected from Saudi Google PlayStore. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android Applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and Feature Engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset…
▽ More
This paper introduces GLARE an Arabic Apps Reviews dataset collected from Saudi Google PlayStore. It consists of 76M reviews, 69M of which are Arabic reviews of 9,980 Android Applications. We present the data collection methodology, along with a detailed Exploratory Data Analysis (EDA) and Feature Engineering on the gathered reviews. We also highlight possible use cases and benefits of the dataset.
△ Less
Submitted 16 December, 2024;
originally announced December 2024.
-
A Survey of Large Language Models for Arabic Language and its Dialects
Authors:
Malak Mashaabi,
Shahad Al-Khalifa,
Hend Al-Khalifa
Abstract:
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingu…
▽ More
This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.
△ Less
Submitted 24 February, 2025; v1 submitted 26 October, 2024;
originally announced October 2024.
-
CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Authors:
Mashael Al-Duwais,
Hend Al-Khalifa,
Abdulmalik Al-Salman
Abstract:
Label errors are a common issue in machine learning datasets, particularly for tasks such as Named Entity Recognition. Such label errors might hurt model training, affect evaluation results, and lead to an inaccurate assessment of model performance. In this study, we dived deep into one of the widely adopted Arabic NER benchmark datasets (ANERcorp) and found a significant number of annotation erro…
▽ More
Label errors are a common issue in machine learning datasets, particularly for tasks such as Named Entity Recognition. Such label errors might hurt model training, affect evaluation results, and lead to an inaccurate assessment of model performance. In this study, we dived deep into one of the widely adopted Arabic NER benchmark datasets (ANERcorp) and found a significant number of annotation errors, missing labels, and inconsistencies. Therefore, in this study, we conducted empirical research to understand these errors, correct them and propose a cleaner version of the dataset named CLEANANERCorp. CLEANANERCorp will serve the research community as a more accurate and consistent benchmark.
△ Less
Submitted 8 March, 2025; v1 submitted 22 August, 2024;
originally announced August 2024.
-
The Qiyas Benchmark: Measuring ChatGPT Mathematical and Language Understanding in Arabic
Authors:
Shahad Al-Khalifa,
Hend Al-Khalifa
Abstract:
Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in A…
▽ More
Despite the growing importance of Arabic as a global language, there is a notable lack of language models pre-trained exclusively on Arabic data. This shortage has led to limited benchmarks available for assessing language model performance in Arabic. To address this gap, we introduce two novel benchmarks designed to evaluate models' mathematical reasoning and language understanding abilities in Arabic. These benchmarks are derived from a General Aptitude Test (GAT) called Qiyas exam, a standardized test widely used for university admissions in Saudi Arabia. For validation purposes, we assess the performance of ChatGPT-3.5-trubo and ChatGPT-4 on our benchmarks. Our findings reveal that these benchmarks pose a significant challenge, with ChatGPT-4 achieving an overall average accuracy of 64%, while ChatGPT-3.5-trubo achieved an overall accuracy of 49% across the various question types in the Qiyas benchmark. We believe the release of these benchmarks will pave the way for enhancing the mathematical reasoning and language understanding capabilities of future models tailored for the low-resource Arabic language.
△ Less
Submitted 28 June, 2024;
originally announced July 2024.
-
Towards Designing a ChatGPT Conversational Companion for Elderly People
Authors:
Abeer Alessa,
Hend Al-Khalifa
Abstract:
Loneliness and social isolation are serious and widespread problems among older people, affecting their physical and mental health, quality of life, and longevity. In this paper, we propose a ChatGPT-based conversational companion system for elderly people. The system is designed to provide companionship and help reduce feelings of loneliness and social isolation. The system was evaluated with a p…
▽ More
Loneliness and social isolation are serious and widespread problems among older people, affecting their physical and mental health, quality of life, and longevity. In this paper, we propose a ChatGPT-based conversational companion system for elderly people. The system is designed to provide companionship and help reduce feelings of loneliness and social isolation. The system was evaluated with a preliminary study. The results showed that the system was able to generate responses that were relevant to the created elderly personas. However, it is essential to acknowledge the limitations of ChatGPT, such as potential biases and misinformation, and to consider the ethical implications of using AI-based companionship for the elderly, including privacy concerns.
△ Less
Submitted 18 April, 2023;
originally announced April 2023.
-
The Saudi Privacy Policy Dataset
Authors:
Hend Al-Khalifa,
Malak Mashaabi,
Ghadi Al-Yahya,
Raghad Alnashwan
Abstract:
This paper introduces the Saudi Privacy Policy Dataset, a diverse compilation of Arabic privacy policies from various sectors in Saudi Arabia, annotated according to the 10 principles of the Personal Data Protection Law (PDPL); the PDPL was established to be compatible with General Data Protection Regulation (GDPR); one of the most comprehensive data regulations worldwide. Data were collected from…
▽ More
This paper introduces the Saudi Privacy Policy Dataset, a diverse compilation of Arabic privacy policies from various sectors in Saudi Arabia, annotated according to the 10 principles of the Personal Data Protection Law (PDPL); the PDPL was established to be compatible with General Data Protection Regulation (GDPR); one of the most comprehensive data regulations worldwide. Data were collected from multiple sources, including the Saudi Central Bank, the Saudi Arabia National United Platform, the Council of Health Insurance, and general websites using Google and Wikipedia. The final dataset includes 1,000 websites belonging to 7 sectors, 4,638 lines of text, 775,370 tokens, and a corpus size of 8,353 KB. The annotated dataset offers significant reuse potential for assessing privacy policy compliance, benchmarking privacy practices across industries, and developing automated tools for monitoring adherence to data protection regulations. By providing a comprehensive and annotated dataset of privacy policies, this paper aims to facilitate further research and development in the areas of privacy policy analysis, natural language processing, and machine learning applications related to privacy and data protection, while also serving as an essential resource for researchers, policymakers, and industry professionals interested in understanding and promoting compliance with privacy regulations in Saudi Arabia.
△ Less
Submitted 5 April, 2023;
originally announced April 2023.
-
Natural Language Processing in Customer Service: A Systematic Review
Authors:
Malak Mashaabi,
Areej Alotaibi,
Hala Qudaih,
Raghad Alnashwan,
Hend Al-Khalifa
Abstract:
Artificial intelligence and natural language processing (NLP) are increasingly being used in customer service to interact with users and answer their questions. The goal of this systematic review is to examine existing research on the use of NLP technology in customer service, including the research domain, applications, datasets used, and evaluation methods. The review also looks at the future di…
▽ More
Artificial intelligence and natural language processing (NLP) are increasingly being used in customer service to interact with users and answer their questions. The goal of this systematic review is to examine existing research on the use of NLP technology in customer service, including the research domain, applications, datasets used, and evaluation methods. The review also looks at the future direction of the field and any significant limitations. The review covers the time period from 2015 to 2022 and includes papers from five major scientific databases. Chatbots and question-answering systems were found to be used in 10 main fields, with the most common use in general, social networking, and e-commerce areas. Twitter was the second most commonly used dataset, with most research also using their own original datasets. Accuracy, precision, recall, and F1 were the most common evaluation methods. Future work aims to improve the performance and understanding of user behavior and emotions, and address limitations such as the volume, diversity, and quality of datasets. This review includes research on different spoken languages and models and techniques.
△ Less
Submitted 16 December, 2022;
originally announced December 2022.
-
Handwritten Arabic Character Recognition for Children Writ-ing Using Convolutional Neural Network and Stroke Identification
Authors:
Mais Alheraki,
Rawan Al-Matham,
Hend Al-Khalifa
Abstract:
Automatic Arabic handwritten recognition is one of the recently studied problems in the field of Machine Learning. Unlike Latin languages, Arabic is a Semitic language that forms a harder challenge, especially with variability of patterns caused by factors such as writer age. Most of the studies focused on adults, with only one recent study on children. Moreover, much of the recent Machine Learnin…
▽ More
Automatic Arabic handwritten recognition is one of the recently studied problems in the field of Machine Learning. Unlike Latin languages, Arabic is a Semitic language that forms a harder challenge, especially with variability of patterns caused by factors such as writer age. Most of the studies focused on adults, with only one recent study on children. Moreover, much of the recent Machine Learning methods focused on using Convolutional Neural Networks, a powerful class of neural networks that can extract complex features from images. In this paper we propose a convolutional neural network (CNN) model that recognizes children handwriting with an accuracy of 91% on the Hijja dataset, a recent dataset built by collecting images of the Arabic characters written by children, and 97% on Arabic Handwritten Character Dataset. The results showed a good improvement over the proposed model from the Hijja dataset authors, yet it reveals a bigger challenge to solve for children Arabic handwritten character recognition. Moreover, we proposed a new approach using multi models instead of single model based on the number of strokes in a character, and merged Hijja with AHCD which reached an averaged prediction accuracy of 96%.
△ Less
Submitted 3 November, 2022;
originally announced November 2022.
-
A Panoramic Survey of Natural Language Processing in the Arab World
Authors:
Kareem Darwish,
Nizar Habash,
Mourad Abbas,
Hend Al-Khalifa,
Huseein T. Al-Natsheh,
Samhaa R. El-Beltagy,
Houda Bouamor,
Karim Bouzoubaa,
Violetta Cavalli-Sforza,
Wassim El-Hajj,
Mustafa Jarrar,
Hamdy Mubarak
Abstract:
The term natural language refers to any system of symbolic communication (spoken, signed or written) without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NLP) is the sub-field of artificial intelligence (AI) focused on modeling natural languag…
▽ More
The term natural language refers to any system of symbolic communication (spoken, signed or written) without intentional human planning and design. This distinguishes natural languages such as Arabic and Japanese from artificially constructed languages such as Esperanto or Python. Natural language processing (NLP) is the sub-field of artificial intelligence (AI) focused on modeling natural languages to build applications such as speech recognition and synthesis, machine translation, optical character recognition (OCR), sentiment analysis (SA), question answering, dialogue systems, etc. NLP is a highly interdisciplinary field with connections to computer science, linguistics, cognitive science, psychology, mathematics and others. Some of the earliest AI applications were in NLP (e.g., machine translation); and the last decade (2010-2020) in particular has witnessed an incredible increase in quality, matched with a rise in public awareness, use, and expectations of what may have seemed like science fiction in the past. NLP researchers pride themselves on developing language independent models and tools that can be applied to all human languages, e.g. machine translation systems can be built for a variety of languages using the same basic mechanisms and models. However, the reality is that some languages do get more attention (e.g., English and Chinese) than others (e.g., Hindi and Swahili). Arabic, the primary language of the Arab world and the religious language of millions of non-Arab Muslims is somewhere in the middle of this continuum. Though Arabic NLP has many challenges, it has seen many successes and developments. Next we discuss Arabic's main challenges as a necessary background, and we present a brief history of Arabic NLP. We then survey a number of its research areas, and close with a critical discussion of the future of Arabic NLP.
△ Less
Submitted 27 September, 2021; v1 submitted 25 November, 2020;
originally announced November 2020.
-
Sentiment Analysis of Arabic Tweets: Feature Engineering and A Hybrid Approach
Authors:
Nora Al-Twairesh,
Hend Al-Khalifa,
AbdulMalik Alsalman,
Yousef Al-Ohali
Abstract:
Sentiment Analysis in Arabic is a challenging task due to the rich morphology of the language. Moreover, the task is further complicated when applied to Twitter data that is known to be highly informal and noisy. In this paper, we develop a hybrid method for sentiment analysis for Arabic tweets for a specific Arabic dialect which is the Saudi Dialect. Several features were engineered and evaluated…
▽ More
Sentiment Analysis in Arabic is a challenging task due to the rich morphology of the language. Moreover, the task is further complicated when applied to Twitter data that is known to be highly informal and noisy. In this paper, we develop a hybrid method for sentiment analysis for Arabic tweets for a specific Arabic dialect which is the Saudi Dialect. Several features were engineered and evaluated using a feature backward selection method. Then a hybrid method that combines a corpus-based and lexicon-based method was developed for several classification models (two-way, three-way, four-way). The best F1-score for each of these models was (69.9,61.63,55.07) respectively.
△ Less
Submitted 22 May, 2018;
originally announced May 2018.