Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Authors:
Fakhraddin Alwajih,
Samar Mohamed Magdy,
Abdellah El Mekki,
Omer Nacar,
Youssef Nafea,
Safaa Taher Abdelfadil,
Abdulfattah Mohammed Yahya,
Hamzah Luqman,
Nada Almarwani,
Samah Aloufi,
Baraah Qawasmeh,
Houdaifa Atou,
Serry Sibaee,
Hamzah A. Alsayadi,
Walid Al-Dhabyani,
Maged S. Al-shaibani,
Aya El aatar,
Nour Qandos,
Rahaf Alhamouri,
Samar Ahmad,
Razan Khassib,
Lina Hamad,
Mohammed Anwar AL-Ghrawi,
Fatimah Alshamari,
Cheikh Malainine, et al. (20 additional authors not shown)
Abstract:
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks, Pearl and Pearl-Lite, along with a specialized subset, Pearl-X, explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally informed multimodal modeling research. All datasets and benchmarks are publicly available.
Submitted 22 June, 2025; v1 submitted 28 May, 2025;
originally announced May 2025.
Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Authors:
Fakhraddin Alwajih,
Abdellah El Mekki,
Samar Mohamed Magdy,
Abdelrahim A. Elmadany,
Omer Nacar,
El Moatez Billah Nagoudi,
Reem Abdel-Salam,
Hanin Atwany,
Youssef Nafea,
Abdulfattah Mohammed Yahya,
Rahaf Alhamouri,
Hamzah A. Alsayadi,
Hiba Zayed,
Sara Shatnawi,
Serry Sibaee,
Yasir Ech-Chammakhy,
Walid Al-Dhabyani,
Marwa Mohamed Ali,
Imen Jarraya,
Ahmed Oumar El-Shangiti,
Aisha Alraeesi,
Mohammed Anwar Al-Ghrawi,
Abdulrahman S. Al-Batati,
Elgizouli Mohamed,
Noha Taha Elgindi, et al. (19 additional authors not shown)
Abstract:
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
Submitted 28 February, 2025;
originally announced March 2025.