Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Authors: Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, et al. (19 additional authors not shown)
Abstract:
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
Submitted 28 February, 2025;
originally announced March 2025.
WaveQ: Gradient-Based Deep Quantization of Neural Networks through Sinusoidal Adaptive Regularization
Authors: Ahmed T. Elthakeb, Prannoy Pilligundla, Fatemehsadat Mireshghallah, Tarek Elgindi, Charles-Alban Deledalle, Hadi Esmaeilzadeh
Abstract:
As deep neural networks make their way into different domains, their compute efficiency is becoming a first-order constraint. Deep quantization, which reduces the bitwidth of the operations (below 8 bits), offers a unique opportunity as it can reduce both the storage and compute requirements of the network super-linearly. However, if not employed with diligence, this can lead to significant accuracy loss. Because layers are strongly interdependent and exhibit different characteristics across the same network, choosing an optimal bitwidth at per-layer granularity is not straightforward. As such, deep quantization opens a large hyper-parameter space, the exploration of which is a major challenge. We propose a novel sinusoidal regularization, called SINAREQ, for deep quantized training. Leveraging the sinusoidal properties, we seek to learn multiple quantization parameterizations in conjunction during the gradient-based training process. Specifically, we learn (i) a per-layer quantization bitwidth along with (ii) a scale factor through learning the period of the sinusoidal function. At the same time, we exploit the periodicity, differentiability, and the local convexity profile in sinusoidal functions to automatically propel (iii) network weights towards values quantized at levels that are jointly determined. We show how SINAREQ balances compute efficiency and accuracy, and provides a heterogeneous bitwidth assignment for quantization of a large variety of deep networks (AlexNet, CIFAR-10, MobileNet, ResNet-18, ResNet-20, SVHN, and VGG-11) that virtually preserves the accuracy. Furthermore, we carry out experimentation using fixed homogeneous bitwidths with 3- to 5-bit assignment and show the versatility of SINAREQ in enhancing quantized training algorithms (DoReFa and WRPN) with about 4.8% accuracy improvements on average, outperforming multiple state-of-the-art techniques.
Submitted 24 April, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
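The WaveQ abstract above describes a sinusoidal regularizer whose periodicity pulls weights toward quantization levels. The sketch below illustrates only that one idea and is not the authors' formulation: the penalty form sin^2(pi * w * (2^bits - 1)), the fixed (not learned) bitwidth, and the helper names `sin_reg`, `sin_reg_grad`, and `descend` are all assumptions made for illustration. With this form, the penalty is zero exactly when each weight is a multiple of 1/(2^bits - 1), so plain gradient descent on the penalty alone moves each weight to its nearest level.

```python
import math


def sin_reg(weights, bits):
    """Assumed penalty: mean of sin^2(pi * w * L) with L = 2^bits - 1.

    It vanishes when every w is a multiple of 1/L (a quantization level).
    """
    levels = 2 ** bits - 1
    return sum(math.sin(math.pi * w * levels) ** 2 for w in weights) / len(weights)


def sin_reg_grad(weights, bits):
    """Gradient of sin_reg with respect to each weight.

    d/dw sin^2(pi*w*L) = pi * L * sin(2*pi*w*L); the mean adds a 1/n factor.
    """
    levels = 2 ** bits - 1
    n = len(weights)
    return [math.pi * levels * math.sin(2 * math.pi * w * levels) / n for w in weights]


def descend(weights, bits, lr=0.01, steps=200):
    """Plain gradient descent on the penalty alone (no task loss)."""
    w = list(weights)
    for _ in range(steps):
        grad = sin_reg_grad(w, bits)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


if __name__ == "__main__":
    start = [0.1, 0.3, 0.6, 0.9]
    end = descend(start, bits=2)
    # With bits=2 the levels are 0, 1/3, 2/3, 1.
    print([round(w, 4) for w in end])
```

In actual quantized training such a penalty would be added to the task loss with a weighting coefficient, and the abstract states that the bitwidth and scale are learned through the period of the sinusoid; neither of those parts is reproduced here.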