Showing 1–39 of 39 results for author: Elmadany, A

  1. arXiv:2505.21979  [pdf, ps, other]

    cs.CL

    Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset

    Authors: Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, et al. (20 additional authors not shown)

    Abstract: Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across…

    Submitted 22 June, 2025; v1 submitted 28 May, 2025; originally announced May 2025.

    Comments: https://github.com/UBC-NLP/pearl

  2. arXiv:2505.18436  [pdf, ps, other]

    cs.CL

    Voice of a Continent: Mapping Africa's Speech Technology Frontier

    Authors: AbdelRahim Elmadany, Sang Yun Kwon, Hawau Olamide Toyin, Alcides Alcoba Inciarte, Hanan Aldarmaki, Muhammad Abdul-Mageed

    Abstract: Africa's rich linguistic diversity remains significantly underrepresented in speech technologies, creating barriers to digital inclusion. To alleviate this challenge, we systematically map the continent's speech space of datasets and technologies, leading to a new comprehensive benchmark SimbaBench for downstream African speech tasks. Using SimbaBench, we introduce the Simba family of models, achi…

    Submitted 19 June, 2025; v1 submitted 23 May, 2025; originally announced May 2025.

  3. arXiv:2503.00151  [pdf, other]

    cs.CL cs.AI

    Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs

    Authors: Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, et al. (19 additional authors not shown)

    Abstract: As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce our dataset, a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by…

    Submitted 28 February, 2025; originally announced March 2025.

    Comments: More information about our dataset is available at our project page: https://github.com/UBC-NLP/palm

  4. arXiv:2502.19582  [pdf, ps, other]

    cs.CL

    Where Are We? Evaluating LLM Performance on African Languages

    Authors: Ife Adebara, Hawau Olamide Toyin, Nahom Tesfu Ghebremichael, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Africa's rich linguistic heritage remains underrepresented in NLP, largely due to historical policies that favor foreign languages and create significant data inequities. In this paper, we integrate theoretical insights on Africa's language landscape with an empirical evaluation using Sahara - a comprehensive benchmark curated from large-scale, publicly accessible datasets capturing the continent'…

    Submitted 3 June, 2025; v1 submitted 26 February, 2025; originally announced February 2025.

  5. arXiv:2407.09936  [pdf, other]

    cs.CL cs.AI

    WojoodNER 2024: The Second Arabic Named Entity Recognition Shared Task

    Authors: Mustafa Jarrar, Nagham Hamad, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: We present WojoodNER-2024, the second Arabic Named Entity Recognition (NER) Shared Task. In WojoodNER-2024, we focus on fine-grained Arabic NER. We provided participants with a new Arabic fine-grained NER dataset called wojoodfine, annotated with subtypes of entities. WojoodNER-2024 encompassed three subtasks: (i) Closed-Track Flat Fine-Grained NER, (ii) Closed-Track Nested Fine-Grained NER, and (…

    Submitted 13 July, 2024; originally announced July 2024.

  6. arXiv:2407.04910  [pdf, other]

    cs.CL cs.AI

    NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, Amr Keleg, AbdelRahim Elmadany, Chiyu Zhang, Injy Hamed, Walid Magdy, Houda Bouamor, Nizar Habash

    Abstract: We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI's objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on pre-specified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Su…

    Submitted 5 July, 2024; originally announced July 2024.

    Comments: Accepted by The Second Arabic Natural Language Processing Conference

  7. arXiv:2407.04796  [pdf, other]

    cs.CL

    Toucan: Many-to-Many Translation for 150 African Language Pairs

    Authors: AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed

    Abstract: We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models…

    Submitted 12 July, 2024; v1 submitted 5 July, 2024; originally announced July 2024.

  8. arXiv:2401.01053  [pdf, other]

    cs.CL

    Cheetah: Natural Language Generation for 517 African Languages

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Low-resource African languages pose unique challenges for natural language processing (NLP) tasks, including natural language generation (NLG). In this paper, we develop Cheetah, a massively multilingual NLG language model for African languages. Cheetah supports 517 African languages and language varieties, allowing us to address the scarcity of NLG resources and provide a solution to foster lingu…

    Submitted 10 January, 2024; v1 submitted 2 January, 2024; originally announced January 2024.

  9. arXiv:2310.16153  [pdf, other]

    cs.CL

    WojoodNER 2023: The First Arabic Named Entity Recognition Shared Task

    Authors: Mustafa Jarrar, Muhammad Abdul-Mageed, Mohammed Khalilia, Bashar Talafha, AbdelRahim Elmadany, Nagham Hamad, Alaa' Omar

    Abstract: We present WojoodNER-2023, the first Arabic Named Entity Recognition (NER) Shared Task. The primary focus of WojoodNER-2023 is on Arabic NER, offering novel NER datasets (i.e., Wojood) and the definition of subtasks designed to facilitate meaningful comparisons between different NER approaches. WojoodNER-2023 encompassed two Subtasks: FlatNER and NestedNER. A total of 45 unique teams registered fo…

    Submitted 24 October, 2023; originally announced October 2023.

  10. arXiv:2310.16127  [pdf, other]

    cs.CL

    Octopus: A Multitask Model and Toolkit for Arabic Natural Language Generation

    Authors: AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Understanding Arabic text and generating human-like responses is a challenging endeavor. While many researchers have proposed models and solutions for individual problems, there is an acute shortage of a comprehensive Arabic natural language generation toolkit that is capable of handling a wide range of tasks. In this work, we present a novel Arabic text-to-text Transformer model, namely AraT5v2.…

    Submitted 24 October, 2023; originally announced October 2023.

  11. arXiv:2310.16117  [pdf, other]

    cs.CL

    NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, Chiyu Zhang, El Moatez Billah Nagoudi, Houda Bouamor, Nizar Habash

    Abstract: We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comp…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: arXiv admin note: text overlap with arXiv:2210.09582

  12. arXiv:2310.11069  [pdf, other]

    cs.CL cs.SD eess.AS

    VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System

    Authors: Abdul Waheed, Bashar Talafha, Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Arabic is a complex language with many varieties and dialects spoken by over 450 million people around the world. Due to this linguistic diversity and variation, it is challenging to build a robust and generalized ASR system for Arabic. In this work, we address this gap by developing and demoing a system, dubbed VoxArabica, for dialect identification (DID) as well as automatic speech recognition (AS…

    Submitted 27 October, 2023; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Accepted at ArabicNLP conference co-located with EMNLP'23. First three authors contributed equally

  13. arXiv:2306.03789  [pdf, other]

    eess.AS cs.CL cs.LG

    On the Robustness of Arabic Speech Dialect Identification

    Authors: Peter Sullivan, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Arabic dialect identification (ADI) tools are an important part of the large-scale data collection pipelines necessary for training speech recognition models. As these pipelines require application of ADI tools to potentially out-of-domain data, we aim to investigate how vulnerable the tools may be to this domain shift. With self-supervised learning (SSL) models as a starting point, we evaluate tr…

    Submitted 1 June, 2023; originally announced June 2023.

  14. arXiv:2305.14989  [pdf, other]

    cs.CL

    Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Ahmed El-Shangiti, Muhammad Abdul-Mageed

    Abstract: We present Dolphin, a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework dedicated to the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including dialogue generation, question answering, machine translation, summarization, among others. Dolphin comprises a substantial…

    Submitted 24 October, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

  15. arXiv:2304.11256  [pdf, other]

    cs.CL

    UBC-DLNLP at SemEval-2023 Task 12: Impact of Transfer Learning on African Sentiment Analysis

    Authors: Gagan Bhatia, Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: We describe our contribution to the SemEval 2023 AfriSenti-SemEval shared task, where we tackle the task of sentiment analysis in 14 different African languages. We develop both monolingual and multilingual models under a fully supervised setting (subtasks A and B). We also develop models for the zero-shot setting (subtask C). Our approach involves experimenting with transfer learning using six lan…

    Submitted 25 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: AfriSenti 2023 @ ACL 2023

  16. arXiv:2212.10785  [pdf, other]

    cs.CL cs.AI

    SERENGETI: Massively Multilingual Language Models for Africa

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

    Abstract: Multilingual pretrained language models (mPLMs) acquire valuable, generalizable linguistic information during pretraining and have advanced the state of the art on task-specific finetuning. To date, only ~31 out of ~2,000 African languages are covered in existing language models. We ameliorate this limitation by developing SERENGETI, a massively multilingual language model that covers 517 African…

    Submitted 26 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: To appear in Findings of ACL 2023

  17. arXiv:2212.10758  [pdf, other]

    cs.CL cs.AI

    ORCA: A Challenging Benchmark for Arabic Language Understanding

    Authors: AbdelRahim Elmadany, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed

    Abstract: Due to their crucial role in all NLP, several benchmarks have been proposed to evaluate pretrained language models. In spite of these efforts, no public benchmark of diverse nature currently exists for evaluation of Arabic. This makes it challenging to measure progress for both Arabic and multilingual language models. This challenge is compounded by the fact that any benchmark targeting Arabic nee…

    Submitted 29 May, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

    Comments: All authors contributed equally. Accepted at ACL 2023, Toronto, Canada

  18. arXiv:2212.10755  [pdf, other]

    cs.CL

    JASMINE: Arabic GPT Models for Few-Shot Learning

    Authors: El Moatez Billah Nagoudi, Muhammad Abdul-Mageed, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Md Tawkat Islam Khondaker

    Abstract: Scholarship on generative pretraining (GPT) remains acutely Anglocentric, leaving serious gaps in our understanding of the whole class of autoregressive models. For example, we have little knowledge about the potential of these models and their societal impacts in diverse linguistic and cultural settings. We alleviate this issue for Arabic, a wide collection of languages and dialectal varieties wi…

    Submitted 24 October, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

  19. arXiv:2210.12314  [pdf, other]

    cs.CL

    A Benchmark Study of Contrastive Learning for Arabic Social Meaning

    Authors: Md Tawkat Islam Khondaker, El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan

    Abstract: Contrastive learning (CL) has brought significant progress to various NLP tasks. Despite this progress, CL has not been applied to Arabic NLP to date. Nor is it clear how much benefit it could bring to particular classes of tasks such as those involved in Arabic social meaning (e.g., sentiment analysis, dialect identification, hate speech detection). In this work, we present a comprehensive benchmark…

    Submitted 21 October, 2022; originally announced October 2022.

  20. arXiv:2210.11744  [pdf, other]

    cs.CL cs.LG

    AfroLID: A Neural Language Identification Tool for African Languages

    Authors: Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

    Abstract: Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 lang…

    Submitted 6 December, 2022; v1 submitted 21 October, 2022; originally announced October 2022.

    Comments: To appear at EMNLP 2022 Main conference

  21. arXiv:2210.09582  [pdf, other]

    cs.CL

    NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, Nizar Habash

    Abstract: We describe findings of the third Nuanced Arabic Dialect Identification Shared Task (NADI 2022). NADI aims at advancing state of the art Arabic NLP, including on Arabic dialects. It does so by affording diverse datasets and modeling opportunities in a standardized context where meaningful comparisons between models and approaches are possible. NADI 2022 targeted both dialect identification (Subtas…

    Submitted 20 October, 2022; v1 submitted 18 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2103.08466

  22. arXiv:2206.03933  [pdf, other]

    cs.CL cs.AI cs.LG

    TURJUMAN: A Public Toolkit for Neural Arabic Machine Translation

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: We present TURJUMAN, a neural toolkit for translating from 20 languages into Modern Standard Arabic (MSA). TURJUMAN exploits the recently-introduced text-to-text Transformer AraT5 model, endowing it with a powerful ability to decode into Arabic. The toolkit offers the possibility of employing a number of diverse decoding methods, making it suited for acquiring paraphrases for the MSA translations…

    Submitted 27 May, 2022; originally announced June 2022.

    Comments: All authors contributed equally

    Journal ref: Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5), 2022

  23. arXiv:2109.12068  [pdf, other]

    cs.CL

    AraT5: Text-to-Text Transformers for Arabic Language Generation

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Transfer learning with a unified Transformer framework (T5) that converts all language problems into a text-to-text format was recently proposed as a simple and effective transfer learning approach. Although a multilingual version of the T5 model (mT5) was also introduced, it is not clear how well it can fare on non-English tasks involving diverse data. To investigate this question, we apply mT5 o…

    Submitted 15 March, 2022; v1 submitted 30 August, 2021; originally announced September 2021.

    Comments: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022). All authors contributed equally

  24. arXiv:2105.13573  [pdf, other]

    cs.LG cs.CL

    Investigating Code-Mixed Modern Standard Arabic-Egyptian to English Machine Translation

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed

    Abstract: Recent progress in neural machine translation (NMT) has made it possible to translate successfully between monolingual language pairs where large parallel data exist, with pre-trained models improving performance even further. Although there exists work on translating in code-mixed settings (where one of the pairs includes text from two or more languages), it is still unclear what recent success i…

    Submitted 27 May, 2021; originally announced May 2021.

    Comments: CALCS2021, colocated with NAACL-2021

  25. arXiv:2103.08466  [pdf, other]

    cs.CL

    NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Houda Bouamor, Nizar Habash

    Abstract: We present the findings and results of the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2). The shared…

    Submitted 18 April, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

    Comments: arXiv admin note: text overlap with arXiv:2010.11334

  26. arXiv:2101.01785  [pdf, other]

    cs.CL

    ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic

    Authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi

    Abstract: Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing tw…

    Submitted 7 June, 2021; v1 submitted 27 December, 2020; originally announced January 2021.

    Comments: All authors contributed equally. The order is alphabetical

    Journal ref: ACL-2021 camera ready version

  27. arXiv:2011.10970  [pdf, other]

    cs.AI

    DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings

    Authors: Muhammad Abdul-Mageed, Shady Elbassuoni, Jad Doughman, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Yorgo Zoughby, Ahmad Shaher, Iskander Gaba, Ahmed Helal, Mohammed El-Razzaz

    Abstract: Word embeddings are a core component of modern natural language processing systems, making the ability to thoroughly evaluate them a vital task. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntac…

    Submitted 12 March, 2021; v1 submitted 22 November, 2020; originally announced November 2020.

    Comments: WANLP2021

  28. arXiv:2011.03092  [pdf, other]

    cs.CL cs.LG

    Machine Generation and Detection of Arabic Manipulated and Fake News

    Authors: El Moatez Billah Nagoudi, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Tariq Alhindi, Hasan Cavusoglu

    Abstract: Fake news and deceptive machine-generated text are serious problems threatening modern societies, including in the Arab world. This motivates work on detecting false and manipulated stories online. However, a bottleneck for this research is lack of sufficient data to train detection models. We present a novel method for automatically generating Arabic manipulated (and potentially fake) news storie…

    Submitted 5 November, 2020; originally announced November 2020.

    Comments: 10 pages, accepted in The Fifth Arabic Natural Language Processing Workshop (WANLP 2020)

  29. arXiv:2010.04900  [pdf, other]

    cs.CL cs.AI

    Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Lyle Ungar

    Abstract: Although the prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties. Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (as small as that of a ci…

    Submitted 7 December, 2020; v1 submitted 10 October, 2020; originally announced October 2020.

    Comments: Accepted in EMNLP 2020

  30. Holy Tweets: Exploring the Sharing of Quran on Twitter

    Authors: Norah Abokhodair, Abdelrahim Elmadany, Walid Magdy

    Abstract: While social media offer users a platform for self-expression, identity exploration, and community management, among other functions, they also offer space for religious practice and expression. In this paper, we explore social media spaces as they subtend new forms of religious experiences and rituals. We present a mixed-method study to understand the practice of sharing Quran verses on Arabic Tw…

    Submitted 19 August, 2020; originally announced August 2020.

    Comments: Paper accepted to The 23rd ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW) 2020

  31. arXiv:2006.01266  [pdf, other]

    cs.CL

    Leveraging Affective Bidirectional Transformers for Offensive Language Detection

    Authors: AbdelRahim Elmadany, Chiyu Zhang, Muhammad Abdul-Mageed, Azadeh Hashemi

    Abstract: Social media are pervasive in our life, making it necessary to ensure safe online experiences by detecting and removing offensive and hate speech. In this work, we report our submission to the Offensive Language and hate-speech Detection shared task organized with the 4th Workshop on Open-Source Arabic Corpora and Processing Tools Arabic (OSACT4). We focus on developing purely deep learning system…

    Submitted 16 May, 2020; originally announced June 2020.

  32. arXiv:2005.06012  [pdf, other]

    cs.SI cs.CL

    Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

    Authors: Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, Rannie Lin

    Abstract: We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes as far back as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not…

    Submitted 5 February, 2021; v1 submitted 2 May, 2020; originally announced May 2020.

  33. arXiv:1911.00637  [pdf, other]

    cs.CL cs.LG

    Sentence-Level BERT and Multi-Task Learning of Age and Gender in Social Media

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, Arun Rajendran, AbdelRahim Elmadany, Michael Przystupa, Lyle Ungar

    Abstract: Social media currently provide a window on our lives, making it possible to learn how people from different places, with different backgrounds, ages, and genders use language. In this work we exploit a newly-created Arabic dataset with ground truth age and gender labels to learn these attributes both individually and in a multi-task setting at the sentence level. Our models are based on variations…

    Submitted 1 November, 2019; originally announced November 2019.

  34. arXiv:1910.14243  [pdf, other]

    cs.CL cs.LG

    DiaNet: BERT and Hierarchical Attention Multi-Task Learning of Fine-Grained Dialect

    Authors: Muhammad Abdul-Mageed, Chiyu Zhang, AbdelRahim Elmadany, Arun Rajendran, Lyle Ungar

    Abstract: Prediction of language varieties and dialects is an important language processing task, with a wide range of applications. For Arabic, the native tongue of ~300 million people, most varieties remain unsupported. To ease this bottleneck, we present a very large scale dataset covering 319 cities from all 21 Arab countries. We introduce a hierarchical attention multi-task learning (HA-MTL) approach…

    Submitted 30 October, 2019; originally announced October 2019.

  35. arXiv:1806.00522  [pdf]

    cs.CL

    Improving Dialogue Act Classification for Spontaneous Arabic Speech and Instant Messages at Utterance Level

    Authors: AbdelRahim Elmadany, Sherif Abdou, Mervat Gheith

    Abstract: The ability to model and automatically detect dialogue acts is an important step toward understanding spontaneous speech and Instant Messages. However, it has been difficult to infer a dialogue act from a surface utterance because it depends heavily on the context of the utterance and the speaker's linguistic knowledge, especially in Arabic dialects. This paper proposes a statistical dialogue analysis mod…

    Submitted 30 May, 2018; originally announced June 2018.

    Journal ref: 11th edition of the Language Resources and Evaluation Conference, 7-12 May 2018, Miyazaki (Japan)

  36. Towards Understanding Egyptian Arabic Dialogues

    Authors: Abdelrahim A Elmadany, Sherif M Abdou, Mervat Gheith

    Abstract: Labelling a user's utterances to understand their intents, known as Dialogue Act (DA) classification, is considered the key player for the dialogue language understanding layer in automatic dialogue systems. In this paper, we propose a novel approach to labeling user utterances for Egyptian spontaneous dialogues and Instant Messages using a Machine Learning (ML) approach without relying on an…

    Submitted 13 July, 2015; originally announced September 2015.

    Comments: arXiv admin note: substantial text overlap with arXiv:1505.03081

    Journal ref: International Journal of Computer Applications 120(22), pp. 7-12, June 2015

  37. arXiv:1505.04197  [pdf]

    cs.CL

    Arabic Inquiry-Answer Dialogue Acts Annotation Schema

    Authors: AbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith

    Abstract: We present an annotation schema as part of an effort to create a manually annotated corpus for Arabic dialogue language understanding, including spoken dialogue and written "chat" dialogue for the inquiry-answer domain. The proposed schema mainly handles the request and response acts that occur frequently in inquiry-answer debate conversations expressing service requests, suggestions, and offers. We appl…

    Submitted 15 May, 2015; originally announced May 2015.

    Comments: IOSR Journal of Engineering (IOSRJEN), Vol. 04, Issue 12 (December 2014), V2. arXiv admin note: text overlap with arXiv:1505.03084

  38. A Survey of Arabic Dialogues Understanding for Spontaneous Dialogues and Instant Message

    Authors: AbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith

    Abstract: Building dialogue systems has recently gained considerable attention, but most of the resources and systems built so far are tailored to English and other Indo-European languages. The need for designing systems for other languages, such as Arabic, is increasing. For this reason, there is more interest in the Arabic dialogue act classification task because it is a key player in Ara…

    Submitted 12 May, 2015; originally announced May 2015.

    Journal ref: International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 2, April 2015

  39. Turn Segmentation into Utterances for Arabic Spontaneous Dialogues and Instance Messages

    Authors: AbdelRahim A. Elmadany, Sherif M. Abdou, Mervat Gheith

    Abstract: Text segmentation is an essential processing task for many Natural Language Processing (NLP) applications such as text summarization, text translation, and dialogue language understanding, among others. Turn segmentation is considered the key player in the dialogue understanding task for building automatic Human-Computer systems. In this paper, we introduce a novel approach to turn segmentation into utterances f…

    Submitted 12 May, 2015; originally announced May 2015.

    Journal ref: International Journal on Natural Language Computing (IJNLC) Vol. 4, No. 2, April 2015