DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings
Authors:
Muhammad Abdul-Mageed,
Shady Elbassuoni,
Jad Doughman,
AbdelRahim Elmadany,
El Moatez Billah Nagoudi,
Yorgo Zoughby,
Ahmad Shaher,
Iskander Gaba,
Ahmed Helal,
Mohammed El-Razzaz
Abstract:
Word embeddings are a core component of modern natural language processing systems, which makes thorough evaluation of them a vital task. We describe DiaLex, a benchmark for intrinsic evaluation of dialectal Arabic word embeddings. DiaLex covers five important Arabic dialects: Algerian, Egyptian, Lebanese, Syrian, and Tunisian. Across these dialects, DiaLex provides a testbank for six syntactic and semantic relations, namely male to female, singular to dual, singular to plural, antonym, comparative, and genitive to past tense. DiaLex thus consists of a collection of word pairs representing each of the six relations in each of the five dialects. To demonstrate the utility of DiaLex, we use it to evaluate a set of existing Arabic word embeddings as well as new ones that we developed. Our benchmark, evaluation code, and new word embedding models will be publicly available.
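To make the idea of intrinsic evaluation with relation word pairs concrete, the following is a minimal sketch and not the DiaLex evaluation code. It assumes the common analogy setup (a is to b as c is to ?), answered with the vector-offset method over cosine similarity. The function names `build_analogy_questions` and `top_k_accuracy` are hypothetical, and the English words and four-dimensional vectors are toy placeholders, not DiaLex entries or real embeddings.

```python
from itertools import permutations

import numpy as np


def build_analogy_questions(pairs):
    """Turn relation word pairs into analogy questions (a, b, c, d).

    Every ordered combination of two distinct pairs yields one question:
    "a is to b as c is to d".
    """
    return [(a, b, c, d) for (a, b), (c, d) in permutations(pairs, 2)]


def top_k_accuracy(questions, vectors, k=1):
    """Fraction of questions whose answer d is among the k nearest words
    to (b - a + c), ranked by cosine similarity.

    Questions containing an out-of-vocabulary word are skipped.
    """
    vocab = list(vectors)
    index = {word: i for i, word in enumerate(vocab)}
    matrix = np.stack([np.asarray(vectors[w], dtype=float) for w in vocab])
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows

    hits, total = 0, 0
    for a, b, c, d in questions:
        if any(w not in index for w in (a, b, c, d)):
            continue
        query = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
        scores = matrix @ query
        for w in (a, b, c):  # the question words cannot be the answer
            scores[index[w]] = -np.inf
        best = np.argsort(-scores)[:k]
        hits += int(index[d] in best)
        total += 1
    return hits / total if total else 0.0


# Toy data (assumption): three "male to female" pairs; the last vector
# dimension encodes the relation, the first three encode the concept.
toy_pairs = [("king", "queen"), ("man", "woman"), ("actor", "actress")]
toy_vectors = {
    "king": [1, 0, 0, 0], "queen": [1, 0, 0, 1],
    "man": [0, 1, 0, 0], "woman": [0, 1, 0, 1],
    "actor": [0, 0, 1, 0], "actress": [0, 0, 1, 1],
}

questions = build_analogy_questions(toy_pairs)
accuracy = top_k_accuracy(questions, toy_vectors, k=1)
print(f"{len(questions)} questions, top-1 accuracy = {accuracy:.2f}")
```

Under this setup, a benchmark of this kind would be scored once per relation and per dialect, so that weaknesses of an embedding model on a particular relation or dialect stay visible rather than being averaged away.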
Submitted 12 March, 2021; v1 submitted 22 November, 2020; originally announced November 2020.