-
Extracting Arabic Relations from the Web
Authors:
Shimaa M. Abd El-salam,
Enas M. F. El Houby,
A. K. Al Sammak,
T. A. El-Shishtawy
Abstract:
The goal of this research is to extract a large list or table of named entities and relations in a specific domain. The user supplies a small set of seed relation instances as input. The system uses summaries from the Google search engine as source text. The seed instances are used to extract patterns, and the output is a set of new entities and their relations. The results from four experiments show that precision and recall vary according to relation type. Precision ranges from 0.61 to 0.75, while recall ranges from 0.71 to 0.83. The best result is obtained for the (player, club) relationship, with precision and recall of 0.72 and 0.83 respectively.
Submitted 8 March, 2016;
originally announced March 2016.
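As a rough illustration of the seed-and-pattern idea described in the abstract of "Extracting Arabic Relations from the Web", the Python sketch below learns the text found between a seed pair and reuses it to find new pairs. It is not the authors' system: the function names (`learn_patterns`, `apply_patterns`), the English sample summaries, and the restriction to single-word entities are all assumptions made for illustration.

```python
import re


def learn_patterns(seeds, summaries):
    """Collect the text found between each seed pair (hypothetical helper)."""
    patterns = set()
    for first, second in seeds:
        for text in summaries:
            match = re.search(re.escape(first) + r"(.+?)" + re.escape(second), text)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, summaries):
    """Find every (word, word) pair joined by a learned pattern.

    Assumption: entities are single words; real named entities often are not.
    """
    found = set()
    for pattern in patterns:
        regex = re.compile(r"(\w+)" + re.escape(pattern) + r"(\w+)")
        for text in summaries:
            for match in regex.finditer(text):
                found.add((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    # Made-up sample data, not taken from the paper.
    sample = [
        "Salah plays for Liverpool",
        "Messi plays for Barcelona",
        "Ronaldo joined Juventus",
    ]
    seeds = [("Salah", "Liverpool")]
    learned = learn_patterns(seeds, sample)
    print(sorted(apply_patterns(learned, sample) - set(seeds)))
```

A fuller system would also need to score patterns and filter noisy matches before accepting new pairs; the sketch omits both.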
-
Keyphrase Based Evaluation of Automatic Text Summarization
Authors:
Fatma Elghannam,
Tarek El-Shishtawy
Abstract:
A major challenge in automatic summary evaluation systems that use fixed n-gram matching is handling the informative content of the text units in the matching process. This limitation causes inaccurate matching between units in peer and reference summaries. The present study introduces a new keyphrase-based summary evaluator, KpEval, for evaluating automatic summaries. KpEval relies on keyphrases because they convey the most important concepts of a text. In the evaluation process, the keyphrases are used in their lemma form as the matching text unit. The system was applied to evaluate different summaries of the Arabic multi-document data set presented at TAC2011. The results show that the new evaluation technique correlates well with the known evaluation systems Rouge1, Rouge2, RougeSU4, and AutoSummENG MeMoG. KpEval has the strongest correlation with AutoSummENG MeMoG: the Pearson and Spearman correlation coefficients are 0.8840 and 0.9667 respectively.
Submitted 22 May, 2015;
originally announced May 2015.
-
The Best Templates Match Technique For Example Based Machine Translation
Authors:
T. El-Shishtawy,
A. El-Sammak
Abstract:
It has been shown that large-scale, realistic Knowledge Based Machine Translation applications require the acquisition of a huge amount of knowledge about language and about the world. This knowledge is encoded in computational grammars, lexicons, and domain models. Another approach, which avoids the need for collecting and analyzing massive knowledge, is the Example Based approach, which is the topic of this paper. We show that using the Example Based approach in its native form is not suitable for translating into Arabic. Therefore, a modification to the basic approach is presented to improve the accuracy of the translation process. The basic idea of the new approach is to improve the technique by which template-based approaches select the appropriate templates.
Submitted 4 June, 2014;
originally announced June 2014.
-
A Mobile Management System for Reforming Subsidies Distribution in Developing Countries
Authors:
T. El-Shishtawy
Abstract:
This paper aims to show how advances in mobile technologies can help address the social and political aspects involved in the reform of subsidies in developing countries. It describes the work done to build a mobile-based supportive network that integrates all subsidy partners: governmental and non-governmental organizations, merchants, and beneficiaries. One main contribution of this work is a framework for identifying the requirements of subsidy distribution information systems. In the proposed approach, seven domains were identified to build a Mobile Subsidizing Business Model (MSBM). Based on MSBM, detailed requirements were gathered in three stages, each with its appropriate methodology. In this work, we focus on the layered architecture implementation of the subsidizing mobile system to break down the complexities that arise from variations in mobile technologies, different business rules, and multiple distribution scenarios.
Submitted 30 April, 2014;
originally announced May 2014.
-
A Lemma Based Evaluator for Semitic Language Text Summarization Systems
Authors:
Tarek El-Shishtawy,
Fatma El-Ghannam
Abstract:
Matching texts in highly inflected languages such as Arabic with a simple stemming strategy is unlikely to perform well. In this paper, we present an automatic text matching technique for inflectional languages, using Arabic as the test case. The system is an extension of the ROUGE test in which texts are matched at the level of each token's lemma. The experimental results show improved detection of similarities between different sentences that have the same semantics but are written in different lexical forms.
Submitted 21 March, 2014;
originally announced March 2014.
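The lemma-level matching described in the abstract of "A Lemma Based Evaluator for Semitic Language Text Summarization Systems" can be pictured with the minimal Python sketch below. It computes only the standard ROUGE-1 recall (clipped unigram overlap divided by the number of reference unigrams), once on surface forms and once on lemmas. The toy English lemma table `TOY_LEMMAS` and the helper `toy_lemma` are hypothetical stand-ins for a real Arabic lemmatizer; this is not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy lemma table; a real system would call an Arabic lemmatizer.
TOY_LEMMAS = {"writes": "write", "wrote": "write", "books": "book"}


def toy_lemma(token):
    """Map a token to its lemma, falling back to the token itself."""
    return TOY_LEMMAS.get(token, token)


def rouge1_recall(peer_tokens, ref_tokens, normalize=lambda t: t):
    """ROUGE-1 recall after applying `normalize` to every token."""
    peer = Counter(normalize(t) for t in peer_tokens)
    ref = Counter(normalize(t) for t in ref_tokens)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each reference unigram count by its count in the peer summary.
    overlap = sum(min(count, peer[unit]) for unit, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    peer = "she writes books".split()
    ref = "she wrote a book".split()
    print(rouge1_recall(peer, ref))             # surface-form matching
    print(rouge1_recall(peer, ref, toy_lemma))  # lemma-level matching
```

In this made-up pair, only "she" matches on surface forms, whereas lemma-level matching also aligns "writes" with "wrote" and "books" with "book".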
-
Multi-Topic Multi-Document Summarizer
Authors:
Fatma El-Ghannam,
Tarek El-Shishtawy
Abstract:
Current multi-document summarization systems can successfully extract summary sentences, but with many limitations, including low coverage, inaccurate extraction of important sentences, redundancy, and poor coherence among the selected sentences. The present study introduces a new concept of the centroid approach and reports two new techniques for extracting summary sentences from multiple documents. In both techniques, keyphrases are used to weigh sentences and documents. The first summarization technique (Sen-Rich) prefers sentences of maximum richness, while the second (Doc-Rich) prefers sentences from the centroid document. To demonstrate the application of the new summarization system to Arabic documents, we performed two experiments. First, we applied the Rouge measure to compare the new techniques with systems presented at TAC2011. The results show that Sen-Rich outperformed all systems in ROUGE-S. Second, the system was applied to summarize multi-topic documents. Using human evaluators, the results show that Doc-Rich is superior, with summary sentences characterized by extra coverage and more cohesion.
Submitted 3 January, 2014;
originally announced January 2014.
-
A Hybrid Algorithm for Matching Arabic Names
Authors:
T. El-Shishtawy
Abstract:
In this paper, a new hybrid algorithm that combines token-based and character-based approaches is presented. The basic Levenshtein approach is extended to a token-based distance metric. The distance metric is enhanced to set the proper granularity level of the algorithm's behavior. It smoothly maps a threshold of misspelling differences at the character level and the importance of token-level errors in terms of the token's position and frequency. Using a large Arabic dataset, the experimental results show that the proposed algorithm successfully overcomes many types of errors, such as typographical errors, omission or insertion of middle name components, omission of non-significant popular name components, and character variations across writing styles. When the results were compared with those of other classical algorithms on the same dataset, the proposed algorithm was found to increase the minimum success level of the best tested algorithms while achieving higher upper limits.
Submitted 22 September, 2013;
originally announced September 2013.
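One simple way to picture the combination of character-level and token-level Levenshtein distance mentioned in the abstract of "A Hybrid Algorithm for Matching Arabic Names" is the Python sketch below. It is an assumption-laden illustration, not the published algorithm: the function names, the transliterated sample names, and the threshold value 0.34 are all hypothetical, and the sketch ignores the token position and frequency weighting that the abstract describes.

```python
def char_distance(a, b):
    """Classic Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def token_cost(t1, t2, threshold=0.34):
    """Treat two tokens as equal (cost 0) when their character distance,
    relative to the longer token, is within `threshold` (an arbitrary value)."""
    longest = max(len(t1), len(t2))
    if longest == 0:
        return 0.0
    return 0.0 if char_distance(t1, t2) / longest <= threshold else 1.0


def name_distance(name1, name2, threshold=0.34):
    """Levenshtein distance over whitespace-separated tokens, where the
    substitution cost comes from the character-level comparison."""
    a, b = name1.split(), name2.split()
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ta in enumerate(a, start=1):
        curr = [float(i)]
        for j, tb in enumerate(b, start=1):
            substitute = prev[j - 1] + token_cost(ta, tb, threshold)
            curr.append(min(prev[j] + 1.0, curr[j - 1] + 1.0, substitute))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # A spelling variant inside one token versus a dropped middle name.
    print(name_distance("mohamad hassan", "mohamed hassan"))
    print(name_distance("mohamed ali hassan", "mohamed hassan"))
```

Under these assumptions, a one-character spelling variant inside a token adds nothing to the distance, while a missing middle name counts as a single token-level deletion.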
-
Keyphrase Based Arabic Summarizer (KPAS)
Authors:
Tarek El-Shishtawy,
Fatma El-Ghannam
Abstract:
This paper describes a computationally inexpensive and efficient generic summarization algorithm for Arabic texts. The algorithm belongs to the extractive summarization family, which reduces the problem to the sub-problems of identifying and extracting representative sentences. Important keyphrases of the document to be summarized are identified using combinations of statistical and linguistic features. The sentence extraction algorithm exploits keyphrases as the primary attributes for ranking a sentence. The present experimental work demonstrates different techniques for achieving various summarization goals, including informative richness, coverage of both main and auxiliary topics, and keeping redundancy to a minimum. A scoring scheme is then adopted that balances these summarization goals. To evaluate the resulting Arabic summaries against well-established systems, aligned English/Arabic texts are used throughout the experiments.
Submitted 23 June, 2012;
originally announced June 2012.
-
Arabic Keyphrase Extraction using Linguistic knowledge and Machine Learning Techniques
Authors:
Tarek El-shishtawy,
Abdulwahab Al-sammak
Abstract:
In this paper, a supervised learning technique for extracting keyphrases from Arabic documents is presented. The extractor is supplied with linguistic knowledge to enhance its efficiency instead of relying only on statistical information such as term frequency and distance. During analysis, an annotated Arabic corpus is used to extract the required lexical features of the document words. The knowledge also includes syntactic rules, based on part-of-speech tags and allowed word sequences, to extract the candidate keyphrases. In this work, the abstract form of Arabic words is used instead of the stem form to represent the candidate terms. The abstract form hides most of the inflections found in Arabic words. The paper introduces new features of keyphrases, based on linguistic knowledge, to capture titles and subtitles of a document. A simple ANOVA test is used to evaluate the validity of the selected features. Then, the learning model is built using Linear Discriminant Analysis (LDA) and training documents. Although the presented system is trained on documents in the IT domain, the experiments show that it performs significantly better than existing Arabic extractor systems: precision and recall values reach double their corresponding values in the other systems, especially for lengthy and non-scientific articles.
Submitted 20 March, 2012;
originally announced March 2012.
-
An Accurate Arabic Root-Based Lemmatizer for Information Retrieval Purposes
Authors:
Tarek El-Shishtawy,
Fatma El-Ghannam
Abstract:
Despite its robust syntax, semantic cohesion, and lower ambiguity, lemma-level analysis and generation has not yet received attention in the Arabic NLP literature. In the current research, we propose the first non-statistical accurate Arabic lemmatizer algorithm that is suitable for information retrieval (IR) systems. The proposed lemmatizer makes use of different Arabic language knowledge resources to generate an accurate lemma form and its relevant features that support IR purposes. As a POS tagger, the experimental results show that the proposed algorithm achieves a maximum accuracy of 94.8%. For first-seen documents, an accuracy of 89.15% is achieved, compared to 76.7% for the up-to-date Stanford accurate Arabic model on the same dataset.
Submitted 15 March, 2012;
originally announced March 2012.