-
A comprehensive study on Frequent Pattern Mining and Clustering categories for topic detection in Persian text stream
Authors:
Elnaz Zafarani-Moattar,
Mohammad Reza Kangavari,
Amir Masoud Rahmani
Abstract:
Topic detection is a complex, language-dependent process because it requires analyzing text. There have been few studies on topic detection in Persian, and the existing algorithms are not remarkable. Therefore, we aimed to study topic detection in Persian. The objectives of this study are: 1) to conduct an extensive study on the best algorithms for topic detection, 2) to identify the adaptations necessary to make these algorithms suitable for the Persian language, and 3) to evaluate their performance on Persian social network texts. To achieve these objectives, we have formulated two research questions: First, considering the lack of research in Persian, what modifications should be made to existing frameworks, especially those developed for English, to make them compatible with Persian? Second, how do these algorithms perform, and which one is superior? Topic detection methods fall into several categories. Frequent pattern mining and clustering are selected for this research, and a hybrid of both is proposed as a new category. Then, ten methods from these three categories are selected. All of them are re-implemented from scratch, modified, and adapted to Persian. These ten methods encompass different types of topic detection methods and have shown good performance in English. The text of Persian social network posts is used as the dataset. Additionally, a new multiclass evaluation criterion, called FS, is used in this paper for the first time in the field of topic detection. Approximately 1.4 billion tokens are processed during the experiments. The results indicate that if we are searching for keyword-topics that are easily understandable by humans, the hybrid category is better. However, if the aim is to cluster posts for further analysis, the frequent pattern category is more suitable.
Submitted 15 March, 2024;
originally announced March 2024.
-
A Comparative Study on Transfer Learning and Distance Metrics in Semantic Clustering over the COVID-19 Tweets
Authors:
Elnaz Zafarani-Moattar,
Mohammad Reza Kangavari,
Amir Masoud Rahmani
Abstract:
This paper is a comparative study in the context of topic detection on COVID-19 data. Among the various approaches to topic detection, the clustering approach is selected in this paper. Clustering requires a distance measure, and calculating distance requires an embedding. The aim of this research is to study three factors simultaneously, along with their interactions: embedding methods, distance metrics, and clustering methods. A dataset of one month of tweets collected with COVID-19-related hashtags is used for this study. Five embedding methods, ranging from earlier to recent ones, are selected: Word2Vec, fastText, GloVe, BERT and T5. Five clustering methods are investigated: k-means, DBSCAN, OPTICS, spectral and Jarvis-Patrick. Euclidean distance and cosine distance, the most important distance metrics in this field, are also examined. First, more than 7,500 tests are performed to tune the parameters. Then, all combinations of embedding methods, distance metrics and clustering methods are evaluated using the silhouette metric. There are 50 such combinations. First, the results of these 50 tests are examined. Then, the rank of each method is taken into account across all the tests involving that method. Finally, the major variables of the research (embedding methods, distance metrics and clustering methods) are studied separately. Averaging is performed over the control variables to neutralize their effect. The experimental results show that T5 strongly outperforms the other embedding methods in terms of the silhouette metric. Among the distance metrics, cosine distance is slightly better. Among the clustering methods, DBSCAN is superior to the others.
Submitted 16 November, 2021;
originally announced November 2021.
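The experimental grid described in the abstract above (5 embeddings × 2 distance metrics × 5 clustering methods = 50 combinations, followed by averaging over the control variables) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names `all_combinations` and `marginal_means` are hypothetical, and the scores passed in are assumed to be silhouette values computed elsewhere.

```python
from itertools import product
from statistics import mean

# Factor levels as named in the abstract.
EMBEDDINGS = ["Word2Vec", "fastText", "GloVe", "BERT", "T5"]
DISTANCES = ["euclidean", "cosine"]
CLUSTERERS = ["k-means", "DBSCAN", "OPTICS", "spectral", "Jarvis-Patrick"]


def all_combinations():
    """Return every (embedding, distance, clusterer) triple in the grid."""
    return list(product(EMBEDDINGS, DISTANCES, CLUSTERERS))


def marginal_means(scores, factor_index):
    """Average scores over the other two factors for each level of one factor.

    scores maps (embedding, distance, clusterer) -> silhouette score.
    factor_index is 0 (embedding), 1 (distance) or 2 (clusterer).
    """
    grouped = {}
    for combo, score in scores.items():
        grouped.setdefault(combo[factor_index], []).append(score)
    return {level: mean(values) for level, values in grouped.items()}


if __name__ == "__main__":
    combos = all_combinations()
    print(len(combos))  # size of the full grid
    # Placeholder scores only (hypothetical, NOT results from the paper).
    placeholder = {combo: 0.0 for combo in combos}
    print(marginal_means(placeholder, factor_index=1))
```

Grouping by one factor and averaging over the rest is one simple way to read "averaging over the control variables"; the paper may use a different aggregation, such as rank-based summaries.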
-
Phraseformer: Multimodal Key-phrase Extraction using Transformer and Graph Embedding
Authors:
Narjes Nikzad-Khasmakhi,
Mohammad-Reza Feizi-Derakhshi,
Meysam Asgari-Chenaghlu,
Mohammad-Ali Balafar,
Ali-Reza Feizi-Derakhshi,
Taymaz Rahkar-Farshi,
Majid Ramezani,
Zoleikha Jahanbakhsh-Nagadeh,
Elnaz Zafarani-Moattar,
Mehrdad Ranjbar-Khadivi
Abstract:
Background: Keyword extraction is a popular research topic in the field of natural language processing. Keywords are terms that describe the most relevant information in a document. The main problem that researchers face is how to extract the core keywords from a document efficiently and accurately. Although previous keyword extraction approaches have used text and graph features, there is a lack of models that can properly learn and combine these features.
Methods: In this paper, we develop a multimodal key-phrase extraction approach, namely Phraseformer, using transformer and graph embedding techniques. In Phraseformer, each keyword candidate is represented by a vector that is the concatenation of its text and structure learning representations. Phraseformer takes advantage of recent work such as BERT and ExEm to preserve both representations. Phraseformer also treats the key-phrase extraction task as a sequence labeling problem solved as a classification task.
Results: We analyze the performance of Phraseformer on three datasets, Inspec, SemEval2010 and SemEval 2017, using F1-score. We also investigate the performance of different classifiers with the Phraseformer method on the Inspec dataset. Experimental results demonstrate the effectiveness of the Phraseformer method on the three datasets used. Additionally, the Random Forest classifier gains the highest F1-score among all classifiers.
Conclusions: Because the combination of BERT and ExEm is more meaningful and can better represent the semantics of words, Phraseformer significantly outperforms single-modality methods.
Submitted 9 June, 2021;
originally announced June 2021.
-
Automatic Personality Prediction; an Enhanced Method Using Ensemble Modeling
Authors:
Majid Ramezani,
Mohammad-Reza Feizi-Derakhshi,
Mohammad-Ali Balafar,
Meysam Asgari-Chenaghlu,
Ali-Reza Feizi-Derakhshi,
Narjes Nikzad-Khasmakhi,
Mehrdad Ranjbar-Khadivi,
Zoleikha Jahanbakhsh-Nagadeh,
Elnaz Zafarani-Moattar,
Taymaz Rahkar-Farshi
Abstract:
Human personality is significantly reflected in the words a person uses in speech or writing. As a consequence of the spread of information infrastructures (specifically the Internet and social media), human communication has shifted notably away from face-to-face communication. Generally, Automatic Personality Prediction (or Perception) (APP) is the automated forecasting of personality from different types of human-generated or exchanged content (such as text, speech, image, and video). The major objective of this study is to enhance the accuracy of APP from text. To this end, we suggest five new APP methods: term frequency vector-based, ontology-based, enriched ontology-based, latent semantic analysis (LSA)-based, and deep learning-based (BiLSTM) methods. These base methods complement each other to enhance APP accuracy through ensemble modeling (stacking), with a hierarchical attention network (HAN) as the meta-model. The results show that ensemble modeling enhances the accuracy of APP.
Submitted 8 June, 2022; v1 submitted 9 July, 2020;
originally announced July 2020.
-
A Model to Measure the Spread Power of Rumors
Authors:
Zoleikha Jahanbakhsh-Nagadeh,
Mohammad-Reza Feizi-Derakhshi,
Majid Ramezani,
Taymaz Akan,
Meysam Asgari-Chenaghlu,
Narjes Nikzad-Khasmakhi,
Ali-Reza Feizi-Derakhshi,
Mehrdad Ranjbar-Khadivi,
Elnaz Zafarani-Moattar,
Mohammad-Ali Balafar
Abstract:
With technologies that have democratized the production and reproduction of information, a significant portion of the posts exchanged daily on social media has been infected by rumors. Despite the extensive research on rumor detection and verification, the problem of calculating the spread power of rumors has not been considered so far. To address this research gap, the present study seeks a model to calculate the Spread Power of Rumor (SPR) as a function of content-based features in two categories: False Rumor (FR) and True Rumor (TR). For this purpose, the theory of Allport and Postman is adopted, which claims that importance and ambiguity are the key variables in rumor-mongering and the power of a rumor. In total, 42 content features in two categories, "importance" (28 features) and "ambiguity" (14 features), are introduced to compute SPR. The proposed model is evaluated on two datasets, Twitter and Telegram. The results showed that (i) the spread power of False Rumor documents is rarely greater than that of True Rumors; (ii) there is a significant difference between the SPR means of the two groups, False Rumor and True Rumor; and (iii) SPR as a criterion can have a positive impact on distinguishing False Rumors from True Rumors.
Submitted 17 June, 2022; v1 submitted 18 February, 2020;
originally announced February 2020.