-
Probing the statistical properties of enriched co-occurrence networks
Authors:
Diego R. Amancio,
Jeaneth Machicao,
Laura V. C. Quispe
Abstract:
Recent studies have explored the addition of virtual edges to word co-occurrence networks using word embeddings to enhance graph representations, particularly for short texts. While these enriched networks have demonstrated some success, the impact of incorporating semantic edges into traditional co-occurrence networks remains uncertain. This study investigates two key statistical properties of text-based network models. First, we assess whether network metrics can effectively distinguish between meaningless and meaningful texts. Second, we analyze whether these metrics are more sensitive to syntactic or semantic aspects of the text. Our results show that incorporating virtual edges can have positive and negative effects, depending on the specific network metric. For instance, the informativeness of the average shortest path and closeness centrality improves in short texts, while the clustering coefficient's informativeness decreases as more virtual edges are added. Additionally, we found that including stopwords affects the statistical properties of enriched networks. Our results can serve as a guideline for determining which network metrics are most appropriate for specific applications, depending on the typical text size and the nature of the problem.
Submitted 3 December, 2024;
originally announced December 2024.
-
Analyzing the relationship between text features and research proposal productivity
Authors:
Jorge A. V. Tohalino,
Laura V. C. Quispe,
Diego R. Amancio
Abstract:
Predicting the output of research grants is of considerable relevance to research funding bodies, scientific entities and government agencies. In this study, we investigate whether text features extracted from project titles and abstracts are able to identify productive grants. Our analysis was conducted in three distinct areas, namely Medicine, Dentistry and Veterinary Medicine. Topical and complexity text features were used to identify predictors of productivity. The results indicate that there is a statistically significant relationship between text features and grant productivity; however, this dependence is weak. A feature relevance analysis revealed that the abstract text length and metrics derived from lexical diversity are among the most discriminative features. We also found that the prediction accuracy depends on the project language considered and that topical features are more discriminative than text complexity measurements. Our findings suggest that text features should be used in combination with other features to assist the identification of relevant research ideas.
Submitted 26 December, 2020; v1 submitted 17 May, 2020;
originally announced May 2020.
-
Using word embeddings to improve the discriminability of co-occurrence text networks
Authors:
Laura V. C. Quispe,
Jorge A. V. Tohalino,
Diego R. Amancio
Abstract:
Word co-occurrence networks have been employed to analyze texts in both practical and theoretical scenarios. Despite their relative success in several applications, traditional co-occurrence networks fail to establish links between similar words whenever they appear distant in the text. Here we investigate whether the use of word embeddings as a tool to create virtual links in co-occurrence networks may improve the quality of classification systems. Our results revealed that the discriminability in the stylometry task is improved when using GloVe, Word2Vec and FastText. In addition, we found that optimized results are obtained when stopwords are not disregarded and a simple global thresholding strategy is used to establish virtual links. Because the proposed approach is able to improve the representation of texts as complex networks, we believe that it could be extended to study other natural language processing tasks. Likewise, theoretical language studies could benefit from the adopted enriched representation of word co-occurrence networks.
Submitted 13 March, 2020;
originally announced March 2020.
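The idea of enriching a co-occurrence network with virtual links chosen by a global threshold can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy three-dimensional vectors, the adjacency window of 1 and the threshold of 0.95 are all assumptions made for illustration, and cosine similarity is assumed as the similarity measure. In practice the vectors would come from a pretrained model such as GloVe, Word2Vec or FastText.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens, window=1):
    """Link each word to the words that follow it within `window` positions."""
    edges = set()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            if word != tokens[j]:
                edges.add(frozenset((word, tokens[j])))
    return edges


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def virtual_edges(words, embeddings, existing, threshold):
    """Add a virtual edge between every pair of unlinked words whose
    embedding similarity reaches one global threshold."""
    added = set()
    for a, b in combinations(sorted(words), 2):
        if a in embeddings and b in embeddings:
            edge = frozenset((a, b))
            if edge not in existing and cosine(embeddings[a], embeddings[b]) >= threshold:
                added.add(edge)
    return added


# Stopwords are kept in the token list; words without a vector simply
# receive no virtual edges.
tokens = "the cat sat near the dog while the kitten slept".split()

# Hypothetical toy vectors, used only to make the example self-contained.
toy_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "dog": [0.1, 0.9, 0.1],
    "sat": [0.0, 0.1, 0.9],
    "slept": [0.1, 0.2, 0.85],
}

real = cooccurrence_edges(tokens)
virtual = virtual_edges(set(tokens), toy_embeddings, real, threshold=0.95)
enriched = real | virtual

for edge in sorted(tuple(sorted(e)) for e in virtual):
    print("virtual edge:", edge)
```

With these toy vectors, "cat" and "kitten" never appear next to each other in the text, yet they become connected in the enriched network because their vectors are nearly parallel. Raising or lowering the single threshold controls how many virtual edges are added.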