-
A Clustering-Based Combinatorial Approach to Unsupervised Matching of Product Titles
Authors:
Leonidas Akritidis,
Athanasios Fevgas,
Panayiotis Bozanis,
Christos Makris
Abstract:
The constant growth of the e-commerce industry has rendered the problem of product retrieval particularly important. As more enterprises move their activities to the Web, the volume and the diversity of the product-related information increase quickly. These factors make it difficult for users to identify and compare the features of their desired products. Recent studies showed that the standard similarity metrics cannot effectively identify identical products, since similar titles often refer to different products and vice versa. Other studies employed external data sources (search engines) to enrich the titles; these solutions are rather impractical, mainly because fetching the external data is slow. In this paper we introduce UPM, an unsupervised algorithm for matching products by their titles. UPM is independent of any external sources, since it analyzes the titles and extracts combinations of words out of them. These combinations are evaluated according to several criteria, and the most appropriate of them constitutes the cluster into which a product is classified. UPM is also parameter-free, avoids pairwise product comparisons, and includes a post-processing verification stage which corrects erroneous matches. The experimental evaluation of UPM demonstrated its superiority over the state-of-the-art approaches in terms of both efficiency and effectiveness.
Submitted 6 March, 2019;
originally announced March 2019.
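The word-combination idea described in the UPM abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not state the evaluation criteria, so the score used here (frequency across titles times combination length) and the names `title_combinations`, `cluster_by_best_combination`, and `max_size` are hypothetical, chosen only to show how titles can be grouped without pairwise comparisons.

```python
from collections import Counter
from itertools import combinations


def title_combinations(title, max_size=3):
    """Return all word combinations of up to max_size distinct tokens."""
    tokens = sorted(set(title.lower().split()))
    combos = []
    for k in range(1, max_size + 1):
        combos.extend(combinations(tokens, k))
    return combos


def cluster_by_best_combination(titles, max_size=3):
    """Assign each title to the cluster named by its best-scoring combination.

    The score is an assumption made for illustration, not the UPM criteria:
    (number of titles containing the combination) * (combination length).
    """
    counts = Counter()
    per_title = []
    for title in titles:
        combos = title_combinations(title, max_size)
        per_title.append(combos)
        counts.update(combos)

    clusters = {}
    for title, combos in zip(titles, per_title):
        if not combos:  # skip empty titles
            continue
        # Ties are broken by tuple ordering so the result is deterministic.
        best = max(combos, key=lambda c: (counts[c] * len(c), c))
        clusters.setdefault(best, []).append(title)
    return clusters


titles = [
    "Apple iPhone 8 64GB",
    "iphone 8 apple 64gb black",
    "Samsung Galaxy S9",
]
clusters = cluster_by_best_combination(titles)
```

Each title is processed once and placed under a single combination key, so titles that share a frequent combination end up in the same cluster without ever being compared to each other directly.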
-
On converting community detection algorithms for fuzzy graphs in Neo4j
Authors:
Georgios Drakopoulos,
Andreas Kanavos,
Christos Makris,
Vasileios Megalooikonomou
Abstract:
An essential feature of large scale-free graphs, such as the Web, protein-to-protein interaction, brain connectivity, and social media graphs, is that they tend to form recursive communities. The latter are densely connected vertex clusters exhibiting quick local information dissemination and processing. Under the fuzzy graph model, vertices are fixed while each edge exists with a given probability according to a membership function. This paper presents Fuzzy Walktrap and Fuzzy Newman-Girvan, fuzzy versions of two established community discovery algorithms. The proposed algorithms have been applied, with different termination criteria, to a synthetic graph generated by the Kronecker model, and the results are discussed.
Submitted 22 February, 2017; v1 submitted 7 August, 2016;
originally announced August 2016.
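The fuzzy graph model mentioned in the abstract above (fixed vertices, each edge present with a probability given by a membership function) can be sketched as follows. This is only an illustration of the model as stated in the abstract, not the paper's Neo4j implementation; the function name `sample_fuzzy_graph` and the `membership` callback are hypothetical.

```python
import random


def sample_fuzzy_graph(vertices, membership, seed=None):
    """Draw one concrete undirected graph from a fuzzy graph.

    vertices   -- fixed list of vertex labels
    membership -- hypothetical callback returning the probability in [0, 1]
                  that the edge (u, v) exists
    """
    rng = random.Random(seed)
    edges = []
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            # rng.random() is in [0, 1), so probability 1.0 always keeps
            # the edge and probability 0.0 always drops it.
            if rng.random() < membership(u, v):
                edges.append((u, v))
    return edges


vertices = ["a", "b", "c", "d"]
all_edges = sample_fuzzy_graph(vertices, lambda u, v: 1.0)
no_edges = sample_fuzzy_graph(vertices, lambda u, v: 0.0)
```

A community detection algorithm could then be run on one or many such samples, or adapted to use the membership values directly as edge weights; which of these the paper does is not stated in the abstract.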
-
The Storage And Analytics Potential Of HBase Over The Cloud: A Survey
Authors:
Georgios Drakopoulos,
Andreas Kanavos,
Christos Makris,
Vasileios Megalooikonomou
Abstract:
Apache HBase, a mainstay of the emerging Hadoop ecosystem, is a NoSQL key-value and column family hybrid database which, unlike a traditional RDBMS, is intentionally designed to scalably host large, semistructured, and heterogeneous data. Prime examples of such data are biosignals, which are characterized by large volume, high volatility, and inherent multidimensionality. This paper reviews how biomedical engineering has recently taken advantage of HBase, with an emphasis on the cloud, in order to reliably host cardiovascular and respiratory time series. Moreover, the deployment of offline biomedical analytics over HBase is explored.
Submitted 22 February, 2017; v1 submitted 2 August, 2016;
originally announced August 2016.
-
Large Graph Models: A Review
Authors:
Georgios Drakopoulos,
Stavros Kontopoulos,
Christos Makris,
Vasileios Megalooikonomou
Abstract:
Large graphs can be found in a wide array of scientific fields ranging from sociology and biology to scientometrics and computer science. Their analysis is by no means a trivial task due to their sheer size and complex structure. Such structure encompasses features as diverse as diameter shrinking, power law degree distribution and self similarity, edge interdependence, and communities. When the adjacency matrix of a graph is considered, new spectral properties arise, such as primary eigenvalue component decay function, eigenvalue decay function, eigenvalue sign alternation around zero, and spectral gap. Graph mining is the scientific field which attempts to extract information and knowledge from graphs through their structural and spectral properties. Graph modeling is the associated field of generating synthetic graphs with properties similar to those of real graphs in order to simulate the latter. Such simulations may be desirable because of privacy concerns, cost, or lack of access to real data. Pivotal to simulation are low- and high-level software packages offering graph analysis and visualization capabilities. This survey outlines the most important structural and spectral graph properties, a considerable number of graph models, as well as the most common graph mining and graph learning tools.
Submitted 22 February, 2017; v1 submitted 24 January, 2016;
originally announced January 2016.
-
Code Quality Evaluation Methodology Using The ISO/IEC 9126 Standard
Authors:
Yiannis Kanellopoulos,
Panos Antonellis,
Dimitris Antoniou,
Christos Makris,
Evangelos Theodoridis,
Christos Tjortjis,
Nikos Tsirakis
Abstract:
This work proposes a methodology for source code quality and static behaviour evaluation of a software system, based on the ISO/IEC 9126 standard. It uses elements automatically derived from source code, enhanced with expert knowledge in the form of quality characteristic rankings, allowing software engineers to assign weights to source code attributes. It is flexible in terms of the set of metrics and source code attributes employed, and even in terms of the ISO/IEC 9126 characteristics to be assessed. We applied the methodology to two case studies, involving five open-source systems and one proprietary system. Results demonstrated that the methodology can capture software quality trends and express expert perceptions concerning system quality in a quantitative and systematic manner.
Submitted 29 July, 2010;
originally announced July 2010.