-
Improving clustering quality evaluation in noisy Gaussian mixtures
Authors:
Renato Cordeiro de Amorim,
Vladimir Makarenkov
Abstract:
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance iss…
▽ More
Clustering is a well-established technique in machine learning and data analysis, widely used across various domains. Cluster validity indices, such as the Average Silhouette Width, Calinski-Harabasz, and Davies-Bouldin indices, play a crucial role in assessing clustering quality when external ground truth labels are unavailable. However, these measures can be affected by the feature relevance issue, potentially leading to unreliable evaluations in high-dimensional or noisy data sets.
We introduce a theoretically grounded Feature Importance Rescaling (FIR) method that enhances the quality of clustering validation by adjusting feature contributions based on their dispersion. It attenuates noise features, clarifies clustering compactness and separation, and thereby aligns clustering validation more closely with the ground truth. Through extensive experiments on synthetic data sets under different configurations, we demonstrate that FIR consistently improves the correlation between the values of cluster validity indices and the ground truth, particularly in settings with noisy or irrelevant features.
The results show that FIR increases the robustness of clustering evaluation, reduces variability in performance across different data sets, and remains effective even when clusters exhibit significant overlap. These findings highlight the potential of FIR as a valuable enhancement of clustering validation, making it a practical tool for unsupervised learning tasks where labelled data is unavailable.
△ Less
Submitted 27 March, 2025; v1 submitted 1 March, 2025;
originally announced March 2025.
-
A Deep Learning Approach to Language-independent Gender Prediction on Twitter
Authors:
Reyhaneh Hashempour,
Barbara Plank,
Aline Villavicencio,
Renato Cordeiro de Amorim
Abstract:
This work presents a set of experiments conducted to predict the gender of Twitter users based on language-independent features extracted from the text of the users' tweets. The experiments were performed on a version of TwiSty dataset including tweets written by the users of six different languages: Portuguese, French, Dutch, English, German, and Italian. Logistic regression (LR), and feed-forwar…
▽ More
This work presents a set of experiments conducted to predict the gender of Twitter users based on language-independent features extracted from the text of the users' tweets. The experiments were performed on a version of TwiSty dataset including tweets written by the users of six different languages: Portuguese, French, Dutch, English, German, and Italian. Logistic regression (LR), and feed-forward neural networks (FFNN) with back-propagation were used to build models in two different settings: Inter-Lingual (IL) and Cross-Lingual (CL). In the IL setting, the training and testing were performed on the same language whereas in the CL, Italian and German datasets were set aside and only used as test sets and the rest were combined to compose training and development sets. In the IL, the highest accuracy score belongs to LR whereas in the CL, FFNN with three hidden layers yields the highest score. The results show that neural network based models underperform traditional models when the size of the training set is small; however, they beat traditional models by a non-trivial margin, when they are fed with large enough data. Finally, the feature analysis confirms that men and women have different writing styles independent of their language.
△ Less
Submitted 29 November, 2024;
originally announced November 2024.
-
Positional Encoder Graph Quantile Neural Networks for Geographic Data
Authors:
William E. R. de Amorim,
Scott A. Sisson,
T. Rodrigues,
David J. Nott,
Guilherme S. Rodrigues
Abstract:
Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GN…
▽ More
Positional Encoder Graph Neural Networks (PE-GNNs) are among the most effective models for learning from continuous spatial data. However, their predictive distributions are often poorly calibrated, limiting their utility in applications that require reliable uncertainty quantification. We propose the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel framework that combines PE-GNNs with Quantile Neural Networks, partially monotonic neural blocks, and post-hoc recalibration techniques. The PE-GQNN enables flexible and robust conditional density estimation with minimal assumptions about the target distribution, and it extends naturally to tasks beyond spatial data. Empirical results on benchmark datasets show that the PE-GQNN outperforms existing methods in both predictive accuracy and uncertainty quantification, without incurring additional computational cost. We also provide theoretical insights and identify important special cases arising from our formulation, including the PE-GNN.
△ Less
Submitted 15 May, 2025; v1 submitted 27 September, 2024;
originally announced September 2024.
-
Geo-Sketcher: Rapid 3D Geological Modeling using Geological and Topographic Map Sketches
Authors:
Ronan Amorim,
Emilio Vital Brazil,
Faramarz Samavati,
Mario Costa Sousa
Abstract:
The construction of 3D geological models is an essential task in oil/gas exploration, development and production. However, it is a cumbersome, time-consuming and error-prone task mainly because of the model's geometric and topological complexity. The models construction is usually separated into interpretation and 3D modeling, performed by different highly specialized individuals, which leads to i…
▽ More
The construction of 3D geological models is an essential task in oil/gas exploration, development and production. However, it is a cumbersome, time-consuming and error-prone task mainly because of the model's geometric and topological complexity. The models construction is usually separated into interpretation and 3D modeling, performed by different highly specialized individuals, which leads to inconsistencies and intensifies the challenges. In addition, the creation of models following geological rules is paramount for properly depicting static and dynamic properties of oil/gas reservoirs. In this work, we propose a sketch-based approach to expedite the creation of valid 3D geological models by mimicking how domain experts interpret geological structures, allowing creating models directly from interpretation sketches. Our sketch-based modeler (Geo-Sketcher) is based on sketches of standard 2D topographic and geological maps, comprised of lines, symbols and annotations. We developed a graph-based representation to enable (1) the automatic computation of the relative ages of rock series and layers, and (2) the embedding of specific geological rules directly in the sketching. We introduce the use of Hermite-Birkhoff Radial Basis Functions to interpolate the geological map constraints, and demonstrate the capabilities of our approach with a variety of results with different levels of complexity.
△ Less
Submitted 21 August, 2023;
originally announced August 2023.
-
Power Allocation for Uplink Communications of Massive Cellular-Connected UAVs
Authors:
Xuesong Cai,
István Z. Kovács,
Jeroen Wigard,
Rafhael Amorim,
Fredrik Tufvesson,
Preben E. Mogensen
Abstract:
Cellular-connected unmanned aerial vehicle (UAV) has attracted a surge of research interest in both academia and industry. To support aerial user equipment (UEs) in the existing cellular networks, one promising approach is to assign a portion of the system bandwidth exclusively to the UAV-UEs. This is especially favorable for use cases where a large number of UAV-UEs are exploited, e.g., for packa…
▽ More
Cellular-connected unmanned aerial vehicle (UAV) has attracted a surge of research interest in both academia and industry. To support aerial user equipment (UEs) in the existing cellular networks, one promising approach is to assign a portion of the system bandwidth exclusively to the UAV-UEs. This is especially favorable for use cases where a large number of UAV-UEs are exploited, e.g., for package delivery close to a warehouse. Although the nearly line-of-sight (LoS) channels can result in higher powers received, UAVs can in turn cause severe interference to each other in the same frequency band. In this contribution, we focus on the uplink communications of massive cellular-connected UAVs. Different power allocation algorithms are proposed to either maximize the minimal spectrum efficiency (SE) or maximize the overall SE to cope with severe interference based on the successive convex approximation (SCA) principle. One of the challenges is that a UAV can affect a large area meaning that many more UAV-UEs must be considered in the optimization problem, which is essentially different from that for terrestrial UEs. The necessity of single-carrier uplink transmission further complicates the problem. Nevertheless, we find that the special property of large coherent bandwidths and coherent times of the propagation channels can be leveraged. The performances of the proposed algorithms are evaluated via extensive simulations in the full-buffer transmission mode and bursty-traffic mode. Results show that the proposed algorithms can effectively enhance the uplink SEs. This work can be considered the first attempt to deal with the interference among massive cellular-connected UAV-UEs with optimized power allocations.
△ Less
Submitted 26 February, 2023; v1 submitted 25 July, 2021;
originally announced July 2021.
-
Improving cluster recovery with feature rescaling factors
Authors:
Renato Cordeiro de Amorim,
Vladimir Makarenkov
Abstract:
The data preprocessing stage is crucial in clustering. Features may describe entities using different scales. To rectify this, one usually applies feature normalisation aiming at rescaling features so that none of them overpowers the others in the objective function of the selected clustering algorithm. In this paper, we argue that the rescaling procedure should not treat all features identically.…
▽ More
The data preprocessing stage is crucial in clustering. Features may describe entities using different scales. To rectify this, one usually applies feature normalisation aiming at rescaling features so that none of them overpowers the others in the objective function of the selected clustering algorithm. In this paper, we argue that the rescaling procedure should not treat all features identically. Instead, it should favour the features that are more meaningful for clustering. With this in mind, we introduce a feature rescaling method that takes into account the within-cluster degree of relevance of each feature. Our comprehensive simulation study, carried out on real and synthetic data, with and without noise features, clearly demonstrates that clustering methods that use the proposed data normalization strategy clearly outperform those that use traditional data normalization.
△ Less
Submitted 1 December, 2020;
originally announced December 2020.
-
Identifying meaningful clusters in malware data
Authors:
Renato Cordeiro de Amorim,
Carlos David Lopez Ruiz
Abstract:
Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters with wide variations of cardinality. This happens because there can be considerable similarity between malware samples (some are even said to belong to the same family), and these tend to appear in bursts. Clustering algorithms are usually applied to no…
▽ More
Finding meaningful clusters in drive-by-download malware data is a particularly difficult task. Malware data tends to contain overlapping clusters with wide variations of cardinality. This happens because there can be considerable similarity between malware samples (some are even said to belong to the same family), and these tend to appear in bursts. Clustering algorithms are usually applied to normalised data sets. However, the process of normalisation aims at setting features with different range values to have a similar contribution to the clustering. It does not favour more meaningful features over those that are less meaningful, an effect one should perhaps expect of the data pre-processing stage.
In this paper we introduce a method to deal precisely with the problem above. This is an iterative data pre-processing method capable of aiding to increase the separation between clusters. It does so by calculating the within-cluster degree of relevance of each feature, and then it uses these as a data rescaling factor. By repeating this until convergence our malware data was separated in clear clusters, leading to a higher average silhouette width.
△ Less
Submitted 31 July, 2020;
originally announced August 2020.
-
An efficient density-based clustering algorithm using reverse nearest neighbour
Authors:
Stiphen Chowdhury,
Renato Cordeiro de Amorim
Abstract:
Density-based clustering is the task of discovering high-density regions of entities (clusters) that are separated from each other by contiguous regions of low-density. DBSCAN is, arguably, the most popular density-based clustering algorithm. However, its cluster recovery capabilities depend on the combination of the two parameters. In this paper we present a new density-based clustering algorithm…
▽ More
Density-based clustering is the task of discovering high-density regions of entities (clusters) that are separated from each other by contiguous regions of low-density. DBSCAN is, arguably, the most popular density-based clustering algorithm. However, its cluster recovery capabilities depend on the combination of the two parameters. In this paper we present a new density-based clustering algorithm which uses reverse nearest neighbour (RNN) and has a single parameter. We also show that it is possible to estimate a good value for this parameter using a clustering validity index. The RNN queries enable our algorithm to estimate densities taking more than a single entity into account, and to recover clusters that are not well-separated or have different densities. Our experiments on synthetic and real-world data sets show our proposed algorithm outperforms DBSCAN and its recent variant ISDBSCAN.
△ Less
Submitted 19 November, 2018;
originally announced November 2018.
-
A-Ward_p\b{eta}: Effective hierarchical clustering using the Minkowski metric and a fast k -means initialisation
Authors:
Renato Cordeiro de Amorim,
Vladimir Makarenkov,
Boris Mirkin
Abstract:
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from…
▽ More
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. This allows the cluster merging process to start from this partition rather than from a trivial partition composed solely of singletons. Our second contribution is an extension of the Ward and Ward p algorithms to the situation where the feature weight exponent can differ from the exponent of the Minkowski distance. This new method, called A-Ward p\b{eta} , is able to generate a much wider variety of clustering solutions. We also demonstrate that its parameters can be estimated reasonably well by using a cluster validity index. We perform numerous experiments using data sets with two types of noise, insertion of noise features and blurring within-cluster values of some features. These experiments allow us to conclude: (i) our anomalous pattern initialisation method does indeed reduce the time a hierarchical clustering algorithm takes to complete, without negatively impacting its cluster recovery ability; (ii) A-Ward p\b{eta} provides better cluster recovery than both Ward and Ward p.
△ Less
Submitted 3 November, 2016;
originally announced November 2016.
-
Recovering the number of clusters in data sets with noise features using feature rescaling factors
Authors:
Renato Cordeiro de Amorim,
Christian Hennig
Abstract:
In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of re…
▽ More
In this paper we introduce three methods for re-scaling data sets aiming at improving the likelihood of clustering validity indexes to return the true number of spherical Gaussian clusters with additional noise features. Our method obtains feature re-scaling factors taking into account the structure of a given data set and the intuitive idea that different features may have different degrees of relevance at different clusters.
We experiment with the Silhouette (using squared Euclidean, Manhattan, and the p$^{th}$ power of the Minkowski distance), Dunn's, Calinski-Harabasz and Hartigan indexes on data sets with spherical Gaussian clusters with and without noise features. We conclude that our methods indeed increase the chances of estimating the true number of clusters in a data set.
△ Less
Submitted 22 February, 2016;
originally announced February 2016.
-
A survey on feature weighting based K-Means algorithms
Authors:
Renato Cordeiro de Amorim
Abstract:
In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be…
▽ More
In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based in K-Means.
△ Less
Submitted 22 September, 2015;
originally announced January 2016.