-
How to Use K-means for Big Data Clustering?
Authors:
Rustam Mussabayev,
Nenad Mladenovic,
Bassem Jarboui,
Ravil Mussabayev
Abstract:
K-means plays a vital role in data mining and is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model. However, its performance drops drastically when it is applied to vast amounts of data. Therefore, it is crucial to improve K-means by scaling it to big data using as few of the following computational resources as possible: data, time, and algorithmic ingredients. We propose a new parallel scheme of using the K-means and K-means++ algorithms for big data clustering that satisfies the properties of a "true big data" algorithm and outperforms the classical and recent state-of-the-art MSSC approaches in terms of solution quality and runtime. The new approach naturally implements global search by decomposing the MSSC problem without using additional metaheuristics. This work shows that data decomposition is the basic approach to solving the big data clustering problem. The empirical success of the new algorithm allowed us to challenge the common belief that more data is required to obtain a good clustering solution. Moreover, the present work questions the established trend that more sophisticated hybrid approaches and algorithms are required to obtain a better clustering solution.
Submitted 23 November, 2023; v1 submitted 14 April, 2022;
originally announced April 2022.
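For readers unfamiliar with the MSSC model named in the abstract above, the following minimal sketch evaluates the standard MSSC objective: the sum, over all points, of the squared Euclidean distance to the nearest centroid. It is not the authors' parallel scheme; the function names `squared_distance` and `mssc_objective` and the toy data are illustrative assumptions.

```python
from typing import Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mssc_objective(points: Sequence[Point], centroids: Sequence[Point]) -> float:
    """MSSC objective: each point contributes its squared distance
    to the closest centroid (hypothetical helper, for illustration only)."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Toy data (assumed): two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
print(mssc_objective(points, centroids))  # 4.0
```

K-means and K-means++ are heuristics for lowering this value; comparing the objective across solutions is one way to read the "solution quality" mentioned in the abstract.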
-
Towards an intelligent VNS heuristic for the k-labelled spanning forest problem
Authors:
Sergio Consoli,
Josè Andrès Moreno Pèrez,
Nenad Mladenovic
Abstract:
In an ongoing project, we investigate a new possibility for solving the k-labelled spanning forest (kLSF) problem with an intelligent Variable Neighbourhood Search (Int-VNS) metaheuristic. In the kLSF problem we are given an undirected input graph G and a positive integer k, and the aim is to find a spanning forest of G that has the minimum number of connected components while using at most k labels. The problem is related to the minimum labelling spanning tree (MLST) problem, whose goal is to find the spanning tree of the input graph with the minimum number of labels, and it has several real-world applications in which one aims to ensure connectivity by means of homogeneous connections. The Int-VNS metaheuristic that we propose for the kLSF problem is derived from the promising intelligent VNS strategy recently proposed for the MLST problem. It integrates the basic VNS for the kLSF problem with complementary approaches from machine learning, statistics and experimental algorithmics, in order to achieve high-quality performance and to completely automate the resulting strategy.
Submitted 5 March, 2015;
originally announced March 2015.
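The kLSF problem statement in the abstract above can be made concrete with a small sketch. It is a brute-force evaluator for tiny instances, not the Int-VNS metaheuristic; the edge representation `(u, v, label)`, the function names, and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def count_components(num_vertices, edges, labels):
    """Number of connected components when only edges whose label
    is in `labels` are kept (union-find with path halving)."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for u, v, label in edges:
        if label in labels:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
                components -= 1
    return components


def klsf_brute_force(num_vertices, edges, k):
    """Try every label subset of size min(k, number of labels).
    Adding labels never increases the component count, so smaller
    subsets need not be checked. Exponential; tiny instances only."""
    all_labels = sorted({label for _, _, label in edges})
    size = min(k, len(all_labels))
    best_count, best_labels = num_vertices, set()
    for subset in combinations(all_labels, size):
        count = count_components(num_vertices, edges, set(subset))
        if count < best_count:
            best_count, best_labels = count, set(subset)
    return best_count, best_labels


# Toy instance (assumed): 5 vertices, edges given as (u, v, label).
edges = [(0, 1, "a"), (1, 2, "a"), (2, 3, "b"), (3, 4, "c"), (0, 4, "b")]
print(klsf_brute_force(5, edges, 1))  # (3, {'a'})
print(klsf_brute_force(5, edges, 2))  # 1 component using labels a and b
```

A spanning forest of the subgraph induced by the chosen labels has exactly as many trees as that subgraph has components, which is why counting components is enough to score a label set.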
-
BVNS para el problema del bosque generador k-etiquetado (BVNS for the k-labelled spanning forest problem)
Authors:
Sergio Consoli,
Nenad Mladenovìc,
Josè A. Moreno-Pèrez
Abstract:
In this paper we propose an efficient VNS-based solution for the k-labelled spanning forest problem. This problem is an extension of the minimum labelling spanning tree problem, with important applications in telecommunications networks and multimodal transport. Given an undirected graph whose links are labelled and a positive integer k, the task is to find the spanning forest with the lowest number of connected components that uses at most k different labels. To address the problem, a Basic Variable Neighbourhood Search (BVNS) is proposed in which the maximum amplitude of the neighbourhood space, n, is a key parameter. Different strategies for setting the value of n are studied. BVNS with the best strategy is compared experimentally with other metaheuristics from the literature that have been applied to this type of problem.
Submitted 4 March, 2015;
originally announced March 2015.
-
Mejora de la exploracion y la explotacion de las heuristicas constructivas para el MLSTP (Improving the exploration and exploitation of constructive heuristics for the MLSTP)
Authors:
Sergio Consoli,
Jose Andres Moreno-Perez,
Kenneth Darby-Dowman,
Nenad Mladenovic
Abstract:
This paper studies constructive heuristics for the minimum labelling spanning tree (MLST) problem. The purpose is to find a spanning tree that uses edges that are as similar as possible. Given an undirected, labelled, connected graph (i.e., with a label or colour for each edge), the MLST problem seeks a spanning tree whose edges have the smallest possible number of distinct labels. The model can represent many real-world problems in telecommunication networks, electric networks, and multimodal transportation networks, among others, and the problem has been shown to be NP-complete even for complete graphs. A primary heuristic, named the maximum vertex covering algorithm, has been proposed, and several versions of this constructive heuristic have been developed to improve its efficiency. Here we describe the problem, review the literature and compare some variants of this algorithm.
Submitted 16 April, 2014;
originally announced May 2014.
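The abstract above names the maximum vertex covering algorithm without describing it. The sketch below shows a greedy constructive rule in that spirit, assuming the commonly described formulation: start with no labels and repeatedly add the label that leaves the fewest connected components, until the labelled subgraph is connected. It is not one of the variants compared in the paper; the names, tie-breaking rule, and toy graph are assumptions.

```python
def count_components(num_vertices, edges, labels):
    """Connected components of the subgraph that keeps only edges
    whose label is in `labels` (union-find with path halving)."""
    parent = list(range(num_vertices))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = num_vertices
    for u, v, label in edges:
        if label in labels:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
                components -= 1
    return components


def greedy_mlst_labels(num_vertices, edges):
    """Greedy label selection: at each step add the unused label that
    minimises the component count. Ties go to the alphabetically
    first label (an assumption of this sketch)."""
    unused = sorted({label for _, _, label in edges})
    chosen = []
    while count_components(num_vertices, edges, set(chosen)) > 1:
        if not unused:
            raise ValueError("input graph is not connected")
        best = min(
            unused,
            key=lambda lab: count_components(
                num_vertices, edges, set(chosen) | {lab}
            ),
        )
        chosen.append(best)
        unused.remove(best)
    return chosen


# Toy connected graph (assumed): edges given as (u, v, label).
edges = [(0, 1, "a"), (1, 2, "a"), (2, 3, "b"), (3, 4, "c"), (0, 4, "b")]
print(greedy_mlst_labels(5, edges))  # ['a', 'b']
```

Any spanning tree of the subgraph induced by the returned labels uses at most that many distinct labels. The rule is a heuristic and is not guaranteed to return a minimum label set.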
-
Solving the minimum labelling spanning tree problem using intelligent optimization
Authors:
Sergio Consoli,
Nenad Mladenovic,
Jose Andres Moreno-Perez
Abstract:
Given a connected, undirected graph whose edges are labelled (or coloured), the minimum labelling spanning tree (MLST) problem seeks a spanning tree whose edges have the smallest number of distinct labels (or colours). In recent work, the MLST problem has been shown to be NP-hard, and some effective heuristics have been proposed and analyzed. In this paper we present an intelligent optimization algorithm to solve the problem. It is obtained by integrating the basic Variable Neighbourhood Search heuristic with complementary approaches from machine learning, statistics and experimental algorithmics, in order to achieve high-quality performance and to completely automate the resulting optimization strategy. We present experimental results on randomly generated graphs with different statistical properties, showing the crucial effects of the implementation, the robustness, and the empirical scalability of our intelligent algorithm. Furthermore, the computational experiments show that the proposed strategy outperforms the heuristics recommended in the literature and is able to obtain optimal or near-optimal solutions in short running times.
Submitted 3 March, 2014; v1 submitted 11 January, 2012;
originally announced January 2012.