-
Strong bounds for large-scale Minimum Sum-of-Squares Clustering
Authors:
Anna Livia Croella,
Veronica Piccialli,
Antonio M. Sudoso
Abstract:
Clustering is a fundamental technique in data analysis and machine learning, used to group similar data points together. Among various clustering methods, the Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used. MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids. Due to the unsupervised nature of clustering, achieving global optimality is crucial, yet computationally challenging. The complexity of finding the global solution increases exponentially with the number of data points, making exact methods impractical for large-scale datasets. Even obtaining strong lower bounds on the optimal MSSC objective value is computationally prohibitive, making it difficult to assess the quality of heuristic solutions. We address this challenge by introducing a novel method to validate heuristic MSSC solutions through optimality gaps. Our approach employs a divide-and-conquer strategy, decomposing the problem into smaller instances that can be handled by an exact solver. The decomposition is guided by an auxiliary optimization problem, the "anticlustering problem", for which we design an efficient heuristic. Computational experiments demonstrate the effectiveness of the method for large-scale instances, achieving optimality gaps below 3% in most cases while maintaining reasonable computational times. These results highlight the practicality of our approach in assessing feasible clustering solutions for large datasets, bridging a critical gap in MSSC evaluation.
Submitted 12 February, 2025;
originally announced February 2025.
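A minimal sketch of why a decomposition can certify a heuristic solution. It rests on a general fact about MSSC rather than on the authors' algorithm: if the data set is split into disjoint subsets, the sum of the optimal MSSC values of the subsets cannot exceed the optimal value on the full data set. The brute-force solver, the toy data, the chosen split, and all function names below are illustrative assumptions; the paper uses an exact solver and an anticlustering heuristic to choose the subsets.

```python
from itertools import product


def sse(points):
    """Sum of squared Euclidean distances from the points to their centroid."""
    if not points:
        return 0.0
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))


def mssc_brute_force(points, k):
    """Exact MSSC value with at most k clusters, by full enumeration.

    Only usable on tiny instances: it tries all k**n assignments.
    """
    best = float("inf")
    for labels in product(range(k), repeat=len(points)):
        cost = sum(
            sse([p for p, lab in zip(points, labels) if lab == c])
            for c in range(k)
        )
        best = min(best, cost)
    return best


def optimality_gap(upper_bound, lower_bound):
    """Relative gap between a feasible value and a lower bound."""
    return (upper_bound - lower_bound) / upper_bound


# Toy data (hypothetical): two well-separated columns of three points each.
points = [(0, 0), (0, 1), (4, 0), (4, 1), (0, 2), (4, 2)]
k = 2

# Upper bound: cost of a feasible clustering (here, grouping by x-coordinate).
heuristic_labels = [0 if p[0] == 0 else 1 for p in points]
upper_bound = sum(
    sse([p for p, lab in zip(points, heuristic_labels) if lab == c])
    for c in range(k)
)

# Lower bound: solve each subset exactly and add up the optimal values.
subsets = [points[:3], points[3:]]
lower_bound = sum(mssc_brute_force(s, k) for s in subsets)

exact = mssc_brute_force(points, k)
gap = optimality_gap(upper_bound, lower_bound)
print(lower_bound, exact, upper_bound, gap)
```

The arbitrary split used here gives a valid but loose bound. How the subsets are chosen determines how tight the bound is, which is the role the abstract assigns to the anticlustering problem.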
-
Optimization for Evaluating the Practical Capacity of a Transshipment Yard
Authors:
Anna Russo Russo,
Roberto Mancini,
Gianpaolo Oriolo,
Veronica Piccialli,
Davide Ussai
Abstract:
In order to increase rail freight transportation in Italy, Rete Ferroviaria Italiana (RFI), the Italian railway infrastructure manager, is carrying out several investment plans to enhance the transshipment yards that act as an interface between the rail and road networks. The need is to increase their practical capacity, i.e., the maximum number of train services that can be inserted without altering the current timetable while respecting all relevant constraints. Several factors influence the practical capacity of a transshipment yard: physical resources (such as tracks and vehicles for loading/unloading); constraints on the possible time slots of individual operations; and constraints on the length of time a train must stay in the yard, which follow both from timetable requirements set by the (prevalent) main line and from administrative and organisational issues in the yard. In this paper, we propose a MILP-based optimization model, built on the solution of a suitable saturation problem, that deals with all these constraints and can be used to evaluate the practical capacity of a transshipment yard both in its current configuration and in any plausible future configuration. The model provides operational details, such as routes and schedules, for each train service, and allows imposing periodic timetables and schedules that make the daily management of the yard easier. Both the model and its solutions are validated on a real Italian transshipment yard, located at Marzaglia, on different scenarios corresponding to different investment plans of RFI. The results show that proper investments make it possible to obtain a feasible timetable with a period of 24 hours and double the current number of train services.
Submitted 3 October, 2024;
originally announced October 2024.
-
Optimization meets Machine Learning: An Exact Algorithm for Semi-Supervised Support Vector Machines
Authors:
Veronica Piccialli,
Jan Schwiddessen,
Antonio M. Sudoso
Abstract:
Support vector machines (SVMs) are well-studied supervised learning models for binary classification. In many applications, large amounts of samples can be cheaply and easily obtained. What is often a costly and error-prone process is to manually label these instances. Semi-supervised support vector machines (S3VMs) extend the well-known SVM classifiers to the semi-supervised approach, aiming at maximizing the margin between samples in the presence of unlabeled data. By leveraging both labeled and unlabeled data, S3VMs attempt to achieve better accuracy and robustness compared to traditional SVMs. Unfortunately, the resulting optimization problem is non-convex and hence difficult to solve exactly. In this paper, we present a new branch-and-cut approach for S3VMs using semidefinite programming (SDP) relaxations. We apply optimality-based bound tightening to bound the feasible set. Box constraints allow us to include valid inequalities, strengthening the lower bound. The resulting SDP relaxation provides bounds significantly stronger than the ones available in the literature. For the upper bound, instead, we define a local search exploiting the solution of the SDP relaxation. Computational results highlight the efficiency of the algorithm, showing its capability to solve instances with a number of data points 10 times larger than the ones solved in the literature.
Submitted 25 November, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
Supervised Feature Compression based on Counterfactual Analysis
Authors:
Veronica Piccialli,
Dolores Romero Morales,
Cecilia Salvatore
Abstract:
Counterfactual Explanations are becoming a de facto standard in post-hoc interpretable machine learning. For a given classifier and an instance classified in an undesired class, its counterfactual explanation corresponds to small perturbations of that instance that allow changing the classification outcome. This work aims to leverage Counterfactual Explanations to detect the important decision boundaries of a pre-trained black-box model. This information is used to build a supervised discretization of the features in the dataset with a tunable granularity. Using the discretized dataset, an optimal Decision Tree can be trained that resembles the black-box model but is interpretable and compact. Numerical results on real-world datasets show the effectiveness of the approach in terms of accuracy and sparsity.
Submitted 24 November, 2023; v1 submitted 17 November, 2022;
originally announced November 2022.
-
Fix and Bound: An efficient approach for solving large-scale quadratic programming problems with box constraints
Authors:
Marco Locatelli,
Veronica Piccialli,
Antonio M. Sudoso
Abstract:
In this paper, we propose a branch-and-bound algorithm for solving nonconvex quadratic programming problems with box constraints (BoxQP). Our approach combines existing tools, such as semidefinite programming (SDP) bounds strengthened through valid inequalities, with a new class of optimality-based linear cuts which leads to variable fixing. The most important effect of fixing the value of some variables is the size reduction along the branch-and-bound tree, which allows bounds to be computed by solving SDPs of smaller dimension. Extensive computational experiments over large-dimensional (up to $n=200$) test instances show that our method is the state-of-the-art solver on large-scale BoxQPs. Furthermore, we test the proposed approach on the class of binary QP problems, where its performance is competitive with state-of-the-art solvers.
Submitted 2 October, 2024; v1 submitted 16 November, 2022;
originally announced November 2022.
-
Global Optimization for Cardinality-constrained Minimum Sum-of-Squares Clustering via Semidefinite Programming
Authors:
Veronica Piccialli,
Antonio M. Sudoso
Abstract:
The minimum sum-of-squares clustering (MSSC), or k-means type clustering, has been recently extended to exploit prior knowledge on the cardinality of each cluster. Such knowledge is used to increase performance as well as solution quality. In this paper, we propose a global optimization approach based on the branch-and-cut technique to solve the cardinality-constrained MSSC. For the lower bound routine, we use the semidefinite programming (SDP) relaxation recently proposed by Rujeerapaiboon et al. [SIAM J. Optim. 29(2), 1211-1239, (2019)]. However, this relaxation can be used in a branch-and-cut method only for small-size instances. Therefore, we derive a new SDP relaxation that scales better with the instance size and the number of clusters. In both cases, we strengthen the bound by adding polyhedral cuts. Benefiting from a tailored branching strategy which enforces pairwise constraints, we reduce the complexity of the problems arising in the children nodes. For the upper bound, instead, we present a local search procedure that exploits the solution of the SDP relaxation solved at each node. Computational results show that the proposed algorithm globally solves, for the first time, real-world instances of size 10 times larger than those solved by state-of-the-art exact methods.
Submitted 7 September, 2023; v1 submitted 19 September, 2022;
originally announced September 2022.
-
An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering
Authors:
Veronica Piccialli,
Anna Russo Russo,
Antonio M. Sudoso
Abstract:
The minimum sum-of-squares clustering (MSSC), or k-means type clustering, is traditionally considered an unsupervised learning task. In recent years, the use of background knowledge to improve the cluster quality and promote interpretability of the clustering process has become a hot research topic at the intersection of mathematical optimization and machine learning research. The problem of taking advantage of background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a branch-and-cut algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound procedure, we solve the semidefinite programming relaxation of the MSSC discrete optimization model, and we use a cutting-plane procedure for strengthening the bound. For the upper bound, instead, by using integer programming tools, we use an adaptation of the k-means algorithm to the constrained case. For the first time, the proposed global optimization algorithm efficiently manages to solve real-world instances with up to 800 data points, with different combinations of must-link and cannot-link constraints and with a generic number of features. This problem size is about four times larger than that of the instances solved by state-of-the-art exact algorithms.
Submitted 24 July, 2022; v1 submitted 30 November, 2021;
originally announced November 2021.
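As an illustration of the pairwise background knowledge described above, the sketch below checks a cluster assignment against must-link and cannot-link constraints. The function name, the data layout (index pairs into a label list), and the toy values are assumptions made for this example; the paper's branch-and-cut algorithm is not reproduced here.

```python
def violated_constraints(labels, must_link, cannot_link):
    """Return the pairwise constraints that a cluster assignment violates.

    labels[i] is the cluster index of data point i; must_link and
    cannot_link are lists of index pairs (i, j).
    """
    violations = []
    for i, j in must_link:
        # A must-link pair has to share the same cluster.
        if labels[i] != labels[j]:
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        # A cannot-link pair has to be placed in different clusters.
        if labels[i] == labels[j]:
            violations.append(("cannot-link", i, j))
    return violations


# Hypothetical assignment of five points to three clusters.
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(1, 2), (2, 3)]

result = violated_constraints(labels, must_link, cannot_link)
print(result)
```

An assignment is feasible for the constrained problem only when this list is empty.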
-
Mixed-Integer Nonlinear Programming for State-based Non-Intrusive Load Monitoring
Authors:
Marco Balletti,
Veronica Piccialli,
Antonio M. Sudoso
Abstract:
Energy disaggregation, known in the literature as Non-Intrusive Load Monitoring (NILM), is the task of inferring the energy consumption of each appliance given the aggregate signal recorded by a single smart meter. In this paper, we propose a novel two-stage optimization-based approach for energy disaggregation. In the first phase, a small training set consisting of disaggregated power profiles is used to estimate the parameters and the power states by solving a mixed integer programming problem. Once the model parameters are estimated, the energy disaggregation problem is formulated as a constrained binary quadratic optimization problem. We incorporate penalty terms that exploit prior knowledge on how the disaggregated traces are generated, and appliance-specific constraints characterizing the signature of different types of appliances operating simultaneously. Our approach is compared with existing optimization-based algorithms both on a synthetic dataset and on three real-world datasets. The proposed formulation is computationally efficient, able to disambiguate loads with similar consumption patterns, and successfully reconstruct the signatures of known appliances despite the presence of unmetered devices, thus overcoming the main drawbacks of the optimization-based methods available in the literature.
Submitted 22 February, 2022; v1 submitted 16 June, 2021;
originally announced June 2021.
-
SOS-SDP: an Exact Solver for Minimum Sum-of-Squares Clustering
Authors:
Veronica Piccialli,
Antonio M. Sudoso,
Angelika Wiegele
Abstract:
The minimum sum-of-squares clustering problem (MSSC) consists of partitioning $n$ observations into $k$ clusters in order to minimize the sum of squared distances from the points to the centroid of their cluster. In this paper, we propose an exact algorithm for the MSSC problem based on the branch-and-bound technique. The lower bound is computed by using a cutting-plane procedure where valid inequalities are iteratively added to the Peng-Wei SDP relaxation. The upper bound is computed with the constrained version of k-means where the initial centroids are extracted from the solution of the SDP relaxation. In the branch-and-bound procedure, we incorporate instance-level must-link and cannot-link constraints to express knowledge about which data points should or should not be grouped together. We manage to reduce the size of the problem at each level while preserving the structure of the SDP problem itself. The obtained results show that the approach makes it possible to solve, for the first time, real-world instances with up to 4000 data points.
Submitted 23 December, 2021; v1 submitted 23 April, 2021;
originally announced April 2021.
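For reference, the MSSC problem stated in words above can be written as follows. The symbols $p_i$ (the $i$-th observation), $C_j$ (the index set of cluster $j$), and $m_j$ (its centroid) are notation assumed here, not taken from the listing:

$$
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{i \in C_j} \lVert p_i - m_j \rVert^2,
\qquad
m_j = \frac{1}{|C_j|} \sum_{i \in C_j} p_i,
$$

where $C_1,\dots,C_k$ range over the partitions of $\{1,\dots,n\}$ into $k$ nonempty sets.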
-
Improving P300 Speller performance by means of optimization and machine learning
Authors:
Luigi Bianchi,
Chiara Liti,
Giampaolo Liuzzi,
Veronica Piccialli,
Cecilia Salvatore
Abstract:
Brain-Computer Interfaces (BCIs) are systems allowing people to interact with the environment bypassing the natural neuromuscular and hormonal outputs of the peripheral nervous system (PNS). These interfaces record a user's brain activity and translate it into control commands for external devices, thus providing the PNS with additional artificial outputs. In this framework, the BCIs based on the P300 Event-Related Potentials (ERP), which represent the electrical responses recorded from the brain after specific events or stimuli, have proven to be particularly successful and robust. The presence or the absence of a P300 evoked potential within the EEG features is determined through a classification algorithm. Linear classifiers such as SWLDA and SVM are the most used for ERP classification. Due to the low signal-to-noise ratio (SNR) of the EEG signals, multiple stimulation sequences (a.k.a. iterations) are carried out and then averaged before the signals are classified. However, while increasing the number of iterations improves the SNR, it also slows down the process. In the early studies, the number of iterations was fixed (no stopping), but recently, several early stopping strategies have been proposed in the literature to dynamically interrupt the stimulation sequence when a certain criterion is met, in order to enhance the communication rate. In this work, we explore how to improve the classification performance in P300-based BCIs by combining optimization and machine learning. First, we propose a new decision function that aims at improving classification performance in terms of accuracy and Information Transfer Rate in both a no-stopping and an early-stopping setting. Then, we propose a new SVM training problem that aims to facilitate the target-detection process. Our approach proves to be effective on several publicly available datasets.
Submitted 13 July, 2020;
originally announced July 2020.
-
Computing mixed strategies equilibria in presence of switching costs by the solution of nonconvex QP problems
Authors:
Giampaolo Liuzzi,
Marco Locatelli,
Veronica Piccialli,
Stefan Rass
Abstract:
In this paper we address game theory problems arising in the context of network security. In traditional game theory problems, given a defender and an attacker, one searches for mixed strategies which minimize a linear payoff functional. In the problems addressed in this paper, an additional quadratic term is added to the minimization problem. This term represents switching costs, i.e., the costs for the defender of switching from a given strategy to another one at successive rounds of a Nash game. The resulting problems are nonconvex QP problems with linear constraints and turn out to be very challenging. We will show that the most recent approaches for the minimization of nonconvex QP functions over polytopes, including commercial solvers such as CPLEX and GUROBI, are unable to solve to optimality even test instances with n = 50 variables. For this reason, we propose adding these instances to the current benchmark set of test instances for QP problems. We also present a spatial branch-and-bound approach for the solution of these problems, where a predominant role is played by an optimality-based domain reduction, with multiple solutions of LP problems at each node of the branch-and-bound tree. Of course, domain reductions are standard tools in spatial branch-and-bound approaches. However, our contribution lies in the observation that, from the computational point of view, a rather aggressive application of these tools appears to be the best way to tackle the proposed instances. Indeed, according to our experiments, while they make the computational cost per node high, this is largely compensated by the rather slow growth of the number of nodes in the branch-and-bound tree, so that the proposed approach strongly outperforms the existing solvers for QP problems.
Submitted 20 September, 2020; v1 submitted 28 February, 2020;
originally announced February 2020.