-
Beyond adaptive gradient: Fast-Controlled Minibatch Algorithm for large-scale optimization
Authors:
Corrado Coppola,
Lorenzo Papa,
Irene Amerini,
Laura Palagi
Abstract:
Adaptive gradient methods have been increasingly adopted by the deep learning community due to their fast convergence and reduced sensitivity to hyper-parameters. However, these methods come with limitations, such as increased memory requirements for elements like moving averages and a poorly understood convergence theory. To overcome these challenges, we introduce F-CMA, a Fast-Controlled Mini-batch Algorithm with a random reshuffling method featuring a sufficient decrease condition and a line-search procedure to ensure loss reduction per epoch, along with its deterministic proof of global convergence to a stationary point. To evaluate F-CMA, we integrate it into conventional training protocols for classification tasks involving both convolutional neural networks and vision transformer models, allowing for a direct comparison with popular optimizers. Computational tests show significant improvements, including a decrease in the overall training time by up to 68%, an increase in per-epoch efficiency by up to 20%, and in model accuracy by up to 5%.
Submitted 16 December, 2024; v1 submitted 24 November, 2024;
originally announced November 2024.
-
Computational issues in Optimization for Deep networks
Authors:
Corrado Coppola,
Lorenzo Papa,
Marco Boresta,
Irene Amerini,
Laura Palagi
Abstract:
The paper aims to investigate relevant computational issues of deep neural network architectures with an eye to the interaction between the optimization algorithm and the classification performance. In particular, we aim to analyze the behaviour of state-of-the-art optimization algorithms in relation to their hyperparameter settings, in order to assess their robustness with respect to the choice of starting point, which may lead to different local solutions. We conduct extensive computational experiments using nine open-source optimization algorithms to train deep Convolutional Neural Network architectures on a multi-class image classification task. Specifically, we consider several architectures by changing the number of layers and neurons per layer, in order to evaluate the impact of different width and depth structures on the computational optimization performance.
Submitted 3 May, 2024;
originally announced May 2024.
-
Feature selection in linear SVMs via a hard cardinality constraint: a scalable SDP decomposition approach
Authors:
Immanuel Bomze,
Federico D'Onofrio,
Laura Palagi,
Bo Peng
Abstract:
In this paper, we study the embedded feature selection problem in linear Support Vector Machines (SVMs), in which a cardinality constraint is employed, leading to an interpretable classification model. The problem is NP-hard due to the presence of the cardinality constraint, even though the original linear SVM amounts to a problem solvable in polynomial time. To handle the hard problem, we first introduce two mixed-integer formulations for which novel semidefinite relaxations are proposed. Exploiting the sparsity pattern of the relaxations, we decompose the problems and obtain equivalent relaxations in a much smaller cone, making the conic approaches scalable. To make the best use of the decomposed relaxations, we propose heuristics that use information from their optimal solutions. Moreover, an exact procedure is proposed by solving a sequence of mixed-integer decomposed semidefinite optimization problems. Numerical results on classical benchmarking datasets are reported, showing the efficiency and effectiveness of our approach.
Submitted 19 December, 2024; v1 submitted 15 April, 2024;
originally announced April 2024.
-
Benders decomposition for congested partial set covering location with uncertain demand
Authors:
Alice Calamita,
Ivana Ljubić,
Laura Palagi
Abstract:
In this paper, we introduce a mixed integer quadratic formulation for the congested variant of the partial set covering location problem, which involves determining a subset of facility locations to open and efficiently allocating customers to these facilities to minimize the combined costs of facility opening and congestion while ensuring target coverage. To enhance the resilience of the solution against demand fluctuations, we address the case under uncertain customer demand using $Γ$-robustness. We formulate the deterministic problem and its robust counterpart as mixed-integer quadratic problems. We investigate the effect of the protection level in adapted instances from the literature to provide critical insights into how sensitive the planning is to the protection level. Moreover, since the size of the robust counterpart grows with the number of customers, which could be significant in real-world contexts, we propose the use of Benders decomposition to effectively reduce the number of variables by projecting out of the master problem all the variables dependent on the number of customers. We illustrate how to incorporate our Benders approach within a mixed-integer second-order cone programming (MISOCP) solver, addressing explicitly all the ingredients that are instrumental for its success. We discuss single-tree and multi-tree approaches and introduce a perturbation technique to deal with the degeneracy of the Benders subproblem efficiently. Our tailored Benders approaches outperform the perspective reformulation solved using the state-of-the-art MISOCP solver Gurobi on adapted instances from the literature.
Submitted 23 January, 2024;
originally announced January 2024.
-
CMA Light: a novel Minibatch Algorithm for large-scale non convex finite sum optimization
Authors:
Corrado Coppola,
Giampaolo Liuzzi,
Laura Palagi
Abstract:
The supervised training of a deep neural network on a given dataset consists in the unconstrained minimization of the finite sum of continuously differentiable functions, commonly referred to as the loss with respect to the samples. These functions depend on the network parameters and are most often non-convex. We develop CMA Light, a globally convergent mini-batch gradient method to tackle this problem. We consider the recently introduced Controlled Minibatch Algorithm (CMA) framework and we overcome its main bottleneck, removing the need for at least one evaluation of the whole objective function per iteration. We prove global convergence of CMA Light under mild assumptions and we discuss extensive computational results on the same experimental test-bed used for CMA, showing that CMA Light requires less computational effort than most of the state-of-the-art optimizers. Finally, we present early results on a large-scale Image Classification task.
Submitted 22 May, 2024; v1 submitted 28 July, 2023;
originally announced July 2023.
-
A computational study of off-the-shelf MINLP solvers on a benchmark set of congested capacitated facility location problems
Authors:
Pasquale Avella,
Alice Calamita,
Laura Palagi
Abstract:
This paper analyzes the performance of five well-known off-the-shelf optimization solvers on a set of congested capacitated facility location problems formulated as mixed-integer conic programs (MICPs). We aim to compare the computational efficiency of the solvers and examine the solution strategies they adopt when solving instances with different sizes and complexity. The solvers we compare are Gurobi, Cplex, Mosek, Xpress, and Scip. We run extensive numerical tests on a testbed of 30 instances from the literature. Our results show that Mosek and Gurobi are the most competitive solvers, as they achieve better time and gap performance, solving most instances within the time limit. Mosek outperforms Gurobi in large-size problems and provides more accurate solutions in terms of feasibility. Xpress solves to optimality about half of the instances tested within the time limit, and in this half, it achieves performance similar to that of Gurobi and Mosek. Cplex and Scip emerge as the least competitive solvers. The results provide guidelines on how each solver behaves on this class of problems and highlight the importance of choosing a solver suited to the problem type.
Submitted 2 August, 2023; v1 submitted 7 March, 2023;
originally announced March 2023.
-
Unboxing Tree Ensembles for interpretability: a hierarchical visualization tool and a multivariate optimal re-built tree
Authors:
Giulia Di Teodoro,
Marta Monaci,
Laura Palagi
Abstract:
The interpretability of models has become a crucial issue in Machine Learning because of algorithmic decisions' growing impact on real-world applications. Tree ensemble methods, such as Random Forests or XGBoost, are powerful learning tools for classification tasks. However, while combining multiple trees may provide higher prediction quality than a single one, it sacrifices the interpretability property, resulting in "black-box" models. In light of this, we aim to develop an interpretable representation of a tree-ensemble model that can provide valuable insights into its behavior. First, given a target tree-ensemble model, we develop a hierarchical visualization tool based on a heatmap representation of the forest's feature use, considering the frequency of a feature and the level at which it is selected as an indicator of importance. Next, we propose a mixed-integer linear programming (MILP) formulation for constructing a single optimal multivariate tree that accurately mimics the target model predictions. The goal is to provide an interpretable surrogate model based on oblique hyperplane splits, which uses only the most relevant features according to the defined forest's importance indicators. The MILP model includes a penalty on feature selection based on their frequency in the forest to further induce sparsity of the splits. The natural formulation has been strengthened to improve the computational performance of mixed-integer software. Computational experiments are carried out on benchmark datasets from the UCI repository using a state-of-the-art off-the-shelf solver. Results show that the proposed model is effective in yielding a shallow interpretable tree approximating the tree-ensemble decision function.
Submitted 18 January, 2024; v1 submitted 15 February, 2023;
originally announced February 2023.
-
Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness
Authors:
Ruggiero Seccia,
Corrado Coppola,
Giampaolo Liuzzi,
Laura Palagi
Abstract:
In this work, we consider minimizing the average of a very large number of smooth and possibly non-convex functions, and we focus on two widely used minibatch frameworks to tackle this optimization problem: Incremental Gradient (IG) and Random Reshuffling (RR). We define ease-controlled modifications of the IG/RR schemes, which require a light additional computational effort but can be proved to converge under weak and standard assumptions. In particular, we define two algorithmic schemes in which the IG/RR iteration is controlled by using a watchdog rule and a derivative-free linesearch that activates only sporadically to guarantee convergence. The two schemes differ in the watchdog and the linesearch, which are performed using either a monotonic or a non-monotonic rule. The two schemes also allow controlling the updating of the stepsize used in the main IG/RR iteration, avoiding the use of pre-set rules that may drive the stepsize to zero too fast, and reducing the effort in designing effective updating rules for the stepsize. We prove convergence under the mild assumption of Lipschitz continuity of the gradients of the component functions and perform extensive computational analysis using different deep neural architectures and a benchmark of varying-size datasets. We compare our implementation with both a full batch gradient method (i.e., L-BFGS) and an implementation of IG/RR methods, showing that our algorithms require a similar computational effort compared to the other online algorithms and that the control on the learning rate may allow a faster decrease of the objective function.
Submitted 20 May, 2024; v1 submitted 4 December, 2022;
originally announced December 2022.
-
Margin Optimal Classification Trees
Authors:
Federico D'Onofrio,
Giorgio Grani,
Marta Monaci,
Laura Palagi
Abstract:
In recent years, there has been growing attention to interpretable machine learning models which can give explanatory insights on their behaviour. Thanks to their interpretability, decision trees have been intensively studied for classification tasks and, due to the remarkable advances in mixed integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing sparsity of the hyperplanes' coefficients. First, MARGOT has been tested on non-linearly separable synthetic datasets in a 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions effectively select the most relevant features, maintaining good prediction quality.
Submitted 8 October, 2023; v1 submitted 19 October, 2022;
originally announced October 2022.
-
Solving the vehicle routing problem with deep reinforcement learning
Authors:
Simone Foa,
Corrado Coppola,
Giorgio Grani,
Laura Palagi
Abstract:
Recently, the application of Reinforcement Learning (RL) methodologies to NP-Hard combinatorial optimization problems has become a popular topic. This is essentially due to the nature of the traditional combinatorial algorithms, often based on a trial-and-error process. RL aims at automating this process. In this regard, this paper focuses on the application of RL to the Vehicle Routing Problem (VRP), a famous combinatorial problem that belongs to the class of NP-Hard problems. In this work, the problem is first modeled as a Markov Decision Process (MDP), and then the PPO method (which belongs to the Actor-Critic class of Reinforcement Learning methods) is applied. In a second phase, the neural architecture behind the Actor and Critic is established, adopting an architecture based on convolutional neural networks for both the Actor and the Critic. This choice proved effective in addressing problems of different sizes. Experiments performed on a wide range of instances show that the algorithm has good generalization capabilities and can reach good solutions in a short time. Comparisons between the proposed algorithm and the state-of-the-art solver OR-TOOLS show that the latter still outperforms the Reinforcement Learning algorithm. However, there are future research directions aimed at improving the current performance of the proposed algorithm.
Submitted 30 July, 2022;
originally announced August 2022.
-
Block Layer Decomposition schemes for training Deep Neural Networks
Authors:
Laura Palagi,
Ruggiero Seccia
Abstract:
Deep Feedforward Neural Networks' (DFNNs) weights estimation relies on the solution of a very large nonconvex optimization problem that may have many local (non-global) minimizers, saddle points and large plateaus. As a consequence, optimization algorithms can be attracted toward local minimizers, which can lead to bad solutions or can slow down the optimization process. Furthermore, the time needed to find good solutions to the training problem depends on both the number of samples and the number of variables. In this work, we show how Block Coordinate Descent (BCD) methods can be applied to improve the performance of state-of-the-art algorithms by avoiding bad stationary points and flat regions. We first describe a batch BCD method able to effectively tackle the network's depth, and then we further extend the algorithm by proposing a minibatch BCD framework able to scale with respect to both the number of variables and the number of samples by embedding a BCD approach into a minibatch framework. Through extensive numerical results on standard datasets for several network architectures, we show how the application of BCD methods to the training phase of DFNNs makes it possible to outperform standard batch and minibatch algorithms, leading to an improvement in both the training phase and the generalization performance of the networks.
Submitted 18 March, 2020;
originally announced March 2020.
-
A Class of Parallel Decomposition Algorithms for SVMs Training
Authors:
Andrea Manno,
Laura Palagi,
Simone Sagratella
Abstract:
The training of Support Vector Machines may be a very difficult task when dealing with very large datasets. The memory requirement and the time consumption of SVM algorithms grow rapidly as the amount of data increases. To overcome these drawbacks, we propose a parallel decomposition algorithmic scheme for SVM training for which we prove global convergence under suitable conditions. We outline how these assumptions can be satisfied in practice and we suggest various specific implementations exploiting the adaptable structure of the algorithmic model.
Submitted 3 November, 2015; v1 submitted 17 September, 2015;
originally announced September 2015.