-
Nonsmooth Implicit Differentiation: Deterministic and Stochastic Convergence Rates
Authors:
Riccardo Grazzi,
Massimiliano Pontil,
Saverio Salzo
Abstract:
We study the problem of efficiently computing the derivative of the fixed-point of a parametric nondifferentiable contraction map. This problem has wide applications in machine learning, including hyperparameter optimization, meta-learning and data poisoning attacks. We analyze two popular approaches: iterative differentiation (ITD) and approximate implicit differentiation (AID). A key challenge b…
▽ More
We study the problem of efficiently computing the derivative of the fixed-point of a parametric nondifferentiable contraction map. This problem has wide applications in machine learning, including hyperparameter optimization, meta-learning and data poisoning attacks. We analyze two popular approaches: iterative differentiation (ITD) and approximate implicit differentiation (AID). A key challenge behind the nonsmooth setting is that the chain rule does not hold anymore. We build upon the work by Bolte et al. (2022), who prove linear convergence of nonsmooth ITD under a piecewise Lipschitz smooth assumption. In the deterministic case, we provide a linear rate for AID and an improved linear rate for ITD which closely match the ones for the smooth setting. We further introduce NSID, a new stochastic method to compute the implicit derivative when the contraction map is defined as the composition of an outer map and an inner map which is accessible only through a stochastic unbiased estimator. We establish rates for the convergence of NSID, encompassing the best available rates in the smooth setting. We also present illustrative experiments confirming our analysis.
△ Less
Submitted 4 June, 2024; v1 submitted 18 March, 2024;
originally announced March 2024.
-
High Probability Bounds for Stochastic Subgradient Schemes with Heavy Tailed Noise
Authors:
Daniela A. Parletta,
Andrea Paudice,
Massimiliano Pontil,
Saverio Salzo
Abstract:
In this work we study high probability bounds for stochastic subgradient methods under heavy tailed noise. In this setting the noise is only assumed to have finite variance as opposed to a sub-Gaussian distribution for which it is known that standard subgradient methods enjoys high probability bounds. We analyzed a clipped version of the projected stochastic subgradient method, where subgradient e…
▽ More
In this work we study high probability bounds for stochastic subgradient methods under heavy tailed noise. In this setting the noise is only assumed to have finite variance as opposed to a sub-Gaussian distribution for which it is known that standard subgradient methods enjoys high probability bounds. We analyzed a clipped version of the projected stochastic subgradient method, where subgradient estimates are truncated whenever they have large norms. We show that this clipping strategy leads both to near optimal any-time and finite horizon bounds for many classical averaging schemes. Preliminary experiments are shown to support the validity of the method.
△ Less
Submitted 14 April, 2024; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Bilevel Optimization with a Lower-level Contraction: Optimal Sample Complexity without Warm-start
Authors:
Riccardo Grazzi,
Massimiliano Pontil,
Saverio Salzo
Abstract:
We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have pro…
▽ More
We analyse a general class of bilevel problems, in which the upper-level problem consists in the minimization of a smooth objective function and the lower-level problem is to find the fixed point of a smooth contraction map. This type of problems include instances of meta-learning, equilibrium models, hyperparameter optimization and data poisoning adversarial attacks. Several recent works have proposed algorithms which warm-start the lower-level problem, i.e.~they use the previous lower-level approximate solution as a staring point for the lower-level solver. This warm-start procedure allows one to improve the sample complexity in both the stochastic and deterministic settings, achieving in some cases the order-wise optimal sample complexity. However, there are situations, e.g., meta learning and equilibrium models, in which the warm-start procedure is not well-suited or ineffective. In this work we show that without warm-start, it is still possible to achieve order-wise (near) optimal sample complexity. In particular, we propose a simple method which uses (stochastic) fixed point iterations at the lower-level and projected inexact gradient descent at the upper-level, that reaches an $ε$-stationary point using $O(ε^{-2})$ and $\tilde{O}(ε^{-1})$ samples for the stochastic and the deterministic setting, respectively. Finally, compared to methods using warm-start, our approach yields a simpler analysis that does not need to study the coupled interactions between the upper-level and lower-level iterates.
△ Less
Submitted 16 November, 2023; v1 submitted 7 February, 2022;
originally announced February 2022.
-
Convergence of Batch Greenkhorn for Regularized Multimarginal Optimal Transport
Authors:
Vladimir Kostic,
Saverio Salzo,
Massimilano Pontil
Abstract:
In this work we propose a batch version of the Greenkhorn algorithm for multimarginal regularized optimal transport problems. Our framework is general enough to cover, as particular cases, some existing algorithms like Sinkhorn and Greenkhorn algorithm for the bi-marginal setting, and (greedy) MultiSinkhorn for multimarginal optimal transport. We provide a complete convergence analysis, which is b…
▽ More
In this work we propose a batch version of the Greenkhorn algorithm for multimarginal regularized optimal transport problems. Our framework is general enough to cover, as particular cases, some existing algorithms like Sinkhorn and Greenkhorn algorithm for the bi-marginal setting, and (greedy) MultiSinkhorn for multimarginal optimal transport. We provide a complete convergence analysis, which is based on the properties of the iterative Bregman projections (IBP) method with greedy control. Global linear rate of convergence and explicit bound on the iteration complexity are obtained. When specialized to above mentioned algorithms, our results give new insights and/or improve existing ones.
△ Less
Submitted 3 December, 2021; v1 submitted 1 December, 2021;
originally announced December 2021.
-
Convergence Properties of Stochastic Hypergradients
Authors:
Riccardo Grazzi,
Massimiliano Pontil,
Saverio Salzo
Abstract:
Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important wh…
▽ More
Bilevel optimization problems are receiving increasing attention in machine learning as they provide a natural framework for hyperparameter optimization and meta-learning. A key step to tackle these problems is the efficient computation of the gradient of the upper-level objective (hypergradient). In this work, we study stochastic approximation schemes for the hypergradient, which are important when the lower-level problem is empirical risk minimization on a large dataset. The method that we propose is a stochastic variant of the approximate implicit differentiation approach in (Pedregosa, 2016). We provide bounds for the mean square error of the hypergradient approximation, under the assumption that the lower-level problem is accessible only through a stochastic mapping which is a contraction in expectation. In particular, our main bound is agnostic to the choice of the two stochastic solvers employed by the procedure. We provide numerical experiments to support our theoretical analysis and to show the advantage of using stochastic hypergradients in practice.
△ Less
Submitted 17 May, 2025; v1 submitted 13 November, 2020;
originally announced November 2020.
-
On the Iteration Complexity of Hypergradient Computation
Authors:
Riccardo Grazzi,
Luca Franceschi,
Massimiliano Pontil,
Saverio Salzo
Abstract:
We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or…
▽ More
We study a general class of bilevel problems, consisting in the minimization of an upper-level objective which depends on the solution to a parametric fixed-point equation. Important instances arising in machine learning include hyperparameter optimization, meta-learning, and certain graph and recurrent neural networks. Typically the gradient of the upper-level objective (hypergradient) is hard or even impossible to compute exactly, which has raised the interest in approximation methods. We investigate some popular approaches to compute the hypergradient, based on reverse mode iterative differentiation and approximate implicit differentiation. Under the hypothesis that the fixed point equation is defined by a contraction mapping, we present a unified analysis which allows for the first time to quantitatively compare these methods, providing explicit bounds for their iteration complexity. This analysis suggests a hierarchy in terms of computational efficiency among the above methods, with approximate implicit differentiation based on conjugate gradient performing best. We present an extensive experimental comparison among the methods which confirm the theoretical findings.
△ Less
Submitted 10 July, 2020; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Efficient Tensor Kernel methods for sparse regression
Authors:
Feliks Hibraj,
Marcello Pelillo,
Saverio Salzo,
Massimiliano Pontil
Abstract:
Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an lp-norm regularization problem, with p=m/(m-1) and m even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors requires a considerable…
▽ More
Recently, classical kernel methods have been extended by the introduction of suitable tensor kernels so to promote sparsity in the solution of the underlying regression problem. Indeed, they solve an lp-norm regularization problem, with p=m/(m-1) and m even integer, which happens to be close to a lasso problem. However, a major drawback of the method is that storing tensors requires a considerable amount of memory, ultimately limiting its applicability. In this work we address this problem by proposing two advances. First, we directly reduce the memory requirement, by intriducing a new and more efficient layout for storing the data. Second, we use a Nystrom-type subsampling approach, which allows for a training phase with a smaller number of data points, so to reduce the computational cost. Experiments, both on synthetic and read datasets, show the effectiveness of the proposed improvements. Finally, we take case of implementing the cose in C++ so to further speed-up the computation.
△ Less
Submitted 23 March, 2020;
originally announced March 2020.
-
Sinkhorn Barycenters with Free Support via Frank-Wolfe Algorithm
Authors:
Giulia Luise,
Saverio Salzo,
Massimiliano Pontil,
Carlo Ciliberto
Abstract:
We present a novel algorithm to estimate the barycenter of arbitrary probability distributions with respect to the Sinkhorn divergence. Based on a Frank-Wolfe optimization strategy, our approach proceeds by populating the support of the barycenter incrementally, without requiring any pre-allocation. We consider discrete as well as continuous distributions, proving convergence rates of the proposed…
▽ More
We present a novel algorithm to estimate the barycenter of arbitrary probability distributions with respect to the Sinkhorn divergence. Based on a Frank-Wolfe optimization strategy, our approach proceeds by populating the support of the barycenter incrementally, without requiring any pre-allocation. We consider discrete as well as continuous distributions, proving convergence rates of the proposed algorithm in both settings. Key elements of our analysis are a new result showing that the Sinkhorn divergence on compact domains has Lipschitz continuous gradient with respect to the Total Variation and a characterization of the sample complexity of Sinkhorn potentials. Experiments validate the effectiveness of our method in practice.
△ Less
Submitted 30 May, 2019;
originally announced May 2019.
-
Far-HO: A Bilevel Programming Package for Hyperparameter Optimization and Meta-Learning
Authors:
Luca Franceschi,
Riccardo Grazzi,
Massimiliano Pontil,
Saverio Salzo,
Paolo Frasconi
Abstract:
In (Franceschi et al., 2018) we proposed a unified mathematical framework, grounded on bilevel programming, that encompasses gradient-based hyperparameter optimization and meta-learning. We formulated an approximate version of the problem where the inner objective is solved iteratively, and gave sufficient conditions ensuring convergence to the exact problem. In this work we show how to optimize l…
▽ More
In (Franceschi et al., 2018) we proposed a unified mathematical framework, grounded on bilevel programming, that encompasses gradient-based hyperparameter optimization and meta-learning. We formulated an approximate version of the problem where the inner objective is solved iteratively, and gave sufficient conditions ensuring convergence to the exact problem. In this work we show how to optimize learning rates, automatically weight the loss of single examples and learn hyper-representations with Far-HO, a software package based on the popular deep learning framework TensorFlow that allows to seamlessly tackle both HO and ML problems.
△ Less
Submitted 13 June, 2018;
originally announced June 2018.
-
Bilevel Programming for Hyperparameter Optimization and Meta-Learning
Authors:
Luca Franceschi,
Paolo Frasconi,
Saverio Salzo,
Riccardo Grazzi,
Massimilano Pontil
Abstract:
We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised l…
▽ More
We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by taking into explicit account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables take either the meaning of hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning and contrast the bilevel approach against classical approaches for learning-to-learn.
△ Less
Submitted 3 July, 2018; v1 submitted 13 June, 2018;
originally announced June 2018.
-
Latent Variable Time-varying Network Inference
Authors:
Federico Tomasi,
Veronica Tozzo,
Saverio Salzo,
Alessandro Verri
Abstract:
In many applications of finance, biology and sociology, complex systems involve entities interacting with each other. These processes have the peculiarity of evolving over time and of comprising latent factors, which influence the system without being explicitly measured. In this work we present latent variable time-varying graphical lasso (LTGL), a method for multivariate time-series graphical mo…
▽ More
In many applications of finance, biology and sociology, complex systems involve entities interacting with each other. These processes have the peculiarity of evolving over time and of comprising latent factors, which influence the system without being explicitly measured. In this work we present latent variable time-varying graphical lasso (LTGL), a method for multivariate time-series graphical modelling that considers the influence of hidden or unmeasurable factors. The estimation of the contribution of the latent factors is embedded in the model which produces both sparse and low-rank components for each time point. In particular, the first component represents the connectivity structure of observable variables of the system, while the second represents the influence of hidden factors, assumed to be few with respect to the observed variables. Our model includes temporal consistency on both components, providing an accurate evolutionary pattern of the system. We derive a tractable optimisation algorithm based on alternating direction method of multipliers, and develop a scalable and efficient implementation which exploits proximity operators in closed form. LTGL is extensively validated on synthetic data, achieving optimal performance in terms of accuracy, structure learning and scalability with respect to ground truth and state-of-the-art methods for graphical inference. We conclude with the application of LTGL to real case studies, from biology and finance, to illustrate how our method can be successfully employed to gain insights on multivariate time-series data.
△ Less
Submitted 2 August, 2018; v1 submitted 12 February, 2018;
originally announced February 2018.
-
Solving $\ell^p\!$-norm regularization with tensor kernels
Authors:
Saverio Salzo,
Johan A. K. Suykens,
Lorenzo Rosasco
Abstract:
In this paper, we discuss how a suitable family of tensor kernels can be used to efficiently solve nonparametric extensions of $\ell^p$ regularized learning methods. Our main contribution is proposing a fast dual algorithm, and showing that it allows to solve the problem efficiently. Our results contrast recent findings suggesting kernel methods cannot be extended beyond Hilbert setting. Numerical…
▽ More
In this paper, we discuss how a suitable family of tensor kernels can be used to efficiently solve nonparametric extensions of $\ell^p$ regularized learning methods. Our main contribution is proposing a fast dual algorithm, and showing that it allows to solve the problem efficiently. Our results contrast recent findings suggesting kernel methods cannot be extended beyond Hilbert setting. Numerical experiments confirm the effectiveness of the method.
△ Less
Submitted 18 October, 2017; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Generalized support vector regression: duality and tensor-kernel representation
Authors:
Saverio Salzo,
Johan A. K. Suykens
Abstract:
In this paper we study the variational problem associated to support vector regression in Banach function spaces. Using the Fenchel-Rockafellar duality theory, we give explicit formulation of the dual problem as well as of the related optimality conditions. Moreover, we provide a new computational framework for solving the problem which relies on a tensor-kernel representation. This analysis overc…
▽ More
In this paper we study the variational problem associated to support vector regression in Banach function spaces. Using the Fenchel-Rockafellar duality theory, we give explicit formulation of the dual problem as well as of the related optimality conditions. Moreover, we provide a new computational framework for solving the problem which relies on a tensor-kernel representation. This analysis overcomes the typical difficulties connected to learning in Banach spaces. We finally present a large class of tensor-kernels to which our theory fully applies: power series tensor kernels. This type of kernels describe Banach spaces of analytic functions and include generalizations of the exponential and polynomial kernels as well as, in the complex case, generalizations of the Szegö and Bergman kernels.
△ Less
Submitted 5 May, 2017; v1 submitted 18 March, 2016;
originally announced March 2016.