Search | arXiv e-print repository

Iteratively reweighted kernel machines efficiently learn sparse functions

Authors: Libin Zhu, Damek Davis, Dmitriy Drusvyatskiy, Maryam Fazel

Abstract: The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks, and can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials with finite leap complexity. Numerical experiments illustrate the developed theory.

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2504.03148 [pdf, other]

Spectral norm bound for the product of random Fourier-Walsh matrices

Authors: Libin Zhu, Damek Davis, Dmitriy Drusvyatskiy, Maryam Fazel

Abstract: We consider matrix products of the form $A_1(A_2A_2^\top)\ldots(A_{m}A_{m}^\top)A_{m+1}$, where $A_i$ are normalized random Fourier-Walsh matrices. We identify an interesting polynomial scaling regime when the operator norm of the expected matrix product tends to zero as the dimension tends to infinity.

Submitted 3 April, 2025; originally announced April 2025.

Comments: 18 pages, 2 figures

arXiv:2502.05305 [pdf, other]

Online Covariance Estimation in Nonsmooth Stochastic Approximation

Authors: Liwei Jiang, Abhishek Roy, Krishna Balasubramanian, Damek Davis, Dmitriy Drusvyatskiy, Sen Na

Abstract: We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.

Submitted 7 February, 2025; originally announced February 2025.

Comments: 46 pages, 1 figure

arXiv:2502.01886 [pdf, other]

Invariant Kernels: Rank Stabilization and Generalization Across Dimensions

Authors: Mateo Díaz, Dmitriy Drusvyatskiy, Jack Kendrick, Rekha R. Thomas

Abstract: Symmetry arises often when learning from high dimensional data. For example, data sets consisting of point clouds, graphs, and unordered sets appear routinely in contemporary applications, and exhibit rich underlying symmetries. Understanding the benefits of symmetry on the statistical and numerical efficiency of learning algorithms is an active area of research. In this work, we show that symmetry has a pronounced impact on the rank of kernel matrices. Specifically, we compute the rank of a polynomial kernel of fixed degree that is invariant under various groups acting independently on its two arguments. In concrete circumstances, including the three aforementioned examples, symmetry dramatically decreases the rank making it independent of the data dimension. In such settings, we show that a simple regression procedure is minimax optimal for estimating an invariant polynomial from finitely many samples drawn across different dimensions. We complete the paper with numerical experiments that illustrate our findings.

Submitted 3 February, 2025; originally announced February 2025.

arXiv:2409.19791 [pdf, other]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Authors: Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

Abstract: A prevalent belief among optimization specialists is that linear convergence of gradient descent is contingent on the function growing quadratically away from its minimizers. In this work, we argue that this belief is inaccurate. We show that gradient descent with an adaptive stepsize converges at a local (nearly) linear rate on any smooth function that merely exhibits fourth-order growth away from its minimizer. The adaptive stepsize we propose arises from an intriguing decomposition theorem: any such function admits a smooth manifold around the optimal solution -- which we call the ravine -- so that the function grows at least quadratically away from the ravine and has constant order growth along it. The ravine allows one to interlace many short gradient steps with a single long Polyak gradient step, which together ensure rapid convergence to the minimizer. We illustrate the theory and algorithm on the problems of matrix sensing and factorization and learning a single neuron in the overparameterized regime.

Submitted 29 September, 2024; originally announced September 2024.

Comments: 58 pages, 5 figures

MSC Class: 65K05; 65K10; 90C30; 90C06

arXiv:2405.09676 [pdf, ps, other]

The radius of statistical efficiency

Authors: Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: Classical results in asymptotic statistics show that the Fisher information matrix controls the difficulty of estimating a statistical model from observed data. In this work, we introduce a companion measure of robustness of an estimation problem: the radius of statistical efficiency (RSE) is the size of the smallest perturbation to the problem data that renders the Fisher information matrix singular. We compute RSE up to numerical constants for a variety of test bed problems, including principal component analysis, generalized linear models, phase retrieval, bilinear sensing, and matrix completion. In all cases, the RSE quantifies the compatibility between the covariance of the population data and the latent model parameter. Interestingly, we observe a precise reciprocal relationship between RSE and the intrinsic complexity/sensitivity of the problem instance, paralleling the classical Eckart-Young theorem in numerical analysis.

Submitted 15 May, 2024; originally announced May 2024.

MSC Class: 90C15; 49K40; 62F12; 90C31

arXiv:2401.04553 [pdf, other]

Linear Recursive Feature Machines provably recover low-rank matrices

Authors: Adityanarayanan Radhakrishnan, Mikhail Belkin, Dmitriy Drusvyatskiy

Abstract: A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.

Submitted 9 January, 2024; originally announced January 2024.

arXiv:2306.02601 [pdf, other]

Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Mikhail Belkin, Damek Davis, Yi-An Ma

Abstract: Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2303.16277 [pdf, ps, other]

The slope robustly determines convex functions

Authors: Aris Daniilidis, Dmitriy Drusvyatskiy

Abstract: We show that the deviation between the slopes of two convex functions controls the deviation between the functions themselves. This result reveals that the slope -- a one dimensional construct -- robustly determines convex functions, up to a constant of integration.

Submitted 28 March, 2023; originally announced March 2023.

MSC Class: 26B25; 49K40; 37C10; 49J52

arXiv:2301.06632 [pdf, other]

Asymptotic normality and optimality in nonsmooth stochastic approximation

Authors: Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

Abstract: In their seminal work, Polyak and Juditsky showed that stochastic approximation algorithms for solving smooth equations enjoy a central limit theorem. Moreover, it has since been argued that the asymptotic covariance of the method is best possible among any estimation procedure in a local minimax sense of Hájek and Le Cam. A long-standing open question in this line of work is whether similar guarantees hold for important non-smooth problems, such as stochastic nonlinear programming or stochastic variational inequalities. In this work, we show that this is indeed the case.

Submitted 16 January, 2023; originally announced January 2023.

Comments: The arxiv report arXiv:2108.11832 has been split into two parts. This is Part 2 of the original submission, augmented by some new results and a reworked exposition

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:2207.04173 [pdf, other]

Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality

Authors: Joshua Cutler, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.

Submitted 13 March, 2024; v1 submitted 8 July, 2022; originally announced July 2022.

Comments: 49 pages, 1 figure. v2: revised asymptotic optimality results and reworked exposition. v3: minor updates

MSC Class: 90C15; 90C25

Journal ref: Journal of Machine Learning Research, 25(90):1-49, 2024

arXiv:2204.08281 [pdf, other]

Decision-Dependent Risk Minimization in Geometrically Decaying Dynamic Environments

Authors: Mitas Ray, Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J. Ratliff

Abstract: This paper studies the problem of expected loss minimization given a data distribution that is dependent on the decision-maker's action and evolves dynamically in time according to a geometric decay process. Novel algorithms for both the information setting in which the decision-maker has a first order gradient oracle and the setting in which they have simply a loss function oracle are introduced. The algorithms operate on the same underlying principle: the decision-maker repeatedly deploys a fixed decision over the length of an epoch, thereby allowing the dynamically changing environment to sufficiently mix before updating the decision. The iteration complexity in each of the settings is shown to match existing rates for first and zero order stochastic gradient methods up to logarithmic factors. The algorithms are evaluated on a "semi-synthetic" example using real world data from the SFpark dynamic pricing pilot study; it is shown that the announced prices result in an improvement for the institution's objective (target occupancy), while achieving an overall reduction in parking rates.

Submitted 8 April, 2022; originally announced April 2022.

Comments: Accepted at AAAI 2022

arXiv:2203.03756 [pdf, other]

Flat minima generalize for low-rank matrix recovery

Authors: Lijun Ding, Dmitriy Drusvyatskiy, Maryam Fazel, Zaid Harchaoui

Abstract: Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We conclude with synthetic experiments that illustrate our findings and discuss the effect of depth on flat solutions.

Submitted 17 February, 2023; v1 submitted 7 March, 2022; originally announced March 2022.

Comments: 36 pages

arXiv:2201.03398 [pdf, other]

Multiplayer Performative Prediction: Learning in Decision-Dependent Games

Authors: Adhyyan Narang, Evan Faulkner, Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J. Ratliff

Abstract: Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game theoretic framework for this phenomenon, called "multi-player performative prediction". We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but can be found efficiently only when the game is monotone. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and the repeated (stochastic) gradient method. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.

Submitted 6 April, 2022; v1 submitted 10 January, 2022; originally announced January 2022.

arXiv:2112.06969 [pdf, ps, other]

A gradient sampling method with complexity guarantees for Lipschitz functions in high and low dimensions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Yin Tat Lee, Swati Padmanabhan, Guanghao Ye

Abstract: Zhang et al. introduced a novel modification of Goldstein's classical subgradient method, with an efficiency guarantee of $O(\varepsilon^{-4})$ for minimizing Lipschitz functions. Their work, however, makes use of a nonstandard subgradient oracle model and requires the function to be directionally differentiable. In this paper, we show that both of these assumptions can be dropped by simply adding a small random perturbation in each step of their algorithm. The resulting method works on any Lipschitz function whose value and gradient can be evaluated at points of differentiability. We additionally present a new cutting plane algorithm that achieves better efficiency in low dimensions: $O(d\varepsilon^{-3})$ for Lipschitz functions and $O(d\varepsilon^{-2})$ for those that are weakly convex.

Submitted 15 February, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

Comments: 14 pages

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:2111.09456 [pdf, ps, other]

Improved Rates for Derivative Free Gradient Play in Strongly Monotone Games

Authors: Dmitriy Drusvyatskiy, Maryam Fazel, Lillian J Ratliff

Abstract: The influential work of Bravo et al. 2018 shows that derivative free play in strongly monotone games has complexity $O(d^2/\varepsilon^3)$, where $\varepsilon$ is the target accuracy on the expected squared distance to the solution. This note shows that the efficiency estimate is actually $O(d^2/\varepsilon^2)$, which reduces to the known efficiency guarantee for the method in unconstrained optimization. The argument we present simply interprets the method as stochastic gradient play on a slightly perturbed strongly monotone game.

Submitted 6 April, 2022; v1 submitted 17 November, 2021; originally announced November 2021.

arXiv:2108.11832 [pdf, other]

Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization

Authors: Damek Davis, Dmitriy Drusvyatskiy, Liwei Jiang

Abstract: We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.

Submitted 9 January, 2023; v1 submitted 26 August, 2021; originally announced August 2021.

Comments: Version 1 of the arxiv report has been split into two parts. Version 2 of the arxiv report is Part 1 of the original submission. Part 2 will appear as a separate arxiv submission

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:2108.07356 [pdf, other]

Stochastic Optimization under Distributional Drift

Authors: Joshua Cutler, Dmitriy Drusvyatskiy, Zaid Harchaoui

Abstract: We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.

Submitted 26 May, 2023; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: 56 pages, 7 figures. v2: unified analysis of time- and decision-dependent settings; updated numerical experiments. v3: added references and updated exposition. v4: minor updates to match the version published in JMLR

MSC Class: 90C15; 90C25

Journal ref: Journal of Machine Learning Research, 24(147):1-56, 2023

arXiv:2106.09815 [pdf, other]

Escaping strict saddle points of the Moreau envelope in nonsmooth optimization

Authors: Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.

Submitted 17 June, 2021; originally announced June 2021.

Comments: 29 pages, 1 figure

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:2102.08484 [pdf, ps, other]

Conservative and semismooth derivatives are equivalent for semialgebraic maps

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: Subgradient and Newton algorithms for nonsmooth optimization require generalized derivatives to satisfy subtle approximation properties: conservativity for the former and semismoothness for the latter. Though these two properties originate in entirely different contexts, we show that in the semi-algebraic setting they are equivalent. Both properties for a generalized derivative simply require it to coincide with the standard directional derivative on the tangent spaces of some partition of the domain into smooth manifolds. An appealing byproduct is a new short proof that semi-algebraic maps are semismooth relative to the Clarke Jacobian.

Submitted 16 February, 2021; originally announced February 2021.

Comments: 12 pages

MSC Class: Primary: 49J53; 49J52; Secondary: 32B20; 14P15

arXiv:2011.11173 [pdf, other]

Stochastic optimization with decision-dependent distributions

Authors: Dmitriy Drusvyatskiy, Lin Xiao

Abstract: Stochastic optimization problems often involve data distributions that change in reaction to the decision variables. This is the case for example when members of the population respond to a deployed classifier by manipulating their features so as to improve the likelihood of being positively labeled. Recent works on performative prediction have identified an intriguing solution concept for such problems: find the decision that is optimal with respect to the static distribution that the decision induces. Continuing this line of work, we show that typical stochastic algorithms -- originally designed for static problems -- can be applied directly for finding such equilibria with little loss in efficiency. The reason is simple to explain: the main consequence of the distributional shift is that it corrupts algorithms with a bias that decays linearly with the distance to the solution. Using this perspective, we obtain sharp convergence guarantees for popular algorithms, such as stochastic gradient, clipped gradient, proximal point, and dual averaging methods, along with their accelerated and proximal variants. In realistic applications, deployment of a decision rule is often much more expensive than sampling. We show how to modify the aforementioned algorithms so as to maintain their sample efficiency while performing only logarithmically many deployments.

Submitted 13 December, 2020; v1 submitted 22 November, 2020; originally announced November 2020.

Comments: 60 pages

MSC Class: 90C15; 90C25

arXiv:2002.06309 [pdf, other]

doi 10.1137/20M1320225

Stochastic optimization over proximally smooth sets

Authors: Damek Davis, Dmitriy Drusvyatskiy, Zhan Shi

Abstract: We introduce a class of stochastic algorithms for minimizing weakly convex functions over proximally smooth sets. As their main building blocks, the algorithms use simplified models of the objective function and the constraint set, along with a retraction operation to restore feasibility. All the proposed methods come equipped with a finite time efficiency guarantee in terms of a natural stationarity measure. We discuss consequences for nonsmooth optimization over smooth manifolds and over sets cut out by weakly-convex inequalities.

Submitted 14 February, 2020; originally announced February 2020.

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1912.07146 [pdf, other]

Proximal methods avoid active strict saddles of weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.

Submitted 16 February, 2021; v1 submitted 15 December, 2019; originally announced December 2019.

Comments: 43 pages, 2 figures

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:1910.13604 [pdf, other]

Pathological subgradient dynamics

Authors: Aris Daniilidis, Dmitriy Drusvyatskiy

Abstract: We construct examples of Lipschitz continuous functions, with pathological subgradient dynamics both in continuous and discrete time. In both settings, the iterates generate bounded trajectories, and yet fail to detect any (generalized) critical points of the function.

Submitted 29 October, 2019; originally announced October 2019.

Comments: 14 pages, 1 figure

MSC Class: 90C30; 49J52; 65K10

arXiv:1908.07615 [pdf, other]

Iterative Linearized Control: Stable Algorithms and Complexity Guarantees

Authors: Vincent Roulet, Siddhartha Srinivasa, Dmitriy Drusvyatskiy, Zaid Harchaoui

Abstract: We examine popular gradient-based algorithms for nonlinear control in the light of the modern complexity analysis of first-order optimization algorithms. The examination reveals that the complexity bounds can be clearly stated in terms of calls to a computational oracle related to dynamic programming and implementable by gradient back-propagation using machine learning software libraries such as PyTorch or TensorFlow. Finally, we propose a regularized Gauss-Newton algorithm enjoying worst-case complexity bounds and improved convergence behavior in practice. The software library based on PyTorch is publicly available.

Submitted 20 August, 2019; originally announced August 2019.

Comments: Short version appeared in International Conference on Machine Learning (ICML) 2019

arXiv:1907.13307 [pdf, ps, other]

From low probability to high confidence in stochastic convex optimization

Authors: Damek Davis, Dmitriy Drusvyatskiy, Lin Xiao, Junyu Zhang

Abstract: Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on "light-tail" noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.

Submitted 16 October, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

Comments: 37 pages

MSC Class: 65K05; 65K10; 90C15; 90C25

arXiv:1907.09547 [pdf, other]

Stochastic algorithms with geometric step decay converge linearly on sharp functions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Vasileios Charisopoulos

Abstract: Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.

Submitted 22 July, 2019; originally announced July 2019.

MSC Class: 65K05; 65K10; 90C15; 90C30; 90C06

arXiv:1904.10020 [pdf, other]

Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence

Authors: Vasileios Charisopoulos, Yudong Chen, Damek Davis, Mateo Díaz, Lijun Ding, Dmitriy Drusvyatskiy

Abstract: The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.

Submitted 22 April, 2019; originally announced April 2019.

Comments: 80 pages

MSC Class: 65K10; 90C06

arXiv:1901.01624 [pdf, other]

Composite optimization for robust blind deconvolution

Authors: Vasileios Charisopoulos, Damek Davis, Mateo Díaz, Dmitriy Drusvyatskiy

Abstract: The blind deconvolution problem seeks to recover a pair of vectors from a set of rank one bilinear measurements. We consider a natural nonsmooth formulation of the problem and show that under standard statistical assumptions, its moduli of weak convexity, sharpness, and Lipschitz continuity are all dimension independent. This phenomenon persists even when up to half of the measurements are corrupted by noise. Consequently, standard algorithms, such as the subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. We then complete the paper with a new initialization strategy, complementing the local search algorithms. The initialization procedure is both provably efficient and robust to outlying measurements. Numerical experiments, on both simulated and real data, illustrate the developed theory and methods.

Submitted 18 January, 2019; v1 submitted 6 January, 2019; originally announced January 2019.

Comments: 60 pages, 14 figures

MSC Class: 65K10; 90C06

arXiv:1811.01298 [pdf, ps, other]

Inexact alternating projections on nonconvex sets

Authors: Dmitriy Drusvyatskiy, Adrian S. Lewis

Abstract: Given two arbitrary closed sets in Euclidean space, a simple transversality condition guarantees that the method of alternating projections converges locally, at linear rate, to a point in the intersection. Exact projection onto nonconvex sets is typically intractable, but we show that computationally-cheap inexact projections may suffice instead. In particular, if one set is defined by sufficiently regular smooth constraints, then projecting onto the approximation obtained by linearizing those constraints around the current iterate suffices. On the other hand, if one set is a smooth manifold represented through local coordinates, then the approximate projection resulting from linearizing the coordinate system around the preceding iterate on the manifold also suffices.

Submitted 3 November, 2018; originally announced November 2018.

MSC Class: 49M20; 65K10; 90C30

arXiv:1810.07590 [pdf, ps, other]

Graphical Convergence of Subgradients in Nonconvex Optimization and Learning

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem.

Submitted 17 December, 2018; v1 submitted 17 October, 2018; originally announced October 2018.

Comments: 36 pages

MSC Class: 65K10; 90C15; 68Q32

arXiv:1807.00255 [pdf, ps, other]

Stochastic model-based minimization under high-order growth

Authors: Damek Davis, Dmitriy Drusvyatskiy, Kellie J. MacPhee

Abstract: Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively sample and minimize stochastic convex models of the objective function. Assuming that the one-sided approximation quality and the variation of the models is controlled by a Bregman divergence, we show that the scheme drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. Under additional convexity and relative strong convexity assumptions, the function values converge to the minimum at the rate of $O(k^{-1/2})$ and $\widetilde{O}(k^{-1})$, respectively. We discuss consequences for stochastic proximal point, mirror descent, regularized Gauss-Newton, and saddle point algorithms.

Submitted 30 June, 2018; originally announced July 2018.

Comments: 30 pages

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1804.07795 [pdf, other]

Stochastic subgradient method converges on tame functions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.

Submitted 25 May, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

Comments: 32 pages, 1 figure

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1803.06523 [pdf, other]

Stochastic model-based minimization of weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle, underlying the complexity guarantees, is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.

Submitted 26 August, 2018; v1 submitted 17 March, 2018; originally announced March 2018.

Comments: 33 pages, 4 figures

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1803.02461 [pdf, other]

Subgradient methods for sharp weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy, Kellie J. MacPhee, Courtney Paquette

Abstract: Subgradient methods converge linearly on a convex function that grows sharply away from its solution set. In this work, we show that the same is true for sharp functions that are only weakly convex, provided that the subgradient methods are initialized within a fixed tube around the solution set. A variety of statistical and signal processing tasks come equipped with good initialization, and provably lead to formulations that are both weakly convex and sharp. Therefore, in such settings, subgradient methods can serve as inexpensive local search procedures. We illustrate the proposed techniques on phase retrieval and covariance estimation problems.

Submitted 6 March, 2018; originally announced March 2018.

Comments: 16 pages, 3 figures

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1802.08556 [pdf, ps, other]

Complexity of finding near-stationary points of convex functions stochastically

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: In a recent paper, we showed that the stochastic subgradient method applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. In this supplementary note, we present a stochastic subgradient method for minimizing a convex function, with the improved rate $\widetilde O(k^{-1/2})$.

Submitted 21 February, 2018; originally announced February 2018.

Comments: 9 pages

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1802.02988 [pdf, ps, other]

Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions

Authors: Damek Davis, Dmitriy Drusvyatskiy

Abstract: We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.

Submitted 19 February, 2018; v1 submitted 8 February, 2018; originally announced February 2018.

Comments: 12 pages

MSC Class: 65K05; 65K10; 90C15; 90C30

arXiv:1712.06038 [pdf, ps, other]

The proximal point method revisited

Authors: Dmitriy Drusvyatskiy

Abstract: In this short survey, I revisit the role of the proximal point method in large scale optimization. I focus on three recent examples: a proximally guided subgradient method for weakly convex stochastic approximation, the prox-linear algorithm for minimizing compositions of convex functions and smooth maps, and Catalyst generic acceleration for regularized Empirical Risk Minimization.

Submitted 16 December, 2017; originally announced December 2017.

Comments: 11 pages, submitted to SIAG/OPT Views and News

MSC Class: 65K05; 90C06; 90C25; 90C30

arXiv:1711.03247 [pdf, other]

The nonsmooth landscape of phase retrieval

Authors: Damek Davis, Dmitriy Drusvyatskiy, Courtney Paquette

Abstract: We consider a popular nonsmooth formulation of the real phase retrieval problem. We show that under standard statistical assumptions, a simple subgradient method converges linearly when initialized within a constant relative distance of an optimal solution. Seeking to understand the distribution of the stationary points of the problem, we complete the paper by proving that as the number of Gaussian measurements increases, the stationary points converge to a codimension two set, at a controlled rate. Experiments on image recovery problems illustrate the developed algorithm and theory.

Submitted 6 January, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

Comments: 42 pages, 15 figures

MSC Class: 65K10; 90C06

arXiv:1706.03705 [pdf, other]

The many faces of degeneracy in conic optimization

Authors: Dmitriy Drusvyatskiy, Henry Wolkowicz

Abstract: Slater's condition -- existence of a "strictly feasible solution" -- is a common assumption in conic optimization. Without strict feasibility, first-order optimality conditions may be meaningless, the dual problem may yield little information about the primal, and small changes in the data may render the problem infeasible. Hence, failure of strict feasibility can negatively impact off-the-shelf numerical methods, such as primal-dual interior point methods, in particular. New optimization modelling techniques and convex relaxations for hard nonconvex problems have shown that the loss of strict feasibility is a more pronounced phenomenon than has previously been realized. In this text, we describe various reasons for the loss of strict feasibility, whether due to poor modelling choices or (more interestingly) rich underlying structure, and discuss ways to cope with it and, in many pronounced cases, how to use it as an advantage. In large part, we emphasize the facial reduction preprocessing technique due to its mathematical elegance, geometric transparency, and computational potential.

Submitted 12 June, 2017; originally announced June 2017.

Comments: 99 pages, 5 figures, 2 tables

arXiv:1703.10993 [pdf, other]

Catalyst Acceleration for Gradient-Based Non-Convex Optimization

Authors: Courtney Paquette, Hongzhou Lin, Dmitriy Drusvyatskiy, Julien Mairal, Zaid Harchaoui

Abstract: We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may originally require convexity to operate, the proposed approach allows one to use them on weakly convex objectives, which covers a large class of non-convex functions typically appearing in machine learning and signal processing. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves near-optimal convergence rate in function values. These properties are achieved without assuming any knowledge about the convexity of the objective, by automatically adapting to the unknown weak convexity constant. We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.

Submitted 31 December, 2018; v1 submitted 31 March, 2017; originally announced March 2017.

arXiv:1702.08649 [pdf, other]

Foundations of gauge and perspective duality

Authors: Alexandre Y. Aravkin, James V. Burke, Dmitriy Drusvyatskiy, Michael P. Friedlander, Kellie MacPhee

Abstract: We revisit the foundations of gauge duality and demonstrate that it can be explained using a modern approach to duality based on a perturbation framework. We therefore put gauge duality and Fenchel-Rockafellar duality on equal footing, including explaining gauge dual variables as sensitivity measures, and showing how to recover primal solutions from those of the gauge dual. This vantage point allows a direct proof that optimal solutions of the Fenchel-Rockafellar dual of the gauge dual are precisely the primal solutions rescaled by the optimal value. We extend the gauge duality framework to the setting in which the functional components are general nonnegative convex functions, including problems with piecewise linear quadratic functions and constraints that arise from generalized linear models used in regression.

Submitted 18 June, 2018; v1 submitted 28 February, 2017; originally announced February 2017.

Comments: 29 pages

arXiv:1610.03446 [pdf, ps, other]

Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria

Authors: Dmitriy Drusvyatskiy, Alexander D. Ioffe, Adrian S. Lewis

Abstract: We consider optimization algorithms that successively minimize simple Taylor-like models of the objective function. Methods of Gauss-Newton type for minimizing the composition of a convex function and a smooth map are common examples. Our main result is an explicit relationship between the step-size of any such algorithm and the slope of the function at a nearby point. Consequently, we (1) show that the step-sizes can be reliably used to terminate the algorithm, (2) prove that as long as the step-sizes tend to zero, every limit point of the iterates is stationary, and (3) show that conditions, akin to classical quadratic growth, imply that the step-sizes linearly bound the distance of the iterates to the solution set. The latter so-called error bound property is typically used to establish linear (or faster) convergence guarantees. Analogous results hold when the step-size is replaced by the square root of the decrease in the model's value. We complete the paper with extensions to when the models are minimized only inexactly.

Submitted 11 October, 2016; originally announced October 2016.

Comments: 23 pages

MSC Class: 65K05; 90C30; 49M37; 65K10

arXiv:1606.02395 [pdf, ps, other]

Efficient quadratic penalization through the partial minimization technique

Authors: Aleksandr Y. Aravkin, Dmitriy Drusvyatskiy, Tristan van Leeuwen

Abstract: Common computational problems, such as parameter estimation in dynamic models and PDE constrained optimization, require data fitting over a set of auxiliary parameters subject to physical constraints over an underlying state. Naive quadratically penalized formulations, commonly used in practice, suffer from inherent ill-conditioning. We show that surprisingly the partial minimization technique regularizes the problem, making it well-conditioned. This viewpoint sheds new light on variable projection techniques, as well as the penalty method for PDE constrained optimization, and motivates robust extensions. In addition, we outline an inexact analysis, showing that the partial minimization subproblem can be solved very loosely in each iteration. We illustrate the theory and algorithms on boundary control, optimal transport, and parameter estimation for robust dynamic inference.

Submitted 17 September, 2017; v1 submitted 8 June, 2016; originally announced June 2016.

Comments: 8 pages, 9 figures

MSC Class: 65K05; 65K10; 86-08

arXiv:1605.00125 [pdf, ps, other]

Efficiency of minimizing compositions of convex functions and smooth maps

Authors: Dmitriy Drusvyatskiy, Courtney Paquette

Abstract: We consider global efficiency of algorithms for minimizing a sum of a convex function and a composition of a Lipschitz convex function with a smooth map. The basic algorithm we rely on is the prox-linear method, which in each iteration solves a regularized subproblem formed by linearizing the smooth map. When the subproblems are solved exactly, the method has efficiency $\mathcal{O}(\varepsilon^{-2})$, akin to gradient descent for smooth minimization. We show that when the subproblems can only be solved by first-order methods, a simple combination of smoothing, the prox-linear method, and a fast-gradient scheme yields an algorithm with complexity $\widetilde{\mathcal{O}}(\varepsilon^{-3})$. The technique readily extends to minimizing an average of $m$ composite functions, with complexity $\widetilde{\mathcal{O}}(m/\varepsilon^{2}+\sqrt{m}/\varepsilon^{3})$ in expectation. We round off the paper with an inertial prox-linear method that automatically accelerates in presence of convexity.

Submitted 14 August, 2017; v1 submitted 30 April, 2016; originally announced May 2016.

MSC Class: 97N60; 90C25; 90C06; 90C30

arXiv:1604.06543 [pdf, other]

An optimal first order method based on optimal quadratic averaging

Authors: Dmitriy Drusvyatskiy, Maryam Fazel, Scott Roy

Abstract: In a recent paper, Bubeck, Lee, and Singh introduced a new first order method for minimizing smooth strongly convex functions. Their geometric descent algorithm, largely inspired by the ellipsoid method, enjoys the optimal linear rate of convergence. We show that the same iterate sequence is generated by a scheme that in each iteration computes an optimal average of quadratic lower-models of the function. Indeed, the minimum of the averaged quadratic approaches the true minimum at an optimal rate. This intuitive viewpoint reveals clear connections to the original fast-gradient methods and cutting plane ideas, and leads to limited-memory extensions with improved performance.

Submitted 28 February, 2017; v1 submitted 22 April, 2016; originally announced April 2016.

Comments: 23 pages

MSC Class: 90C25; 90C06

arXiv:1602.06661 [pdf, ps, other]

Error bounds, quadratic growth, and linear convergence of proximal methods

Authors: Dmitriy Drusvyatskiy, Adrian S. Lewis

Abstract: The proximal gradient algorithm for minimizing the sum of a smooth and a nonsmooth convex function often converges linearly even without strong convexity. One common reason is that a multiple of the step length at each iteration may linearly bound the "error" -- the distance to the solution set. We explain the observed linear convergence intuitively by proving the equivalence of such an error bound to a natural quadratic growth condition. Our approach generalizes to linear convergence analysis for proximal methods (of Gauss-Newton type) for minimizing compositions of nonsmooth functions with smooth mappings. We observe incidentally that short step-lengths in the algorithm indicate near-stationarity, suggesting a reliable termination criterion.

Submitted 27 June, 2016; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: 35 pages

MSC Class: 90C25; 90C31; 90C55; 65K10

arXiv:1602.01506 [pdf, other]

Level-set methods for convex optimization

Authors: Aleksandr Y. Aravkin, James V. Burke, Dmitriy Drusvyatskiy, Michael P. Friedlander, Scott Roy

Abstract: Convex optimization problems arising in applications often have favorable objective functions and complicated constraints, thereby precluding first-order methods from being immediately applicable. We describe an approach that exchanges the roles of the objective and constraint functions, and instead approximately solves a sequence of parametric level-set problems. A zero-finding procedure, based on inexact function evaluations and possibly inexact derivative information, leads to an efficient solution scheme for the original problem. We describe the theoretical and practical properties of this approach for a broad range of problems, including low-rank semidefinite optimization, sparse optimization, and generalized linear models for inference.

Submitted 3 February, 2016; originally announced February 2016.

Comments: 38 pages

arXiv:1601.07210 [pdf, ps, other]

The Euclidean Distance Degree of Orthogonally Invariant Matrix Varieties

Authors: Dmitriy Drusvyatskiy, Hon-Leung Lee, Giorgio Ottaviani, Rekha R. Thomas

Abstract: We show that the Euclidean distance degree of a real orthogonally invariant matrix variety equals the Euclidean distance degree of its restriction to diagonal matrices. We illustrate how this result can greatly simplify calculations in concrete circumstances.

Submitted 26 January, 2016; originally announced January 2016.

Comments: 18 pages

MSC Class: 90C26; 15A18; 14R20; 14N10

arXiv:1506.05170 [pdf, ps, other]

Variational analysis of spectral functions simplified

Authors: D. Drusvyatskiy, C. Kempton

Abstract: Spectral functions of symmetric matrices -- those depending on matrices only through their eigenvalues -- appear often in optimization. A cornerstone variational analytic tool for studying such functions is a formula relating their subdifferentials to the subdifferentials of their diagonal restrictions. This paper presents a new, short, and revealing derivation of this result. We then round off the paper with an illuminating derivation of the second derivative of twice differentiable spectral functions, highlighting the underlying geometry. All of our arguments have direct analogues for spectral functions of Hermitian matrices, and for singular value functions of rectangular matrices.

Submitted 22 July, 2015; v1 submitted 16 June, 2015; originally announced June 2015.

Comments: 17 pages

MSC Class: 49J52; 15A18; 49J53; 49R05; 58D19

Showing 1–50 of 71 results for author: Drusvyatskiy, D