-
Iteratively reweighted kernel machines efficiently learn sparse functions
Authors:
Libin Zhu,
Damek Davis,
Dmitriy Drusvyatskiy,
Maryam Fazel
Abstract:
The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks, and can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials with finite leap complexity. Numerical experiments illustrate the developed theory.
Submitted 13 May, 2025;
originally announced May 2025.
-
Online Covariance Estimation in Nonsmooth Stochastic Approximation
Authors:
Liwei Jiang,
Abhishek Roy,
Krishna Balasubramanian,
Damek Davis,
Dmitriy Drusvyatskiy,
Sen Na
Abstract:
We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.
Submitted 7 February, 2025;
originally announced February 2025.
-
Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Liwei Jiang
Abstract:
A prevalent belief among optimization specialists is that linear convergence of gradient descent is contingent on the function growing quadratically away from its minimizers. In this work, we argue that this belief is inaccurate. We show that gradient descent with an adaptive stepsize converges at a local (nearly) linear rate on any smooth function that merely exhibits fourth-order growth away from its minimizer. The adaptive stepsize we propose arises from an intriguing decomposition theorem: any such function admits a smooth manifold around the optimal solution -- which we call the ravine -- so that the function grows at least quadratically away from the ravine and has constant order growth along it. The ravine allows one to interlace many short gradient steps with a single long Polyak gradient step, which together ensure rapid convergence to the minimizer. We illustrate the theory and algorithm on the problems of matrix sensing and factorization and learning a single neuron in the overparameterized regime.
Submitted 29 September, 2024;
originally announced September 2024.
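The Polyak gradient step mentioned in the abstract above can be illustrated on a one-dimensional toy problem. The sketch below is not the authors' algorithm (it omits the ravine and the interlaced short gradient steps); it only assumes the classical Polyak stepsize $(f(x) - f^*)/\|\nabla f(x)\|^2$ with a known minimal value $f^*$. The function name `polyak_gd` and the test function $f(x) = x^4$ are illustrative choices. On $f(x) = x^4$, which has fourth-order but not quadratic growth, each Polyak step maps $x$ to $\tfrac{3}{4}x$, so the iterates contract linearly.

```python
def polyak_gd(f, grad, x0, f_min, iters):
    """Gradient descent with the Polyak stepsize (f(x) - f_min) / grad(x)**2.

    Assumes the minimal value f_min is known; scalar case only.
    """
    x = x0
    for _ in range(iters):
        g = grad(x)
        if g == 0.0:  # already stationary, nothing to do
            break
        x -= (f(x) - f_min) / (g * g) * g
    return x


# Toy example with fourth-order growth: f(x) = x^4, minimizer 0, f_min = 0.
# One Polyak step gives x - x^4 / (16 x^6) * 4 x^3 = 0.75 * x.
x_one_step = polyak_gd(lambda x: x**4, lambda x: 4 * x**3, x0=1.0, f_min=0.0, iters=1)
x_polyak = polyak_gd(lambda x: x**4, lambda x: 4 * x**3, x0=1.0, f_min=0.0, iters=50)
```

In higher dimensions the picture is more delicate, which is where the decomposition into a ravine and its normal directions described in the abstract enters.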
-
Linear Recursive Feature Machines provably recover low-rank matrices
Authors:
Adityanarayanan Radhakrishnan,
Mikhail Belkin,
Dmitriy Drusvyatskiy
Abstract:
A fundamental problem in machine learning is to understand how neural networks make accurate predictions, while seemingly bypassing the curse of dimensionality. A possible explanation is that common training algorithms for neural networks implicitly perform dimensionality reduction - a process called feature learning. Recent work posited that the effects of feature learning can be elicited from a classical statistical estimator called the average gradient outer product (AGOP). The authors proposed Recursive Feature Machines (RFMs) as an algorithm that explicitly performs feature learning by alternating between (1) reweighting the feature vectors by the AGOP and (2) learning the prediction function in the transformed space. In this work, we develop the first theoretical guarantees for how RFM performs dimensionality reduction by focusing on the class of overparametrized problems arising in sparse linear regression and low-rank matrix recovery. Specifically, we show that RFM restricted to linear models (lin-RFM) generalizes the well-studied Iteratively Reweighted Least Squares (IRLS) algorithm. Our results shed light on the connection between feature learning in neural networks and classical sparse recovery algorithms. In addition, we provide an implementation of lin-RFM that scales to matrices with millions of missing entries. Our implementation is faster than the standard IRLS algorithm as it is SVD-free. It also outperforms deep linear networks for sparse linear regression and low-rank matrix completion.
Submitted 9 January, 2024;
originally announced January 2024.
-
Aiming towards the minimizers: fast convergence of SGD for overparametrized problems
Authors:
Chaoyue Liu,
Dmitriy Drusvyatskiy,
Mikhail Belkin,
Damek Davis,
Yi-An Ma
Abstract:
Modern machine learning paradigms, such as deep learning, occur in or close to the interpolation regime, wherein the number of model parameters is much larger than the number of data samples. In this work, we propose a regularity condition within the interpolation regime which endows the stochastic gradient method with the same worst-case iteration complexity as the deterministic gradient method, while using only a single sampled gradient (or a minibatch) in each iteration. In contrast, all existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence. Finally, we demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
Submitted 5 June, 2023;
originally announced June 2023.
-
Stochastic Approximation with Decision-Dependent Distributions: Asymptotic Normality and Optimality
Authors:
Joshua Cutler,
Mateo Díaz,
Dmitriy Drusvyatskiy
Abstract:
We analyze a stochastic approximation algorithm for decision-dependent problems, wherein the data distribution used by the algorithm evolves along the iterate sequence. The primary examples of such problems appear in performative prediction and its multiplayer extensions. We show that under mild assumptions, the deviation between the average iterate of the algorithm and the solution is asymptotically normal, with a covariance that clearly decouples the effects of the gradient noise and the distributional shift. Moreover, building on the work of Hájek and Le Cam, we show that the asymptotic performance of the algorithm with averaging is locally minimax optimal.
Submitted 13 March, 2024; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Decision-Dependent Risk Minimization in Geometrically Decaying Dynamic Environments
Authors:
Mitas Ray,
Dmitriy Drusvyatskiy,
Maryam Fazel,
Lillian J. Ratliff
Abstract:
This paper studies the problem of expected loss minimization given a data distribution that is dependent on the decision-maker's action and evolves dynamically in time according to a geometric decay process. Novel algorithms for both the information setting in which the decision-maker has a first order gradient oracle and the setting in which they have simply a loss function oracle are introduced. The algorithms operate on the same underlying principle: the decision-maker repeatedly deploys a fixed decision over the length of an epoch, thereby allowing the dynamically changing environment to sufficiently mix before updating the decision. The iteration complexity in each of the settings is shown to match existing rates for first and zero order stochastic gradient methods up to logarithmic factors. The algorithms are evaluated on a "semi-synthetic" example using real world data from the SFpark dynamic pricing pilot study; it is shown that the announced prices result in an improvement for the institution's objective (target occupancy), while achieving an overall reduction in parking rates.
Submitted 8 April, 2022;
originally announced April 2022.
-
Flat minima generalize for low-rank matrix recovery
Authors:
Lijun Ding,
Dmitriy Drusvyatskiy,
Maryam Fazel,
Zaid Harchaoui
Abstract:
Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We conclude with synthetic experiments that illustrate our findings and discuss the effect of depth on flat solutions.
Submitted 17 February, 2023; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Multiplayer Performative Prediction: Learning in Decision-Dependent Games
Authors:
Adhyyan Narang,
Evan Faulkner,
Dmitriy Drusvyatskiy,
Maryam Fazel,
Lillian J. Ratliff
Abstract:
Learning problems commonly exhibit an interesting feedback mechanism wherein the population data reacts to competing decision makers' actions. This paper formulates a new game theoretic framework for this phenomenon, called "multi-player performative prediction". We focus on two distinct solution concepts, namely (i) performatively stable equilibria and (ii) Nash equilibria of the game. The latter equilibria are arguably more informative, but can be found efficiently only when the game is monotone. We show that under mild assumptions, the performatively stable equilibria can be found efficiently by a variety of algorithms, including repeated retraining and the repeated (stochastic) gradient method. We then establish transparent sufficient conditions for strong monotonicity of the game and use them to develop algorithms for finding Nash equilibria. We investigate derivative free methods and adaptive gradient algorithms wherein each player alternates between learning a parametric description of their distribution and gradient steps on the empirical risk. Synthetic and semi-synthetic numerical experiments illustrate the results.
Submitted 6 April, 2022; v1 submitted 10 January, 2022;
originally announced January 2022.
-
Improved Rates for Derivative Free Gradient Play in Strongly Monotone Games
Authors:
Dmitriy Drusvyatskiy,
Maryam Fazel,
Lillian J Ratliff
Abstract:
The influential work of Bravo et al. 2018 shows that derivative free play in strongly monotone games has complexity $O(d^2/\varepsilon^3)$, where $\varepsilon$ is the target accuracy on the expected squared distance to the solution. This note shows that the efficiency estimate is actually $O(d^2/\varepsilon^2)$, which reduces to the known efficiency guarantee for the method in unconstrained optimization. The argument we present simply interprets the method as stochastic gradient play on a slightly perturbed strongly monotone game.
Submitted 6 April, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Active manifolds, stratifications, and convergence to local minima in nonsmooth optimization
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Liwei Jiang
Abstract:
We show that the subgradient method converges only to local minimizers when applied to generic Lipschitz continuous and subdifferentially regular functions that are definable in an o-minimal structure. At a high level, the argument we present is appealingly transparent: we interpret the nonsmooth dynamics as an approximate Riemannian gradient method on a certain distinguished submanifold that captures the nonsmooth activity of the function. In the process, we develop new regularity conditions in nonsmooth analysis that parallel the stratification conditions of Whitney, Kuo, and Verdier and extend stochastic processes techniques of Pemantle.
Submitted 9 January, 2023; v1 submitted 26 August, 2021;
originally announced August 2021.
-
Stochastic Optimization under Distributional Drift
Authors:
Joshua Cutler,
Dmitriy Drusvyatskiy,
Zaid Harchaoui
Abstract:
We consider the problem of minimizing a convex function that is evolving according to unknown and possibly stochastic dynamics, which may depend jointly on time and on the decision variable itself. Such problems abound in the machine learning and signal processing literature, under the names of concept drift, stochastic tracking, and performative prediction. We provide novel non-asymptotic convergence guarantees for stochastic algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability. The efficiency estimates we obtain clearly decouple the contributions of optimization error, gradient noise, and time drift. Notably, we identify a low drift-to-noise regime in which the tracking efficiency of the proximal stochastic gradient method benefits significantly from a step decay schedule. Numerical experiments illustrate our results.
Submitted 26 May, 2023; v1 submitted 16 August, 2021;
originally announced August 2021.
-
Escaping strict saddle points of the Moreau envelope in nonsmooth optimization
Authors:
Damek Davis,
Mateo Díaz,
Dmitriy Drusvyatskiy
Abstract:
Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.
Submitted 17 June, 2021;
originally announced June 2021.
-
Proximal methods avoid active strict saddles of weakly convex functions
Authors:
Damek Davis,
Dmitriy Drusvyatskiy
Abstract:
We introduce a geometrically transparent strict saddle property for nonsmooth functions. This property guarantees that simple proximal algorithms on weakly convex problems converge only to local minimizers, when randomly initialized. We argue that the strict saddle property may be a realistic assumption in applications, since it provably holds for generic semi-algebraic optimization problems.
Submitted 16 February, 2021; v1 submitted 15 December, 2019;
originally announced December 2019.
-
From low probability to high confidence in stochastic convex optimization
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Lin Xiao,
Junyu Zhang
Abstract:
Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on "light-tail" noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.
Submitted 16 October, 2019; v1 submitted 31 July, 2019;
originally announced July 2019.
-
Stochastic algorithms with geometric step decay converge linearly on sharp functions
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Vasileios Charisopoulos
Abstract:
Stochastic (sub)gradient methods require step size schedule tuning to perform well in practice. Classical tuning strategies decay the step size polynomially and lead to optimal sublinear rates on (strongly) convex problems. An alternative schedule, popular in nonconvex optimization, is called \emph{geometric step decay} and proceeds by halving the step size after every few epochs. In recent work, geometric step decay was shown to improve exponentially upon classical sublinear rates for the class of \emph{sharp} convex functions. In this work, we ask whether geometric step decay similarly improves stochastic algorithms for the class of sharp nonconvex problems. Such losses feature in modern statistical recovery problems and lead to a new challenge not present in the convex setting: the region of convergence is local, so one must bound the probability of escape. Our main result shows that for a large class of stochastic, sharp, nonsmooth, and nonconvex problems a geometric step decay schedule endows well-known algorithms with a local linear rate of convergence to global minimizers. This guarantee applies to the stochastic projected subgradient, proximal point, and prox-linear algorithms. As an application of our main result, we analyze two statistical recovery tasks---phase retrieval and blind deconvolution---and match the best known guarantees under Gaussian measurement models and establish new guarantees under heavy-tailed distributions.
Submitted 22 July, 2019;
originally announced July 2019.
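The geometric step decay schedule described in the abstract above (halve the step size after every few epochs) can be sketched on a toy sharp function. This is a minimal illustration under stated assumptions, not the paper's algorithm or analysis: the objective $f(x) = |x - 3|$, the bounded uniform noise model, the epoch length, and the helper names `geometric_step_decay` and `noisy_subgrad` are all hypothetical choices made for the example.

```python
import random


def geometric_step_decay(subgrad, x0, step0, epochs, iters_per_epoch):
    """Stochastic subgradient method whose step size is halved after each epoch."""
    x, step = x0, step0
    for _ in range(epochs):
        for _ in range(iters_per_epoch):
            x -= step * subgrad(x)
        step *= 0.5  # geometric decay: halve the step size
    return x


rng = random.Random(0)  # fixed seed so the run is reproducible
TARGET = 3.0


def noisy_subgrad(x):
    # Subgradient of the sharp function f(x) = |x - TARGET|, plus bounded noise.
    sign = (x > TARGET) - (x < TARGET)
    return sign + rng.uniform(-0.5, 0.5)


x_final = geometric_step_decay(
    noisy_subgrad, x0=0.0, step0=1.0, epochs=20, iters_per_epoch=50
)
```

In this toy setting the error stays within a constant multiple of the current step size once the iterate is close, so halving the step each epoch roughly halves the error, which is the linear-rate behavior the abstract refers to.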
-
Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence
Authors:
Vasileios Charisopoulos,
Yudong Chen,
Damek Davis,
Mateo Díaz,
Lijun Ding,
Dmitriy Drusvyatskiy
Abstract:
The task of recovering a low-rank matrix from its noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth penalty formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework subsumes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.
Submitted 22 April, 2019;
originally announced April 2019.
-
Composite optimization for robust blind deconvolution
Authors:
Vasileios Charisopoulos,
Damek Davis,
Mateo Díaz,
Dmitriy Drusvyatskiy
Abstract:
The blind deconvolution problem seeks to recover a pair of vectors from a set of rank one bilinear measurements. We consider a natural nonsmooth formulation of the problem and show that under standard statistical assumptions, its moduli of weak convexity, sharpness, and Lipschitz continuity are all dimension independent. This phenomenon persists even when up to half of the measurements are corrupted by noise. Consequently, standard algorithms, such as the subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. We then complete the paper with a new initialization strategy, complementing the local search algorithms. The initialization procedure is both provably efficient and robust to outlying measurements. Numerical experiments, on both simulated and real data, illustrate the developed theory and methods.
Submitted 18 January, 2019; v1 submitted 6 January, 2019;
originally announced January 2019.
-
Graphical Convergence of Subgradients in Nonconvex Optimization and Learning
Authors:
Damek Davis,
Dmitriy Drusvyatskiy
Abstract:
We investigate the stochastic optimization problem of minimizing population risk, where the loss defining the risk is assumed to be weakly convex. Compositions of Lipschitz convex functions with smooth maps are the primary examples of such losses. We analyze the estimation quality of such nonsmooth and nonconvex problems by their sample average approximations. Our main results establish dimension-dependent rates on subgradient estimation in full generality and dimension-independent rates when the loss is a generalized linear model. As an application of the developed techniques, we analyze the nonsmooth landscape of a robust nonlinear regression problem.
Submitted 17 December, 2018; v1 submitted 17 October, 2018;
originally announced October 2018.
-
Stochastic model-based minimization under high-order growth
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Kellie J. MacPhee
Abstract:
Given a nonsmooth, nonconvex minimization problem, we consider algorithms that iteratively sample and minimize stochastic convex models of the objective function. Assuming that the one-sided approximation quality and the variation of the models is controlled by a Bregman divergence, we show that the scheme drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. Under additional convexity and relative strong convexity assumptions, the function values converge to the minimum at the rate of $O(k^{-1/2})$ and $\widetilde{O}(k^{-1})$, respectively. We discuss consequences for stochastic proximal point, mirror descent, regularized Gauss-Newton, and saddle point algorithms.
Submitted 30 June, 2018;
originally announced July 2018.
-
Stochastic subgradient method converges on tame functions
Authors:
Damek Davis,
Dmitriy Drusvyatskiy,
Sham Kakade,
Jason D. Lee
Abstract:
This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a wide class of problems arising in data science---including all popular deep learning architectures.
Submitted 25 May, 2018; v1 submitted 20 April, 2018;
originally announced April 2018.
-
Stochastic model-based minimization of weakly convex functions
Authors:
Damek Davis,
Dmitriy Drusvyatskiy
Abstract:
We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate $O(k^{-1/4})$. As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions with smooth maps. The guiding principle underlying the complexity guarantees is that all algorithms under consideration can be interpreted as approximate descent methods on an implicit smoothing of the problem, given by the Moreau envelope. Specializing to classical circumstances, we obtain the long-sought convergence rate of the stochastic projected gradient method, without batching, for minimizing a smooth function on a closed convex set.
Submitted 26 August, 2018; v1 submitted 17 March, 2018;
originally announced March 2018.
-
Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions
Authors:
Damek Davis,
Dmitriy Drusvyatskiy
Abstract:
We prove that the proximal stochastic subgradient method, applied to a weakly convex problem, drives the gradient of the Moreau envelope to zero at the rate $O(k^{-1/4})$. As a consequence, we resolve an open question on the convergence rate of the proximal stochastic gradient method for minimizing the sum of a smooth nonconvex function and a convex proximable function.
Submitted 19 February, 2018; v1 submitted 8 February, 2018;
originally announced February 2018.
-
Complexity of a Single Face in an Arrangement of s-Intersecting Curves
Authors:
Boris Aronov,
Dmitriy Drusvyatskiy
Abstract:
Consider a face F in an arrangement of n Jordan curves in the plane, no two of which intersect more than s times. We prove that the combinatorial complexity of F is O(λ_s(n)), O(λ_{s+1}(n)), and O(λ_{s+2}(n)), when the curves are bi-infinite, semi-infinite, or bounded, respectively; λ_k(n) is the maximum length of a Davenport-Schinzel sequence of order k on an alphabet of n symbols.
Our bounds asymptotically match the known worst-case lower bounds. Our proof settles the case of semi-infinite curves, which was apparently still open. Moreover, it treats the three cases in a fairly uniform fashion.
Submitted 22 August, 2011;
originally announced August 2011.