-
Towards Weaker Variance Assumptions for Stochastic Optimization
Authors:
Ahmet Alacaoglu,
Yura Malitsky,
Stephen J. Wright
Abstract:
We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship t…
▽ More
We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.
△ Less
Submitted 14 April, 2025;
originally announced April 2025.
-
First-ish Order Methods: Hessian-aware Scalings of Gradient Descent
Authors:
Oscar Smee,
Fred Roosta,
Stephen J. Wright
Abstract:
Gradient descent is the primary workhorse for optimizing large-scale problems in machine learning. However, its performance is highly sensitive to the choice of the learning rate. A key limitation of gradient descent is its lack of natural scaling, which often necessitates expensive line searches or heuristic tuning to determine an appropriate step size. In this paper, we address this limitation b…
▽ More
Gradient descent is the primary workhorse for optimizing large-scale problems in machine learning. However, its performance is highly sensitive to the choice of the learning rate. A key limitation of gradient descent is its lack of natural scaling, which often necessitates expensive line searches or heuristic tuning to determine an appropriate step size. In this paper, we address this limitation by incorporating Hessian information to scale the gradient direction. By accounting for the curvature of the function along the gradient, our adaptive, Hessian-aware scaling method ensures a local unit step size guarantee, even in nonconvex settings. Near a local minimum that satisfies the second-order sufficient conditions, our approach achieves linear convergence with a unit step size. We show that our method converges globally under a significantly weaker version of the standard Lipschitz gradient smoothness assumption. Even when Hessian information is inexact, the local unit step size guarantee and global convergence properties remain valid under mild conditions. Finally, we validate our theoretical results empirically on a range of convex and nonconvex machine learning tasks, showcasing the effectiveness of the approach.
△ Less
Submitted 2 June, 2025; v1 submitted 5 February, 2025;
originally announced February 2025.
-
Optimal Rates for Robust Stochastic Convex Optimization
Authors:
Changyu Gao,
Andrew Lowy,
Xingyu Zhou,
Stephen J. Wright
Abstract:
Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the $ε$-contamination model, where an adversary can inspect and replace up to an $ε$-fraction of the samples, a fundamental open problem is determining the optimal rates for robust st…
▽ More
Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the $ε$-contamination model, where an adversary can inspect and replace up to an $ε$-fraction of the samples, a fundamental open problem is determining the optimal rates for robust stochastic convex optimization (SCO) under such contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the $ε$-contamination model. Our approach improves over existing algorithms, which are not only suboptimal but also require stringent assumptions, including Lipschitz continuity and smoothness of individual sample functions. By contrast, our optimal algorithms do not require these stringent assumptions, assuming only population-level smoothness of the loss. Moreover, our algorithms can be adapted to handle the case in which the covariance parameter is unknown, and can be extended to nonsmooth population risks via convolutional smoothing. We complement our algorithmic developments with a tight information-theoretic lower bound for robust SCO.
△ Less
Submitted 23 April, 2025; v1 submitted 14 December, 2024;
originally announced December 2024.
-
Private Heterogeneous Federated Learning Without a Trusted Server Revisited: Error-Optimal and Communication-Efficient Algorithms for Convex Losses
Authors:
Changyu Gao,
Andrew Lowy,
Xingyu Zhou,
Stephen J. Wright
Abstract:
We revisit the problem of federated learning (FL) with private data from people who do not trust the server or other silos/clients. In this context, every silo (e.g. hospital) has data from several people (e.g. patients) and needs to protect the privacy of each person's data (e.g. health records), even if the server and/or other silos try to uncover this data. Inter-Silo Record-Level Differential…
▽ More
We revisit the problem of federated learning (FL) with private data from people who do not trust the server or other silos/clients. In this context, every silo (e.g. hospital) has data from several people (e.g. patients) and needs to protect the privacy of each person's data (e.g. health records), even if the server and/or other silos try to uncover this data. Inter-Silo Record-Level Differential Privacy (ISRL-DP) prevents each silo's data from being leaked, by requiring that silo i's communications satisfy item-level differential privacy. Prior work arXiv:2106.09779 characterized the optimal excess risk bounds for ISRL-DP algorithms with homogeneous (i.i.d.) silo data and convex loss functions. However, two important questions were left open: (1) Can the same excess risk bounds be achieved with heterogeneous (non-i.i.d.) silo data? (2) Can the optimal risk bounds be achieved with fewer communication rounds? In this paper, we give positive answers to both questions. We provide novel ISRL-DP FL algorithms that achieve the optimal excess risk bounds in the presence of heterogeneous silo data. Moreover, our algorithms are more communication-efficient than the prior state-of-the-art. For smooth loss functions, our algorithm achieves the optimal excess risk bound and has communication complexity that matches the non-private lower bound. Additionally, our algorithms are more computationally efficient than the previous state-of-the-art.
△ Less
Submitted 6 September, 2024; v1 submitted 12 July, 2024;
originally announced July 2024.
-
Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix Sensing
Authors:
Shuyao Li,
Yu Cheng,
Ilias Diakonikolas,
Jelena Diakonikolas,
Rong Ge,
Stephen J. Wright
Abstract:
Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings.
In this paper, we study the problem of finding SOSPs in the strong c…
▽ More
Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings.
In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}({D^2}/ε)$ samples where $D$ is the ambient dimension and $ε$ is the fraction of corrupted datapoints.
As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.
△ Less
Submitted 11 March, 2024;
originally announced March 2024.
-
How to Make the Gradients Small Privately: Improved Rates for Differentially Private Non-Convex Optimization
Authors:
Andrew Lowy,
Jonathan Ullman,
Stephen J. Wright
Abstract:
We provide a simple and flexible framework for designing differentially private algorithms to find approximate stationary points of non-convex loss functions. Our framework is based on using a private approximate risk minimizer to "warm start" another private algorithm for finding stationary points. We use this framework to obtain improved, and sometimes optimal, rates for several classes of non-c…
▽ More
We provide a simple and flexible framework for designing differentially private algorithms to find approximate stationary points of non-convex loss functions. Our framework is based on using a private approximate risk minimizer to "warm start" another private algorithm for finding stationary points. We use this framework to obtain improved, and sometimes optimal, rates for several classes of non-convex loss functions. First, we obtain improved rates for finding stationary points of smooth non-convex empirical loss functions. Second, we specialize to quasar-convex functions, which generalize star-convex functions and arise in learning dynamical systems and training some neural nets. We achieve the optimal rate for this class. Third, we give an optimal algorithm for finding stationary points of functions satisfying the Kurdyka-Lojasiewicz (KL) condition. For example, over-parameterized neural networks often satisfy this condition. Fourth, we provide new state-of-the-art rates for stationary points of non-convex population loss functions. Fifth, we obtain improved rates for non-convex generalized linear models. A modification of our algorithm achieves nearly the same rates for second-order stationary points of functions with Lipschitz Hessian, improving over the previous state-of-the-art for each of the above problems.
△ Less
Submitted 19 August, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Revisiting Inexact Fixed-Point Iterations for Min-Max Problems: Stochasticity and Structured Nonconvexity
Authors:
Ahmet Alacaoglu,
Donghwan Kim,
Stephen J. Wright
Abstract:
We focus on constrained, $L$-smooth, potentially stochastic and nonconvex-nonconcave min-max problems either satisfying $ρ$-cohypomonotonicity or admitting a solution to the $ρ$-weakly Minty Variational Inequality (MVI), where larger values of the parameter $ρ>0$ correspond to a greater degree of nonconvexity. These problem classes include examples in two player reinforcement learning, interaction…
▽ More
We focus on constrained, $L$-smooth, potentially stochastic and nonconvex-nonconcave min-max problems either satisfying $ρ$-cohypomonotonicity or admitting a solution to the $ρ$-weakly Minty Variational Inequality (MVI), where larger values of the parameter $ρ>0$ correspond to a greater degree of nonconvexity. These problem classes include examples in two player reinforcement learning, interaction dominant min-max problems, and certain synthetic test problems on which classical min-max algorithms fail. It has been conjectured that first-order methods can tolerate a value of $ρ$ no larger than $\frac{1}{L}$, but existing results in the literature have stagnated at the tighter requirement $ρ< \frac{1}{2L}$. With a simple argument, we obtain optimal or best-known complexity guarantees with cohypomonotonicity or weak MVI conditions for $ρ< \frac{1}{L}$. First main insight for the improvements in the convergence analyses is to harness the recently proposed $\textit{conic nonexpansiveness}$ property of operators. Second, we provide a refined analysis for inexact Halpern iteration that relaxes the required inexactness level to improve some state-of-the-art complexity results even for constrained stochastic convex-concave min-max problems. Third, we analyze a stochastic inexact Krasnosel'skiĭ-Mann iteration with a multilevel Monte Carlo estimator when the assumptions only hold with respect to a solution.
△ Less
Submitted 12 August, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Complexity of Single Loop Algorithms for Nonlinear Programming with Stochastic Objective and Constraints
Authors:
Ahmet Alacaoglu,
Stephen J. Wright
Abstract:
We analyze the complexity of single-loop quadratic penalty and augmented Lagrangian algorithms for solving nonconvex optimization problems with functional equality constraints. We consider three cases, in all of which the objective is stochastic and smooth, that is, an expectation over an unknown distribution that is accessed by sampling. The nature of the equality constraints differs among the th…
▽ More
We analyze the complexity of single-loop quadratic penalty and augmented Lagrangian algorithms for solving nonconvex optimization problems with functional equality constraints. We consider three cases, in all of which the objective is stochastic and smooth, that is, an expectation over an unknown distribution that is accessed by sampling. The nature of the equality constraints differs among the three cases: deterministic and linear in the first case, deterministic, smooth and nonlinear in the second case, and stochastic, smooth and nonlinear in the third case. Variance reduction techniques are used to improve the complexity. To find a point that satisfies $\varepsilon$-approximate first-order conditions, we require $\widetilde{O}(\varepsilon^{-3})$ complexity in the first case, $\widetilde{O}(\varepsilon^{-4})$ in the second case, and $\widetilde{O}(\varepsilon^{-5})$ in the third case. For the first and third cases, they are the first algorithms of "single loop" type (that also use $O(1)$ samples at each iteration) that still achieve the best-known complexity guarantees.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
A randomized algorithm for nonconvex minimization with inexact evaluations and complexity guarantees
Authors:
Shuyao Li,
Stephen J. Wright
Abstract:
We consider minimization of a smooth nonconvex function with inexact oracle access to gradient and Hessian (without assuming access to the function value) to achieve approximate second-order optimality. A novel feature of our method is that if an approximate direction of negative curvature is chosen as the step, we choose its sense to be positive or negative with equal probability. We allow gradie…
▽ More
We consider minimization of a smooth nonconvex function with inexact oracle access to gradient and Hessian (without assuming access to the function value) to achieve approximate second-order optimality. A novel feature of our method is that if an approximate direction of negative curvature is chosen as the step, we choose its sense to be positive or negative with equal probability. We allow gradients to be inexact in a relative sense and relax the coupling between inexactness thresholds for the first- and second-order optimality conditions. Our convergence analysis includes both an expectation bound based on martingale analysis and a high-probability bound based on concentration inequalities. We apply our algorithm to empirical risk minimization problems and obtain improved gradient sample complexity over existing works.
△ Less
Submitted 26 March, 2024; v1 submitted 28 October, 2023;
originally announced October 2023.
-
Accelerating optimization over the space of probability measures
Authors:
Shi Chen,
Qin Li,
Oliver Tse,
Stephen J. Wright
Abstract:
The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this contex…
▽ More
The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this context too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.
△ Less
Submitted 10 November, 2024; v1 submitted 6 October, 2023;
originally announced October 2023.
-
Correcting auto-differentiation in neural-ODE training
Authors:
Yewei Xu,
Shi Chen,
Qin Li,
Stephen J. Wright
Abstract:
Does the use of auto-differentiation yield reasonable updates to deep neural networks that represent neural ODEs? Through mathematical analysis and numerical evidence, we find that when the neural network employs high-order forms to approximate the underlying ODE flows (such as the Linear Multistep Method (LMM)), brute-force computation using auto-differentiation often produces non-converging arti…
▽ More
Does the use of auto-differentiation yield reasonable updates to deep neural networks that represent neural ODEs? Through mathematical analysis and numerical evidence, we find that when the neural network employs high-order forms to approximate the underlying ODE flows (such as the Linear Multistep Method (LMM)), brute-force computation using auto-differentiation often produces non-converging artificial oscillations. In the case of Leapfrog, we propose a straightforward post-processing technique that effectively eliminates these oscillations, rectifies the gradient computation and thus respects the updates of the underlying flow.
△ Less
Submitted 3 June, 2023;
originally announced June 2023.
-
Differentially Private Optimization for Smooth Nonconvex ERM
Authors:
Changyu Gao,
Stephen J. Wright
Abstract:
We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.
We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.
△ Less
Submitted 9 June, 2023; v1 submitted 9 February, 2023;
originally announced February 2023.
-
Multi-output multilevel best linear unbiased estimators via semidefinite programming
Authors:
M. Croci,
K. E. Willcox,
S. J. Wright
Abstract:
Multifidelity forward uncertainty quantification (UQ) problems often involve multiple quantities of interest and heterogeneous models (e.g., different grids, equations, dimensions, physics, surrogate and reduced-order models). While computational efficiency is key in this context, multi-output strategies in multilevel/multifidelity methods are either sub-optimal or non-existent. In this paper we e…
▽ More
Multifidelity forward uncertainty quantification (UQ) problems often involve multiple quantities of interest and heterogeneous models (e.g., different grids, equations, dimensions, physics, surrogate and reduced-order models). While computational efficiency is key in this context, multi-output strategies in multilevel/multifidelity methods are either sub-optimal or non-existent. In this paper we extend multilevel best linear unbiased estimators (MLBLUE) to multi-output forward UQ problems and we present new semidefinite programming formulations for their optimal setup. Not only do these formulations yield the optimal number of samples required, but also the optimal selection of low-fidelity models to use. While existing MLBLUE approaches are single-output only and require a non-trivial nonlinear optimization procedure, the new multi-output formulations can be solved reliably and efficiently. We demonstrate the efficacy of the new methods and formulations in practical UQ problems with model heterogeneity.
△ Less
Submitted 15 May, 2023; v1 submitted 18 January, 2023;
originally announced January 2023.
-
Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization
Authors:
Xufeng Cai,
Chaobing Song,
Stephen J. Wright,
Jelena Diakonikolas
Abstract:
Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by a recent prog…
▽ More
Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by a recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t.~a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees -- variance-reduced or not -- for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.
△ Less
Submitted 27 January, 2023; v1 submitted 9 December, 2022;
originally announced December 2022.
-
On the Complexity of a Practical Primal-Dual Coordinate Method
Authors:
Ahmet Alacaoglu,
Volkan Cevher,
Stephen J. Wright
Abstract:
We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problem…
▽ More
We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problems.
△ Less
Submitted 19 January, 2022;
originally announced January 2022.
-
Coordinate Linear Variance Reduction for Generalized Linear Programming
Authors:
Chaobing Song,
Cheuk Yin Lin,
Stephen J. Wright,
Jelena Diakonikolas
Abstract:
We study a class of generalized linear programs (GLP) in a large-scale setting, which includes simple, possibly nonsmooth convex regularizer and simple convex set constraints. By reformulating (GLP) as an equivalent convex-concave min-max problem, we show that the linear structure in the problem can be used to design an efficient, scalable first-order algorithm, to which we give the name \emph{Coo…
▽ More
We study a class of generalized linear programs (GLP) in a large-scale setting, which includes simple, possibly nonsmooth convex regularizer and simple convex set constraints. By reformulating (GLP) as an equivalent convex-concave min-max problem, we show that the linear structure in the problem can be used to design an efficient, scalable first-order algorithm, to which we give the name \emph{Coordinate Linear Variance Reduction} (\textsc{clvr}; pronounced "clever"). \textsc{clvr} yields improved complexity results for (GLP) that depend on the max row norm of the linear constraint matrix in (GLP) rather than the spectral norm. When the regularization terms and constraints are separable, \textsc{clvr} admits an efficient lazy update strategy that makes its complexity bounds scale with the number of nonzero elements of the linear constraint matrix in (GLP) rather than the matrix dimensions. On the other hand, for the special case of linear programs, by exploiting sharpness, we propose a restart scheme for \textsc{clvr} to obtain empirical linear convergence. Then we show that Distributionally Robust Optimization (DRO) problems with ambiguity sets based on both $f$-divergence and Wasserstein metrics can be reformulated as (GLPs) by introducing sparsely connected auxiliary variables. We complement our theoretical guarantees with numerical experiments that verify our algorithm's practical effectiveness, in terms of wall-clock time and number of data passes.
△ Less
Submitted 6 April, 2023; v1 submitted 2 November, 2021;
originally announced November 2021.
-
Randomized Algorithms for Scientific Computing (RASC)
Authors:
Aydin Buluc,
Tamara G. Kolda,
Stefan M. Wild,
Mihai Anitescu,
Anthony DeGennaro,
John Jakeman,
Chandrika Kamath,
Ramakrishnan Kannan,
Miles E. Lopes,
Per-Gunnar Martinsson,
Kary Myers,
Jelani Nelson,
Juan M. Restrepo,
C. Seshadhri,
Draguna Vrabie,
Brendt Wohlberg,
Stephen J. Wright,
Chao Yang,
Peter Zwart
Abstract:
Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and sc…
▽ More
Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.
△ Less
Submitted 21 March, 2022; v1 submitted 19 April, 2021;
originally announced April 2021.
-
Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums
Authors:
Chaobing Song,
Stephen J. Wright,
Jelena Diakonikolas
Abstract:
We study structured nonsmooth convex finite-sum optimization that appears widely in machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called \emph{Variance Reduction via Primal-Dual Accelerated Dual Averaging (\vrpda)}. In the nonsmooth and general convex setting, \vrpda~has t…
▽ More
We study structured nonsmooth convex finite-sum optimization that appears widely in machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called \emph{Variance Reduction via Primal-Dual Accelerated Dual Averaging (\vrpda)}. In the nonsmooth and general convex setting, \vrpda~has the overall complexity $O(nd\log\min \{1/ε, n\} + d/ε)$ in terms of the primal-dual gap, where $n$ denotes the number of samples, $d$ the dimension of the primal variables, and $ε$ the desired accuracy. In the nonsmooth and strongly convex setting, the overall complexity of \vrpda~becomes $O(nd\log\min\{1/ε, n\} + d/\sqrtε)$ in terms of both the primal-dual gap and the distance between iterate and optimal solution. Both these results for \vrpda~improve significantly on state-of-the-art complexity estimates, which are $O(nd\log \min\{1/ε, n\} + \sqrt{n}d/ε)$ for the nonsmooth and general convex setting and $O(nd\log \min\{1/ε, n\} + \sqrt{n}d/\sqrtε)$ for the nonsmooth and strongly convex setting, in a much more simple and straightforward way. Moreover, both complexities are better than \emph{lower} bounds for general convex finite sums that lack the particular (common) structure that we consider. Our theoretical results are supported by numerical experiments, which confirm the competitive performance of \vrpda~compared to state-of-the-art.
△ Less
Submitted 7 April, 2021; v1 submitted 26 February, 2021;
originally announced February 2021.
-
Random Coordinate Underdamped Langevin Monte Carlo
Authors:
Zhiyan Ding,
Qin Li,
Jianfeng Lu,
Stephen J. Wright
Abstract:
The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the…
▽ More
The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the other coordinates untouched. We investigate the computational complexity of RC-ULMC and compare it with the classical ULMC for strongly log-concave probability distributions. We show that RC-ULMC is always cheaper than the classical ULMC, with a significant cost reduction when the problem is highly skewed and high dimensional. Our complexity bound for RC-ULMC is also tight in terms of dimension dependence.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
Random Coordinate Langevin Monte Carlo
Authors:
Zhiyan Ding,
Qin Li,
Jianfeng Lu,
Stephen J. Wright
Abstract:
Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected to be updated by a multiple of the part…
▽ More
Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected to be updated by a multiple of the partial derivative along this direction plus noise, and all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. When the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC if the log-density is highly skewed for high dimensional problems, and when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, our estimate of complexity is sharp with respect to the dimension.
△ Less
Submitted 3 October, 2020;
originally announced October 2020.
-
Adversarial Classification via Distributional Robustness with Wasserstein Ambiguity
Authors:
Nam Ho-Nguyen,
Stephen J. Wright
Abstract:
We study a model for adversarial classification based on distributionally robust chance constraints. We show that under Wasserstein ambiguity, the model aims to minimize the conditional value-at-risk of the distance to misclassification, and we explore links to adversarial classification models proposed earlier and to maximum-margin classifiers. We also provide a reformulation of the distributiona…
▽ More
We study a model for adversarial classification based on distributionally robust chance constraints. We show that under Wasserstein ambiguity, the model aims to minimize the conditional value-at-risk of the distance to misclassification, and we explore links to adversarial classification models proposed earlier and to maximum-margin classifiers. We also provide a reformulation of the distributionally robust model for linear classification, and show it is equivalent to minimizing a regularized ramp loss objective. Numerical experiments show that, despite the nonconvexity of this formulation, standard descent methods appear to converge to the global minimizer for this problem. Inspired by this observation, we show that, for a certain class of distributions, the only stationary point of the regularized ramp loss minimization problem is the global minimizer.
△ Less
Submitted 3 November, 2021; v1 submitted 28 May, 2020;
originally announced May 2020.
-
Interleaved Composite Quantization for High-Dimensional Similarity Search
Authors:
Soroosh Khoram,
Stephen J Wright,
Jing Li
Abstract:
Similarity search retrieves the nearest neighbors of a query vector from a dataset of high-dimensional vectors. As the size of the dataset grows, the cost of performing the distance computations needed to implement a query can become prohibitive. A method often used to reduce this computational cost is quantization of the vector space and location-based encoding of the dataset vectors. These encod…
▽ More
Similarity search retrieves the nearest neighbors of a query vector from a dataset of high-dimensional vectors. As the size of the dataset grows, the cost of performing the distance computations needed to implement a query can become prohibitive. A method often used to reduce this computational cost is quantization of the vector space and location-based encoding of the dataset vectors. These encodings can be used during query processing to find approximate nearest neighbors of the query point quickly. Search speed can be improved by using shorter codes, but shorter codes have higher quantization error, leading to degraded precision. In this work, we propose the Interleaved Composite Quantization (ICQ) which achieves fast similarity search without using shorter codes. In ICQ, a small subset of the code is used to approximate the distances, with complete codes being used only when necessary. Our method effectively reduces both code length and quantization error. Furthermore, ICQ is compatible with several recently proposed techniques for reducing quantization error and can be used in conjunction with these other techniques to improve results. We confirm these claims and show strong empirical performance of ICQ using several synthetic and real-word datasets.
△ Less
Submitted 18 December, 2019; v1 submitted 18 December, 2019;
originally announced December 2019.
-
A Distributed Quasi-Newton Algorithm for Primal and Dual Regularized Empirical Risk Minimization
Authors:
Ching-pei Lee,
Cong Han Lim,
Stephen J. Wright
Abstract:
We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving empirical risk minimization (ERM) problems with a nonsmooth regularization term. Our algorithm is applicable to both the primal and the dual ERM problem. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting…
▽ More
We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving empirical risk minimization (ERM) problems with a nonsmooth regularization term. Our algorithm is applicable to both the primal and the dual ERM problem. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations of the smooth part, and we describe how to maintain an approximation of the (generalized) Hessian and solve subproblems efficiently in a distributed manner. When applied to the distributed dual ERM problem, unlike state of the art that takes only the block-diagonal part of the Hessian, our approach is able to utilize global curvature information and is thus magnitudes faster. The proposed method enjoys global linear convergence for a broad range of non-strongly convex problems that includes the most commonly used ERMs, thus requiring lower communication complexity. It also converges on non-convex problems, so has the potential to be used on applications such as deep learning. Computational results demonstrate that our method significantly improves on communication cost and running time over the current state-of-the-art methods.
△ Less
Submitted 12 December, 2019;
originally announced December 2019.
-
A Distributed Quasi-Newton Algorithm for Empirical Risk Minimization with Nonsmooth Regularization
Authors:
Ching-pei Lee,
Cong Han Lim,
Stephen J. Wright
Abstract:
We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving ERM problems with a nonsmooth regularization term. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations, and we…
▽ More
We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving ERM problems with a nonsmooth regularization term. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations, and we describe how to maintain an approximation of the Hessian and solve subproblems efficiently in a distributed manner. The proposed method enjoys global linear convergence for a broad range of non-strongly convex problems that includes the most commonly used ERMs, thus requiring lower communication complexity. It also converges on non-convex problems, so has the potential to be used on applications such as deep learning. Initial computational results on convex problems demonstrate that our method significantly improves on communication cost and running time over the current state-of-the-art methods.
△ Less
Submitted 26 May, 2018; v1 submitted 4 March, 2018;
originally announced March 2018.
-
Training Set Debugging Using Trusted Items
Authors:
Xuezhou Zhang,
Xiaojin Zhu,
Stephen J. Wright
Abstract:
Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for man- ual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by itself be adequate for learning, so we propose an algorithm that uses these items to identify bugs in the training set and thus im- proves learning. Specifical…
▽ More
Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for man- ual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by itself be adequate for learning, so we propose an algorithm that uses these items to identify bugs in the training set and thus im- proves learning. Specifically, our approach seeks the smallest set of changes to the training set labels such that the model learned from this corrected training set predicts labels of the trusted items correctly. We flag the items whose labels are changed as potential bugs, whose labels can be checked for veracity by human experts. To find the bugs in this way is a challenging combinatorial bilevel optimization problem, but it can be relaxed into a continuous optimization problem. Ex- periments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appro- priate changes to the labels. Our algorithm is a step toward trustworthy machine learning.
△ Less
Submitted 24 January, 2018;
originally announced January 2018.
-
Using Neural Networks to Detect Line Outages from PMU Data
Authors:
Ching-pei Lee,
Stephen J. Wright
Abstract:
We propose an approach based on neural networks and the AC power flow equations to identify single- and double-line outages in a power grid using the information from phasor measurement unit sensors (PMUs) placed on only a subset of the buses. Rather than inferring the outage from the sensor data by inverting the physical model, our approach uses the AC model to simulate sensor responses to all ou…
▽ More
We propose an approach based on neural networks and the AC power flow equations to identify single- and double-line outages in a power grid using the information from phasor measurement unit sensors (PMUs) placed on only a subset of the buses. Rather than inferring the outage from the sensor data by inverting the physical model, our approach uses the AC model to simulate sensor responses to all outages of interest under multiple demand and seasonal conditions, and uses the resulting data to train a neural network classifier to recognize and discriminate between different outage events directly from sensor data. After training, real-time deployment of the classifier requires just a few matrix-vector products and simple vector operations. These operations can be executed much more rapidly than inversion of a model based on AC power flow, which consists of nonlinear equations and possibly integer / binary variables representing line outages, as well as the variables representing voltages and power flows. We are motivated to use neural network by its successful application to such areas as computer vision and natural language processing. Neural networks automatically find nonlinear transformations of the raw data that highlight useful features that make the classification task easier. We describe a principled way to choose sensor locations and show that accurate classification of line outages can be achieved from a restricted set of measurements, even over a wide range of demand profiles.
△ Less
Submitted 27 March, 2018; v1 submitted 16 October, 2017;
originally announced October 2017.
-
Online Algorithms for Factorization-Based Structure from Motion
Authors:
Ryan Kennedy,
Laura Balzano,
Stephen J. Wright,
Camillo J. Taylor
Abstract:
We present a family of online algorithms for real-time factorization-based structure from motion, leveraging a relationship between incremental singular value decomposition and recently proposed methods for online matrix completion. Our methods are orders of magnitude faster than previous state of the art, can handle missing data and a variable number of feature points, and are robust to noise and…
▽ More
We present a family of online algorithms for real-time factorization-based structure from motion, leveraging a relationship between incremental singular value decomposition and recently proposed methods for online matrix completion. Our methods are orders of magnitude faster than previous state of the art, can handle missing data and a variable number of feature points, and are robust to noise and sparse outliers. We demonstrate our methods on both real and synthetic sequences and show that they perform well in both online and batch settings. We also provide an implementation which is able to produce 3D models in real time using a laptop with a webcam.
△ Less
Submitted 16 July, 2016; v1 submitted 26 September, 2013;
originally announced September 2013.
-
On GROUSE and Incremental SVD
Authors:
Laura Balzano,
Stephen J. Wright
Abstract:
GROUSE (Grassmannian Rank-One Update Subspace Estimation) is an incremental algorithm for identifying a subspace of Rn from a sequence of vectors in this subspace, where only a subset of components of each vector is revealed at each iteration. Recent analysis has shown that GROUSE converges locally at an expected linear rate, under certain assumptions. GROUSE has a similar flavor to the incrementa…
▽ More
GROUSE (Grassmannian Rank-One Update Subspace Estimation) is an incremental algorithm for identifying a subspace of Rn from a sequence of vectors in this subspace, where only a subset of components of each vector is revealed at each iteration. Recent analysis has shown that GROUSE converges locally at an expected linear rate, under certain assumptions. GROUSE has a similar flavor to the incremental singular value decomposition algorithm, which updates the SVD of a matrix following addition of a single column. In this paper, we modify the incremental SVD approach to handle missing data, and demonstrate that this modified approach is equivalent to GROUSE, for a certain choice of an algorithmic parameter.
△ Less
Submitted 20 July, 2013;
originally announced July 2013.
-
Robust Dequantized Compressive Sensing
Authors:
Ji Liu,
Stephen J. Wright
Abstract:
We consider the reconstruction problem in compressed sensing in which the observations are recorded in a finite number of bits. They may thus contain quantization errors (from being rounded to the nearest representable value) and saturation errors (from being outside the range of representable values). Our formulation has an objective of weighted $\ell_2$-$\ell_1$ type, along with constraints that…
▽ More
We consider the reconstruction problem in compressed sensing in which the observations are recorded in a finite number of bits. They may thus contain quantization errors (from being rounded to the nearest representable value) and saturation errors (from being outside the range of representable values). Our formulation has an objective of weighted $\ell_2$-$\ell_1$ type, along with constraints that account explicitly for quantization and saturation errors, and is solved with an augmented Lagrangian method. We prove a consistency result for the recovered solution, stronger than those that have appeared to date in the literature, showing in particular that asymptotic consistency can be obtained without oversampling. We present extensive computational comparisons with formulations proposed previously, and variants thereof.
△ Less
Submitted 10 October, 2013; v1 submitted 3 July, 2012;
originally announced July 2012.
-
Approximate Stochastic Subgradient Estimation Training for Support Vector Machines
Authors:
Sangkyun Lee,
Stephen J. Wright
Abstract:
Subgradient algorithms for training support vector machines have been quite successful for solving large-scale and online learning problems. However, they have been restricted to linear kernels and strongly convex formulations. This paper describes efficient subgradient approaches without such limitations. Our approaches make use of randomized low-dimensional approximations to nonlinear kernels, a…
▽ More
Subgradient algorithms for training support vector machines have been quite successful for solving large-scale and online learning problems. However, they have been restricted to linear kernels and strongly convex formulations. This paper describes efficient subgradient approaches without such limitations. Our approaches make use of randomized low-dimensional approximations to nonlinear kernels, and minimization of a reduced primal formulation using an algorithm based on robust stochastic approximation, which do not require strong convexity. Experiments illustrate that our approaches produce solutions of comparable prediction accuracy with the solutions acquired from existing SVM solvers, but often in much shorter time. We also suggest efficient prediction schemes that depend only on the dimension of kernel approximation, not on the number of support vectors.
△ Less
Submitted 3 November, 2011; v1 submitted 2 November, 2011;
originally announced November 2011.
-
HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
Authors:
Feng Niu,
Benjamin Recht,
Christopher Re,
Stephen J. Wright
Abstract:
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be impleme…
▽ More
Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
△ Less
Submitted 11 November, 2011; v1 submitted 28 June, 2011;
originally announced June 2011.
-
Convex Approaches to Model Wavelet Sparsity Patterns
Authors:
Nikhil S Rao,
Robert D. Nowak,
Stephen J. Wright,
Nick G. Kingsbury
Abstract:
Statistical dependencies among wavelet coefficients are commonly represented by graphical models such as hidden Markov trees(HMTs). However, in linear inverse problems such as deconvolution, tomography, and compressed sensing, the presence of a sensing or observation matrix produces a linear mixing of the simple Markovian dependency structure. This leads to reconstruction problems that are non-con…
▽ More
Statistical dependencies among wavelet coefficients are commonly represented by graphical models such as hidden Markov trees(HMTs). However, in linear inverse problems such as deconvolution, tomography, and compressed sensing, the presence of a sensing or observation matrix produces a linear mixing of the simple Markovian dependency structure. This leads to reconstruction problems that are non-convex optimizations. Past work has dealt with this issue by resorting to greedy or suboptimal iterative reconstruction methods. In this paper, we propose new modeling approaches based on group-sparsity penalties that leads to convex optimizations that can be solved exactly and efficiently. We show that the methods we develop perform significantly better in deconvolution and compressed sensing applications, while being as computationally efficient as standard coefficient-wise approaches such as lasso.
△ Less
Submitted 22 April, 2011;
originally announced April 2011.