Search | arXiv e-print repository

Towards Weaker Variance Assumptions for Stochastic Optimization

Authors: Ahmet Alacaoglu, Yura Malitsky, Stephen J. Wright

Abstract: We revisit a classical assumption for analyzing stochastic gradient algorithms where the squared norm of the stochastic subgradient (or the variance for smooth problems) is allowed to grow as fast as the squared norm of the optimization variable. We contextualize this assumption in view of its inception in the 1960s, its seemingly independent appearance in the recent literature, its relationship to weakest-known variance assumptions for analyzing stochastic gradient algorithms, and its relevance in deterministic problems for non-Lipschitz nonsmooth convex optimization. We build on and extend a connection recently made between this assumption and the Halpern iteration. For convex nonsmooth, and potentially stochastic, optimization, we analyze horizon-free, anytime algorithms with last-iterate rates. For problems beyond simple constrained optimization, such as convex problems with functional constraints or regularized convex-concave min-max problems, we obtain rates for optimality measures that do not require boundedness of the feasible set.

Submitted 14 April, 2025; originally announced April 2025.
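
The abstract above refers to the Halpern iteration without stating it. As a rough illustration only, the sketch below runs the textbook form x_{k+1} = b_k x_0 + (1 - b_k) T(x_k) with the common anchoring weights b_k = 1/(k+2) on a toy nonexpansive map. The map `rotate90`, the starting point, and the iteration count are assumptions chosen for the example and are not taken from the paper, whose algorithms and step rules may differ.

```python
import math


def rotate90(x):
    """Toy nonexpansive map on R^2 (rotation by 90 degrees); its only fixed point is the origin."""
    return (-x[1], x[0])


def halpern(T, x0, num_iters):
    """Run x_{k+1} = b_k * x0 + (1 - b_k) * T(x_k) with b_k = 1 / (k + 2)."""
    x = x0
    for k in range(num_iters):
        beta = 1.0 / (k + 2)  # weight pulling the iterate back toward the anchor x0
        tx = T(x)
        x = tuple(beta * a + (1.0 - beta) * b for a, b in zip(x0, tx))
    return x


def residual(T, x):
    """Fixed-point residual ||x - T(x)||."""
    return math.dist(x, T(x))


x0 = (1.0, 0.0)
x_halpern = halpern(rotate90, x0, 100)

# For comparison: the plain iteration x_{k+1} = T(x_k) only rotates the point
# around the origin and never approaches the fixed point.
x_plain = x0
for _ in range(100):
    x_plain = rotate90(x_plain)

halpern_residual = residual(rotate90, x_halpern)
plain_residual = residual(rotate90, x_plain)
print(halpern_residual, plain_residual)
```

On this toy map the anchored iterate's residual shrinks roughly like 1/k, while the unanchored iteration keeps a constant residual.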

arXiv:2502.03701 [pdf, ps, other]

First-ish Order Methods: Hessian-aware Scalings of Gradient Descent

Authors: Oscar Smee, Fred Roosta, Stephen J. Wright

Abstract: Gradient descent is the primary workhorse for optimizing large-scale problems in machine learning. However, its performance is highly sensitive to the choice of the learning rate. A key limitation of gradient descent is its lack of natural scaling, which often necessitates expensive line searches or heuristic tuning to determine an appropriate step size. In this paper, we address this limitation by incorporating Hessian information to scale the gradient direction. By accounting for the curvature of the function along the gradient, our adaptive, Hessian-aware scaling method ensures a local unit step size guarantee, even in nonconvex settings. Near a local minimum that satisfies the second-order sufficient conditions, our approach achieves linear convergence with a unit step size. We show that our method converges globally under a significantly weaker version of the standard Lipschitz gradient smoothness assumption. Even when Hessian information is inexact, the local unit step size guarantee and global convergence properties remain valid under mild conditions. Finally, we validate our theoretical results empirically on a range of convex and nonconvex machine learning tasks, showcasing the effectiveness of the approach.

Submitted 2 June, 2025; v1 submitted 5 February, 2025; originally announced February 2025.

MSC Class: 49
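
To make "accounting for the curvature of the function along the gradient" concrete, here is a minimal sketch of one natural curvature-based scaling: the step length ||g||^2 / (g^T H g), which minimizes the local quadratic model along the negative gradient. The abstract does not say which scalings the paper uses, so this particular formula, the `fallback_step` used when the curvature along the gradient is not positive, and the quadratic test problem are all assumptions made for illustration.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def curvature_scaled_gd(grad, hess, x, num_iters, fallback_step=1e-2):
    """Gradient descent whose step is scaled by the curvature along the gradient.

    Uses step = (g.g) / (g.H.g) when g.H.g > 0; otherwise falls back to a
    small fixed step (a hypothetical safeguard, not taken from the paper).
    """
    for _ in range(num_iters):
        g = grad(x)
        gg = dot(g, g)
        if gg == 0.0:
            break  # exact stationary point
        curvature = dot(g, matvec(hess(x), g))
        step = gg / curvature if curvature > 0.0 else fallback_step
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical ill-conditioned test problem: f(x) = 0.5 * x^T A x.
A = [[100.0, 0.0], [0.0, 1.0]]


def f(x):
    return 0.5 * dot(x, matvec(A, x))


x_start = [1.0, 1.0]
x_final = curvature_scaled_gd(lambda x: matvec(A, x), lambda x: A, x_start, 200)
print(f(x_start), f(x_final))
```

On a convex quadratic this reduces to steepest descent with exact line search, so no learning rate needs tuning; behaviour on nonconvex problems depends on how non-positive curvature is handled.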

arXiv:2412.11003 [pdf, other]

Optimal Rates for Robust Stochastic Convex Optimization

Authors: Changyu Gao, Andrew Lowy, Xingyu Zhou, Stephen J. Wright

Abstract: Machine learning algorithms in high-dimensional settings are highly susceptible to the influence of even a small fraction of structured outliers, making robust optimization techniques essential. In particular, within the $ε$-contamination model, where an adversary can inspect and replace up to an $ε$-fraction of the samples, a fundamental open problem is determining the optimal rates for robust stochastic convex optimization (SCO) under such contamination. We develop novel algorithms that achieve minimax-optimal excess risk (up to logarithmic factors) under the $ε$-contamination model. Our approach improves over existing algorithms, which are not only suboptimal but also require stringent assumptions, including Lipschitz continuity and smoothness of individual sample functions. By contrast, our optimal algorithms do not require these stringent assumptions, assuming only population-level smoothness of the loss. Moreover, our algorithms can be adapted to handle the case in which the covariance parameter is unknown, and can be extended to nonsmooth population risks via convolutional smoothing. We complement our algorithmic developments with a tight information-theoretic lower bound for robust SCO.

Submitted 23 April, 2025; v1 submitted 14 December, 2024; originally announced December 2024.

Comments: The 6th annual Symposium on Foundations of Responsible Computing (FORC 2025)

arXiv:2407.09690 [pdf, other]

Private Heterogeneous Federated Learning Without a Trusted Server Revisited: Error-Optimal and Communication-Efficient Algorithms for Convex Losses

Authors: Changyu Gao, Andrew Lowy, Xingyu Zhou, Stephen J. Wright

Abstract: We revisit the problem of federated learning (FL) with private data from people who do not trust the server or other silos/clients. In this context, every silo (e.g. hospital) has data from several people (e.g. patients) and needs to protect the privacy of each person's data (e.g. health records), even if the server and/or other silos try to uncover this data. Inter-Silo Record-Level Differential Privacy (ISRL-DP) prevents each silo's data from being leaked, by requiring that silo i's communications satisfy item-level differential privacy. Prior work arXiv:2106.09779 characterized the optimal excess risk bounds for ISRL-DP algorithms with homogeneous (i.i.d.) silo data and convex loss functions. However, two important questions were left open: (1) Can the same excess risk bounds be achieved with heterogeneous (non-i.i.d.) silo data? (2) Can the optimal risk bounds be achieved with fewer communication rounds? In this paper, we give positive answers to both questions. We provide novel ISRL-DP FL algorithms that achieve the optimal excess risk bounds in the presence of heterogeneous silo data. Moreover, our algorithms are more communication-efficient than the prior state-of-the-art. For smooth loss functions, our algorithm achieves the optimal excess risk bound and has communication complexity that matches the non-private lower bound. Additionally, our algorithms are more computationally efficient than the previous state-of-the-art.

Submitted 6 September, 2024; v1 submitted 12 July, 2024; originally announced July 2024.

Comments: The 41st International Conference on Machine Learning (ICML 2024)

arXiv:2403.10547 [pdf, ps, other]

Robust Second-Order Nonconvex Optimization and Its Application to Low Rank Matrix Sensing

Authors: Shuyao Li, Yu Cheng, Ilias Diakonikolas, Jelena Diakonikolas, Rong Ge, Stephen J. Wright

Abstract: Finding an approximate second-order stationary point (SOSP) is a well-studied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with \emph{dimension-independent} accuracy guarantees, using $\widetilde{O}({D^2}/ε)$ samples where $D$ is the ambient dimension and $ε$ is the fraction of corrupted datapoints. As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.

Submitted 11 March, 2024; originally announced March 2024.

arXiv:2402.11173 [pdf, other]

How to Make the Gradients Small Privately: Improved Rates for Differentially Private Non-Convex Optimization

Authors: Andrew Lowy, Jonathan Ullman, Stephen J. Wright

Abstract: We provide a simple and flexible framework for designing differentially private algorithms to find approximate stationary points of non-convex loss functions. Our framework is based on using a private approximate risk minimizer to "warm start" another private algorithm for finding stationary points. We use this framework to obtain improved, and sometimes optimal, rates for several classes of non-convex loss functions. First, we obtain improved rates for finding stationary points of smooth non-convex empirical loss functions. Second, we specialize to quasar-convex functions, which generalize star-convex functions and arise in learning dynamical systems and training some neural nets. We achieve the optimal rate for this class. Third, we give an optimal algorithm for finding stationary points of functions satisfying the Kurdyka-Lojasiewicz (KL) condition. For example, over-parameterized neural networks often satisfy this condition. Fourth, we provide new state-of-the-art rates for stationary points of non-convex population loss functions. Fifth, we obtain improved rates for non-convex generalized linear models. A modification of our algorithm achieves nearly the same rates for second-order stationary points of functions with Lipschitz Hessian, improving over the previous state-of-the-art for each of the above problems.

Submitted 19 August, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

Comments: ICML 2024

arXiv:2402.05071 [pdf, other]

Revisiting Inexact Fixed-Point Iterations for Min-Max Problems: Stochasticity and Structured Nonconvexity

Authors: Ahmet Alacaoglu, Donghwan Kim, Stephen J. Wright

Abstract: We focus on constrained, $L$-smooth, potentially stochastic and nonconvex-nonconcave min-max problems either satisfying $ρ$-cohypomonotonicity or admitting a solution to the $ρ$-weakly Minty Variational Inequality (MVI), where larger values of the parameter $ρ>0$ correspond to a greater degree of nonconvexity. These problem classes include examples in two-player reinforcement learning, interaction dominant min-max problems, and certain synthetic test problems on which classical min-max algorithms fail. It has been conjectured that first-order methods can tolerate a value of $ρ$ no larger than $\frac{1}{L}$, but existing results in the literature have stagnated at the tighter requirement $ρ< \frac{1}{2L}$. With a simple argument, we obtain optimal or best-known complexity guarantees with cohypomonotonicity or weak MVI conditions for $ρ< \frac{1}{L}$. The first main insight behind the improvements in the convergence analyses is to harness the recently proposed $\textit{conic nonexpansiveness}$ property of operators. Second, we provide a refined analysis for inexact Halpern iteration that relaxes the required inexactness level to improve some state-of-the-art complexity results even for constrained stochastic convex-concave min-max problems. Third, we analyze a stochastic inexact Krasnosel'skiĭ-Mann iteration with a multilevel Monte Carlo estimator when the assumptions only hold with respect to a solution.

Submitted 12 August, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

Journal ref: Proceedings of the International Conference on Machine Learning (ICML) 2024

arXiv:2311.00678 [pdf, other]

Complexity of Single Loop Algorithms for Nonlinear Programming with Stochastic Objective and Constraints

Authors: Ahmet Alacaoglu, Stephen J. Wright

Abstract: We analyze the complexity of single-loop quadratic penalty and augmented Lagrangian algorithms for solving nonconvex optimization problems with functional equality constraints. We consider three cases, in all of which the objective is stochastic and smooth, that is, an expectation over an unknown distribution that is accessed by sampling. The nature of the equality constraints differs among the three cases: deterministic and linear in the first case, deterministic, smooth and nonlinear in the second case, and stochastic, smooth and nonlinear in the third case. Variance reduction techniques are used to improve the complexity. To find a point that satisfies $\varepsilon$-approximate first-order conditions, we require $\widetilde{O}(\varepsilon^{-3})$ complexity in the first case, $\widetilde{O}(\varepsilon^{-4})$ in the second case, and $\widetilde{O}(\varepsilon^{-5})$ in the third case. For the first and third cases, they are the first algorithms of "single loop" type (that also use $O(1)$ samples at each iteration) that still achieve the best-known complexity guarantees.

Submitted 1 November, 2023; originally announced November 2023.

arXiv:2310.18841 [pdf, ps, other]

A randomized algorithm for nonconvex minimization with inexact evaluations and complexity guarantees

Authors: Shuyao Li, Stephen J. Wright

Abstract: We consider minimization of a smooth nonconvex function with inexact oracle access to gradient and Hessian (without assuming access to the function value) to achieve approximate second-order optimality. A novel feature of our method is that if an approximate direction of negative curvature is chosen as the step, we choose its sense to be positive or negative with equal probability. We allow gradients to be inexact in a relative sense and relax the coupling between inexactness thresholds for the first- and second-order optimality conditions. Our convergence analysis includes both an expectation bound based on martingale analysis and a high-probability bound based on concentration inequalities. We apply our algorithm to empirical risk minimization problems and obtain improved gradient sample complexity over existing works.

Submitted 26 March, 2024; v1 submitted 28 October, 2023; originally announced October 2023.

arXiv:2310.04006 [pdf, other]

Accelerating optimization over the space of probability measures

Authors: Shi Chen, Qin Li, Oliver Tse, Stephen J. Wright

Abstract: The acceleration of gradient-based optimization methods is a subject of significant practical and theoretical importance, particularly within machine learning applications. While much attention has been directed towards optimizing within Euclidean space, the need to optimize over spaces of probability measures in machine learning motivates exploration of accelerated gradient methods in this context too. To this end, we introduce a Hamiltonian-flow approach analogous to momentum-based approaches in Euclidean space. We demonstrate that, in the continuous-time setting, algorithms based on this approach can achieve convergence rates of arbitrarily high order. We complement our findings with numerical examples.

Submitted 10 November, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

arXiv:2306.02192 [pdf, other]

Correcting auto-differentiation in neural-ODE training

Authors: Yewei Xu, Shi Chen, Qin Li, Stephen J. Wright

Abstract: Does the use of auto-differentiation yield reasonable updates to deep neural networks that represent neural ODEs? Through mathematical analysis and numerical evidence, we find that when the neural network employs high-order forms to approximate the underlying ODE flows (such as the Linear Multistep Method (LMM)), brute-force computation using auto-differentiation often produces non-converging artificial oscillations. In the case of Leapfrog, we propose a straightforward post-processing technique that effectively eliminates these oscillations, rectifies the gradient computation and thus respects the updates of the underlying flow.

Submitted 3 June, 2023; originally announced June 2023.

arXiv:2302.04972 [pdf, ps, other]

Differentially Private Optimization for Smooth Nonconvex ERM

Authors: Changyu Gao, Stephen J. Wright

Abstract: We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.

Submitted 9 June, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2301.07831 [pdf, other]

doi 10.1016/j.cma.2023.116130

Multi-output multilevel best linear unbiased estimators via semidefinite programming

Authors: M. Croci, K. E. Willcox, S. J. Wright

Abstract: Multifidelity forward uncertainty quantification (UQ) problems often involve multiple quantities of interest and heterogeneous models (e.g., different grids, equations, dimensions, physics, surrogate and reduced-order models). While computational efficiency is key in this context, multi-output strategies in multilevel/multifidelity methods are either sub-optimal or non-existent. In this paper we extend multilevel best linear unbiased estimators (MLBLUE) to multi-output forward UQ problems and we present new semidefinite programming formulations for their optimal setup. Not only do these formulations yield the optimal number of samples required, but also the optimal selection of low-fidelity models to use. While existing MLBLUE approaches are single-output only and require a non-trivial nonlinear optimization procedure, the new multi-output formulations can be solved reliably and efficiently. We demonstrate the efficacy of the new methods and formulations in practical UQ problems with model heterogeneity.

Submitted 15 May, 2023; v1 submitted 18 January, 2023; originally announced January 2023.

Comments: 22 pages, 5 figures, 3 tables

arXiv:2212.05088 [pdf, other]

Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization

Authors: Xufeng Cai, Chaobing Song, Stephen J. Wright, Jelena Diakonikolas

Abstract: Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t. a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees -- variance-reduced or not -- for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.

Submitted 27 January, 2023; v1 submitted 9 December, 2022; originally announced December 2022.
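
For readers unfamiliar with the cyclic update order, the following is a bare-bones sketch of deterministic cyclic block coordinate gradient descent on a smooth function: blocks are visited in a fixed order and each block takes a gradient step using the most recent values of the others. It deliberately omits the variance reduction and the nonsmooth (composite) term that the abstract describes, and the function `cyclic_block_gd`, the block partition, the step sizes, and the small convex quadratic test problem are all assumptions for illustration rather than the paper's method.

```python
def cyclic_block_gd(partial_grad, x, blocks, step_sizes, num_cycles):
    """Visit blocks in a fixed cyclic order, taking a gradient step on each.

    partial_grad(x, block) must return the gradient entries for the indices
    in `block`, evaluated at the current (already partly updated) point x.
    """
    x = list(x)
    for _ in range(num_cycles):
        for block, step in zip(blocks, step_sizes):
            g = partial_grad(x, block)
            for i, gi in zip(block, g):
                x[i] -= step * gi
    return x


# Hypothetical test problem: f(x) = 0.5 * x^T A x - b^T x, minimized at x = (1/3, 1/3).
A = [[2.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]


def partial_grad(x, block):
    """Entries of the gradient A x - b restricted to the given block."""
    return [sum(A[i][j] * x[j] for j in range(len(x))) - b[i] for i in block]


# One coordinate per block; step 1 / A[i][i] is the inverse block Lipschitz constant.
blocks = [[0], [1]]
step_sizes = [0.5, 0.5]
x_cbcd = cyclic_block_gd(partial_grad, [0.0, 0.0], blocks, step_sizes, 30)
print(x_cbcd)
```

With single-coordinate blocks and these step sizes the scheme coincides with Gauss-Seidel on the linear system A x = b, which is why it converges quickly on this toy problem.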

arXiv:2201.07684 [pdf, other]

On the Complexity of a Practical Primal-Dual Coordinate Method

Authors: Ahmet Alacaoglu, Volkan Cevher, Stephen J. Wright

Abstract: We prove complexity bounds for the primal-dual algorithm with random extrapolation and coordinate descent (PURE-CD), which has been shown to obtain good practical performance for solving convex-concave min-max problems with bilinear coupling. Our complexity bounds either match or improve the best-known results in the literature for both dense and sparse (strongly)-convex-(strongly)-concave problems.

Submitted 19 January, 2022; originally announced January 2022.

arXiv:2111.01842 [pdf, other]

Coordinate Linear Variance Reduction for Generalized Linear Programming

Authors: Chaobing Song, Cheuk Yin Lin, Stephen J. Wright, Jelena Diakonikolas

Abstract: We study a class of generalized linear programs (GLP) in a large-scale setting, which includes simple, possibly nonsmooth convex regularizer and simple convex set constraints. By reformulating (GLP) as an equivalent convex-concave min-max problem, we show that the linear structure in the problem can be used to design an efficient, scalable first-order algorithm, to which we give the name \emph{Coordinate Linear Variance Reduction} (\textsc{clvr}; pronounced "clever"). \textsc{clvr} yields improved complexity results for (GLP) that depend on the max row norm of the linear constraint matrix in (GLP) rather than the spectral norm. When the regularization terms and constraints are separable, \textsc{clvr} admits an efficient lazy update strategy that makes its complexity bounds scale with the number of nonzero elements of the linear constraint matrix in (GLP) rather than the matrix dimensions. On the other hand, for the special case of linear programs, by exploiting sharpness, we propose a restart scheme for \textsc{clvr} to obtain empirical linear convergence. Then we show that Distributionally Robust Optimization (DRO) problems with ambiguity sets based on both $f$-divergence and Wasserstein metrics can be reformulated as (GLPs) by introducing sparsely connected auxiliary variables. We complement our theoretical guarantees with numerical experiments that verify our algorithm's practical effectiveness, in terms of wall-clock time and number of data passes.

Submitted 6 April, 2023; v1 submitted 2 November, 2021; originally announced November 2021.

Comments: 39 pages, NeurIPS 2022

arXiv:2104.11079 [pdf, other]

doi 10.2172/1807223

Randomized Algorithms for Scientific Computing (RASC)

Authors: Aydin Buluc, Tamara G. Kolda, Stefan M. Wild, Mihai Anitescu, Anthony DeGennaro, John Jakeman, Chandrika Kamath, Ramakrishnan Kannan, Miles E. Lopes, Per-Gunnar Martinsson, Kary Myers, Jelani Nelson, Juan M. Restrepo, C. Seshadhri, Draguna Vrabie, Brendt Wohlberg, Stephen J. Wright, Chao Yang, Peter Zwart

Abstract: Randomized algorithms have propelled advances in artificial intelligence and represent a foundational research area in advancing AI for Science. Future advancements in DOE Office of Science priority areas such as climate science, astrophysics, fusion, advanced materials, combustion, and quantum computing all require randomized algorithms for surmounting challenges of complexity, robustness, and scalability. This report summarizes the outcomes of that workshop, "Randomized Algorithms for Scientific Computing (RASC)," held virtually across four days in December 2020 and January 2021.

Submitted 21 March, 2022; v1 submitted 19 April, 2021; originally announced April 2021.

arXiv:2102.13643 [pdf, other]

Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums

Authors: Chaobing Song, Stephen J. Wright, Jelena Diakonikolas

Abstract: We study structured nonsmooth convex finite-sum optimization that appears widely in machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called \emph{Variance Reduction via Primal-Dual Accelerated Dual Averaging (\vrpda)}. In the nonsmooth and general convex setting, \vrpda~has the overall complexity $O(nd\log\min \{1/ε, n\} + d/ε)$ in terms of the primal-dual gap, where $n$ denotes the number of samples, $d$ the dimension of the primal variables, and $ε$ the desired accuracy. In the nonsmooth and strongly convex setting, the overall complexity of \vrpda~becomes $O(nd\log\min\{1/ε, n\} + d/\sqrtε)$ in terms of both the primal-dual gap and the distance between iterate and optimal solution. Both these results for \vrpda~improve significantly on state-of-the-art complexity estimates, which are $O(nd\log \min\{1/ε, n\} + \sqrt{n}d/ε)$ for the nonsmooth and general convex setting and $O(nd\log \min\{1/ε, n\} + \sqrt{n}d/\sqrtε)$ for the nonsmooth and strongly convex setting, in a much more simple and straightforward way. Moreover, both complexities are better than \emph{lower} bounds for general convex finite sums that lack the particular (common) structure that we consider. Our theoretical results are supported by numerical experiments, which confirm the competitive performance of \vrpda~compared to state-of-the-art.

Submitted 7 April, 2021; v1 submitted 26 February, 2021; originally announced February 2021.

Comments: 33 pages, 18 figures

arXiv:2010.11366 [pdf, ps, other]

Random Coordinate Underdamped Langevin Monte Carlo

Authors: Zhiyan Ding, Qin Li, Jianfeng Lu, Stephen J. Wright

Abstract: The Underdamped Langevin Monte Carlo (ULMC) is a popular Markov chain Monte Carlo sampling method. It requires the computation of the full gradient of the log-density at each iteration, an expensive operation if the dimension of the problem is high. We propose a sampling method called Random Coordinate ULMC (RC-ULMC), which selects a single coordinate at each iteration to be updated and leaves the other coordinates untouched. We investigate the computational complexity of RC-ULMC and compare it with the classical ULMC for strongly log-concave probability distributions. We show that RC-ULMC is always cheaper than the classical ULMC, with a significant cost reduction when the problem is highly skewed and high dimensional. Our complexity bound for RC-ULMC is also tight in terms of dimension dependence.

Submitted 21 October, 2020; originally announced October 2020.

arXiv:2010.01405 [pdf, ps, other]

Random Coordinate Langevin Monte Carlo

Authors: Zhiyan Ding, Qin Li, Jianfeng Lu, Stephen J. Wright

Abstract: Langevin Monte Carlo (LMC) is a popular Markov chain Monte Carlo sampling method. One drawback is that it requires the computation of the full gradient at each iteration, an expensive operation if the dimension of the problem is high. We propose a new sampling method: Random Coordinate LMC (RC-LMC). At each iteration, a single coordinate is randomly selected to be updated by a multiple of the partial derivative along this direction plus noise, and all other coordinates remain untouched. We investigate the total complexity of RC-LMC and compare it with the classical LMC for log-concave probability distributions. When the gradient of the log-density is Lipschitz, RC-LMC is less expensive than the classical LMC if the log-density is highly skewed for high dimensional problems, and when both the gradient and the Hessian of the log-density are Lipschitz, RC-LMC is always cheaper than the classical LMC, by a factor proportional to the square root of the problem dimension. In the latter case, our estimate of complexity is sharp with respect to the dimension.

Submitted 3 October, 2020; originally announced October 2020.
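
The coordinate update described in the RC-LMC abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`rc_lmc_step`, `gaussian_partial`, `run_chain`), the uniform coordinate selection, the single fixed step size `step`, and the `sqrt(2 * step)` noise scale are all assumptions borrowed from the standard overdamped Langevin discretization, and the standard Gaussian target is chosen only to keep the example short.

```python
import math
import random


def rc_lmc_step(x, partial_grad, step, rng):
    """One illustrative RC-LMC-style step: update a single random coordinate in place.

    x            -- current state (list of floats), modified in place
    partial_grad -- hypothetical callable (x, i) -> d f / d x_i, where f = -log density
    step         -- assumed fixed step size h > 0
    rng          -- a random.Random instance
    Returns the index of the coordinate that was updated.
    """
    i = rng.randrange(len(x))        # assumption: coordinate chosen uniformly at random
    noise = rng.gauss(0.0, 1.0)      # standard Gaussian noise
    # Multiple of the partial derivative plus noise; all other coordinates are untouched.
    x[i] = x[i] - step * partial_grad(x, i) + math.sqrt(2.0 * step) * noise
    return i


def gaussian_partial(x, i):
    """Partial derivative of f(x) = 0.5 * ||x||^2 (standard Gaussian target)."""
    return x[i]


def run_chain(dim=3, n_steps=1000, step=0.05, seed=0):
    """Run the sketch from the origin and return the list of visited states."""
    rng = random.Random(seed)
    x = [0.0] * dim
    samples = []
    for _ in range(n_steps):
        rc_lmc_step(x, gaussian_partial, step, rng)
        samples.append(list(x))
    return samples
```

Each step evaluates one partial derivative rather than the full gradient, which is the per-iteration saving the abstract refers to; the complexity comparison with classical LMC is not reproduced by this sketch.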

arXiv:2005.13815 [pdf, ps, other]

Adversarial Classification via Distributional Robustness with Wasserstein Ambiguity

Authors: Nam Ho-Nguyen, Stephen J. Wright

Abstract: We study a model for adversarial classification based on distributionally robust chance constraints. We show that under Wasserstein ambiguity, the model aims to minimize the conditional value-at-risk of the distance to misclassification, and we explore links to adversarial classification models proposed earlier and to maximum-margin classifiers. We also provide a reformulation of the distributionally robust model for linear classification, and show it is equivalent to minimizing a regularized ramp loss objective. Numerical experiments show that, despite the nonconvexity of this formulation, standard descent methods appear to converge to the global minimizer for this problem. Inspired by this observation, we show that, for a certain class of distributions, the only stationary point of the regularized ramp loss minimization problem is the global minimizer.

Submitted 3 November, 2021; v1 submitted 28 May, 2020; originally announced May 2020.

Comments: 32 pages

arXiv:1912.08756 [pdf, other]

Interleaved Composite Quantization for High-Dimensional Similarity Search

Authors: Soroosh Khoram, Stephen J Wright, Jing Li

Abstract: Similarity search retrieves the nearest neighbors of a query vector from a dataset of high-dimensional vectors. As the size of the dataset grows, the cost of performing the distance computations needed to implement a query can become prohibitive. A method often used to reduce this computational cost is quantization of the vector space and location-based encoding of the dataset vectors. These encodings can be used during query processing to find approximate nearest neighbors of the query point quickly. Search speed can be improved by using shorter codes, but shorter codes have higher quantization error, leading to degraded precision. In this work, we propose Interleaved Composite Quantization (ICQ), which achieves fast similarity search without using shorter codes. In ICQ, a small subset of the code is used to approximate the distances, with complete codes being used only when necessary. Our method effectively reduces both code length and quantization error. Furthermore, ICQ is compatible with several recently proposed techniques for reducing quantization error and can be used in conjunction with these other techniques to improve results. We confirm these claims and show strong empirical performance of ICQ using several synthetic and real-world datasets.

Submitted 18 December, 2019; v1 submitted 18 December, 2019; originally announced December 2019.

arXiv:1912.06508 [pdf, other]

A Distributed Quasi-Newton Algorithm for Primal and Dual Regularized Empirical Risk Minimization

Authors: Ching-pei Lee, Cong Han Lim, Stephen J. Wright

Abstract: We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving empirical risk minimization (ERM) problems with a nonsmooth regularization term. Our algorithm is applicable to both the primal and the dual ERM problem. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations of the smooth part, and we describe how to maintain an approximation of the (generalized) Hessian and solve subproblems efficiently in a distributed manner. When applied to the distributed dual ERM problem, unlike the state of the art, which takes only the block-diagonal part of the Hessian, our approach is able to utilize global curvature information and is thus orders of magnitude faster. The proposed method enjoys global linear convergence for a broad range of non-strongly convex problems that includes the most commonly used ERMs, thus requiring lower communication complexity. It also converges on non-convex problems, so it has the potential to be used on applications such as deep learning. Computational results demonstrate that our method significantly improves on communication cost and running time over the current state-of-the-art methods.

Submitted 12 December, 2019; originally announced December 2019.

Comments: arXiv admin note: text overlap with arXiv:1803.01370

arXiv:1803.01370 [pdf, other]

doi 10.1145/3219819.3220075

A Distributed Quasi-Newton Algorithm for Empirical Risk Minimization with Nonsmooth Regularization

Authors: Ching-pei Lee, Cong Han Lim, Stephen J. Wright

Abstract: We propose a communication- and computation-efficient distributed optimization algorithm using second-order information for solving ERM problems with a nonsmooth regularization term. Current second-order and quasi-Newton methods for this problem either do not work well in the distributed setting or work only for specific regularizers. Our algorithm uses successive quadratic approximations, and we describe how to maintain an approximation of the Hessian and solve subproblems efficiently in a distributed manner. The proposed method enjoys global linear convergence for a broad range of non-strongly convex problems that includes the most commonly used ERMs, thus requiring lower communication complexity. It also converges on non-convex problems, so it has the potential to be used on applications such as deep learning. Initial computational results on convex problems demonstrate that our method significantly improves on communication cost and running time over the current state-of-the-art methods.

Submitted 26 May, 2018; v1 submitted 4 March, 2018; originally announced March 2018.

Comments: In the proceedings of The 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018

arXiv:1801.08019 [pdf, other]

Training Set Debugging Using Trusted Items

Authors: Xuezhou Zhang, Xiaojin Zhu, Stephen J. Wright

Abstract: Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for manual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by itself be adequate for learning, so we propose an algorithm that uses these items to identify bugs in the training set and thus improves learning. Specifically, our approach seeks the smallest set of changes to the training set labels such that the model learned from this corrected training set predicts labels of the trusted items correctly. We flag the items whose labels are changed as potential bugs, whose labels can be checked for veracity by human experts. To find the bugs in this way is a challenging combinatorial bilevel optimization problem, but it can be relaxed into a continuous optimization problem. Experiments on toy and real data demonstrate that our approach can identify training set bugs effectively and suggest appropriate changes to the labels. Our algorithm is a step toward trustworthy machine learning.

Submitted 24 January, 2018; originally announced January 2018.

Comments: AAAI 2018

arXiv:1710.05916 [pdf, other]

Using Neural Networks to Detect Line Outages from PMU Data

Authors: Ching-pei Lee, Stephen J. Wright

Abstract: We propose an approach based on neural networks and the AC power flow equations to identify single- and double-line outages in a power grid using the information from phasor measurement unit sensors (PMUs) placed on only a subset of the buses. Rather than inferring the outage from the sensor data by inverting the physical model, our approach uses the AC model to simulate sensor responses to all outages of interest under multiple demand and seasonal conditions, and uses the resulting data to train a neural network classifier to recognize and discriminate between different outage events directly from sensor data. After training, real-time deployment of the classifier requires just a few matrix-vector products and simple vector operations. These operations can be executed much more rapidly than inversion of a model based on AC power flow, which consists of nonlinear equations and possibly integer / binary variables representing line outages, as well as the variables representing voltages and power flows. We are motivated to use neural networks by their successful application to such areas as computer vision and natural language processing. Neural networks automatically find nonlinear transformations of the raw data that highlight useful features that make the classification task easier. We describe a principled way to choose sensor locations and show that accurate classification of line outages can be achieved from a restricted set of measurements, even over a wide range of demand profiles.

Submitted 27 March, 2018; v1 submitted 16 October, 2017; originally announced October 2017.

arXiv:1309.6964 [pdf, other]

Online Algorithms for Factorization-Based Structure from Motion

Authors: Ryan Kennedy, Laura Balzano, Stephen J. Wright, Camillo J. Taylor

Abstract: We present a family of online algorithms for real-time factorization-based structure from motion, leveraging a relationship between incremental singular value decomposition and recently proposed methods for online matrix completion. Our methods are orders of magnitude faster than previous state of the art, can handle missing data and a variable number of feature points, and are robust to noise and sparse outliers. We demonstrate our methods on both real and synthetic sequences and show that they perform well in both online and batch settings. We also provide an implementation which is able to produce 3D models in real time using a laptop with a webcam.

Submitted 16 July, 2016; v1 submitted 26 September, 2013; originally announced September 2013.

arXiv:1307.5494 [pdf, other]

On GROUSE and Incremental SVD

Authors: Laura Balzano, Stephen J. Wright

Abstract: GROUSE (Grassmannian Rank-One Update Subspace Estimation) is an incremental algorithm for identifying a subspace of $\mathbb{R}^n$ from a sequence of vectors in this subspace, where only a subset of components of each vector is revealed at each iteration. Recent analysis has shown that GROUSE converges locally at an expected linear rate, under certain assumptions. GROUSE has a similar flavor to the incremental singular value decomposition algorithm, which updates the SVD of a matrix following addition of a single column. In this paper, we modify the incremental SVD approach to handle missing data, and demonstrate that this modified approach is equivalent to GROUSE, for a certain choice of an algorithmic parameter.

Submitted 20 July, 2013; originally announced July 2013.

arXiv:1207.0577 [pdf, ps, other]

Robust Dequantized Compressive Sensing

Authors: Ji Liu, Stephen J. Wright

Abstract: We consider the reconstruction problem in compressed sensing in which the observations are recorded in a finite number of bits. They may thus contain quantization errors (from being rounded to the nearest representable value) and saturation errors (from being outside the range of representable values). Our formulation has an objective of weighted $\ell_2$-$\ell_1$ type, along with constraints that account explicitly for quantization and saturation errors, and is solved with an augmented Lagrangian method. We prove a consistency result for the recovered solution, stronger than those that have appeared to date in the literature, showing in particular that asymptotic consistency can be obtained without oversampling. We present extensive computational comparisons with formulations proposed previously, and variants thereof.

Submitted 10 October, 2013; v1 submitted 3 July, 2012; originally announced July 2012.

arXiv:1111.0432 [pdf, ps, other]

Approximate Stochastic Subgradient Estimation Training for Support Vector Machines

Authors: Sangkyun Lee, Stephen J. Wright

Abstract: Subgradient algorithms for training support vector machines have been quite successful for solving large-scale and online learning problems. However, they have been restricted to linear kernels and strongly convex formulations. This paper describes efficient subgradient approaches without such limitations. Our approaches make use of randomized low-dimensional approximations to nonlinear kernels, and minimization of a reduced primal formulation using an algorithm based on robust stochastic approximation, which does not require strong convexity. Experiments illustrate that our approaches produce solutions with prediction accuracy comparable to that of solutions obtained from existing SVM solvers, but often in much shorter time. We also suggest efficient prediction schemes that depend only on the dimension of kernel approximation, not on the number of support vectors.

Submitted 3 November, 2011; v1 submitted 2 November, 2011; originally announced November 2011.

Comments: An extended version of the ICPRAM 2012 paper

arXiv:1106.5730 [pdf, other]

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

Authors: Feng Niu, Benjamin Recht, Christopher Re, Stephen J. Wright

Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.

Submitted 11 November, 2011; v1 submitted 28 June, 2011; originally announced June 2011.

Comments: 22 pages, 10 figures

arXiv:1104.4385 [pdf, other]

Convex Approaches to Model Wavelet Sparsity Patterns

Authors: Nikhil S Rao, Robert D. Nowak, Stephen J. Wright, Nick G. Kingsbury

Abstract: Statistical dependencies among wavelet coefficients are commonly represented by graphical models such as hidden Markov trees (HMTs). However, in linear inverse problems such as deconvolution, tomography, and compressed sensing, the presence of a sensing or observation matrix produces a linear mixing of the simple Markovian dependency structure. This leads to reconstruction problems that are non-convex optimizations. Past work has dealt with this issue by resorting to greedy or suboptimal iterative reconstruction methods. In this paper, we propose new modeling approaches based on group-sparsity penalties that lead to convex optimizations that can be solved exactly and efficiently. We show that the methods we develop perform significantly better in deconvolution and compressed sensing applications, while being as computationally efficient as standard coefficient-wise approaches such as lasso.

Submitted 22 April, 2011; originally announced April 2011.

Showing 1–32 of 32 results for author: Wright, S J