Stochastic Nested Compositional Bi-level Optimization for Robust Feature Learning

Authors: Xuxing Chen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We develop and analyze stochastic approximation algorithms for solving nested compositional bi-level optimization problems. These problems involve a nested composition of $T$ potentially non-convex smooth functions in the upper-level, and a smooth and strongly convex function in the lower-level. Our proposed algorithm does not rely on matrix inversions or mini-batches and can achieve an $ε$-stationary solution with an oracle complexity of approximately $\tilde{O}_T(1/ε^{2})$, assuming the availability of stochastic first-order oracles for the individual functions in the composition and the lower-level, which are unbiased and have bounded moments. Here, $\tilde{O}_T$ hides polylog factors and constants that depend on $T$. The key challenge we address in establishing this result relates to handling three distinct sources of bias in the stochastic gradients. The first source arises from the compositional nature of the upper-level, the second stems from the bi-level structure, and the third emerges due to the utilization of Neumann series approximations to avoid matrix inversion. To demonstrate the effectiveness of our approach, we apply it to the problem of robust feature learning for deep neural networks under covariate shift, showcasing the benefits and advantages of our methodology in that context.

Submitted 11 July, 2023; originally announced July 2023.
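
The abstract above mentions Neumann series approximations as a way to avoid matrix inversion. The following is a minimal deterministic sketch of that general idea, not the authors' stochastic algorithm: for a symmetric positive definite matrix $A$ whose eigenvalues lie in $(0, L]$, the product $A^{-1}v$ can be approximated by the truncated series $\frac{1}{L}\sum_{k=0}^{K-1}(I - A/L)^k v$. The names `matvec`, `neumann_inverse_apply`, `L`, and `num_terms` are illustrative assumptions and do not come from the paper.

```python
def matvec(A, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def neumann_inverse_apply(A, v, L, num_terms):
    """Approximate A^{-1} v by (1/L) * sum_{k=0}^{num_terms-1} (I - A/L)^k v.

    Assumes A is symmetric positive definite with eigenvalues in (0, L],
    so that the series converges. Only matrix-vector products are used.
    """
    term = list(v)   # holds (I - A/L)^k v, starting at k = 0
    total = list(v)  # running sum of the terms
    for _ in range(num_terms - 1):
        A_term = matvec(A, term)
        term = [t - at / L for t, at in zip(term, A_term)]
        total = [s + t for s, t in zip(total, term)]
    return [s / L for s in total]


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 2.0]]  # eigenvalues 1 and 3
    x = neumann_inverse_apply(A, [1.0, 0.0], L=4.0, num_terms=200)
    print(x)  # should be close to the exact solution [2/3, -1/3]
```

Truncating the series leaves an approximation error, which is one reason a method built on it has to account for bias, as the abstract notes.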

arXiv:2302.09766 [pdf, other]

A One-Sample Decentralized Proximal Algorithm for Non-Convex Stochastic Composite Optimization

Authors: Tesi Xiao, Xuxing Chen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We focus on decentralized stochastic non-convex optimization, where $n$ agents work together to optimize a composite objective function which is a sum of a smooth term and a non-smooth convex term. To solve this problem, we propose two single-time scale algorithms: Prox-DASA and Prox-DASA-GT. These algorithms can find $ε$-stationary points in $\mathcal{O}(n^{-1}ε^{-2})$ iterations using constant batch sizes (i.e., $\mathcal{O}(1)$). Unlike prior work, our algorithms achieve comparable complexity without requiring large batch sizes, more complex per-iteration operations (such as double loops), or stronger assumptions. Our theoretical findings are supported by extensive numerical experiments, which demonstrate the superiority of our algorithms over previous approaches. Our code is available at https://github.com/xuxingc/ProxDASA.

Submitted 22 June, 2023; v1 submitted 20 February, 2023; originally announced February 2023.

Comments: UAI 2023

arXiv:2206.11346 [pdf, other]

Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $ε$-stationary point is of the order $\mathcal{O}(1/ε^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/ε^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.

Submitted 8 November, 2022; v1 submitted 22 June, 2022; originally announced June 2022.

Comments: 2 figures

arXiv:2202.04296 [pdf, ps, other]

A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization

Authors: Tesi Xiao, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $ε$-stationary solution, are of order $\mathcal{O}_T(ε^{-2})$ and $\mathcal{O}_T(ε^{-3})$ respectively, where $\mathcal{O}_T$ hides constants in $T$. Notably, the dependence of these complexity bounds on $ε$ and $T$ are separate in the sense that changing one does not impact the dependence of the bounds on the other. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.

Submitted 9 October, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: To appear in NeurIPS 2022

arXiv:2009.13016 [pdf, ps, other]

Escaping Saddle-Points Faster under Interpolation-like Conditions

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra

Abstract: In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $ε$-local-minimizer, matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/ε^{2})$. We next analyze Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $ε$-local-minimizer under interpolation-like conditions, is $\tilde{\mathcal{O}}(1/ε^{2.5})$. While this obtained complexity is better than the corresponding complexity of either PSGD, or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{\mathcal{O}}(1/ε^{1.5})$ corresponding to deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.

Submitted 27 September, 2020; originally announced September 2020.

Comments: To appear in NeurIPS, 2020

arXiv:2008.10526 [pdf, other]

Stochastic Multi-level Composition Optimization Algorithms with Level-Independent Convergence Rates

Authors: Krishnakumar Balasubramanian, Saeed Ghadimi, Anthony Nguyen

Abstract: In this paper, we study smooth stochastic multi-level composition optimization problems, where the objective function is a nested composition of $T$ functions. We assume access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle. For solving this class of problems, we propose two algorithms using moving-average stochastic estimates, and analyze their convergence to an $ε$-stationary point of the problem. We show that the first algorithm, which is a generalization of \cite{GhaRuswan20} to the $T$ level case, can achieve a sample complexity of $\mathcal{O}(1/ε^6)$ by using mini-batches of samples in each iteration. By modifying this algorithm using linearized stochastic estimates of the function values, we improve the sample complexity to $\mathcal{O}(1/ε^4)$. This modification not only removes the requirement of having a mini-batch of samples in each iteration, but also makes the algorithm parameter-free and easy to implement. To the best of our knowledge, this is the first time that such an online algorithm designed for the (un)constrained multi-level setting, obtains the same sample complexity of the smooth single-level setting, under standard assumptions (unbiasedness and boundedness of the second moments) on the stochastic first-order oracle.

Submitted 14 February, 2022; v1 submitted 24 August, 2020; originally announced August 2020.

Comments: Fixed some typos

arXiv:2006.08167 [pdf, other]

Improved Complexities for Stochastic Conditional Gradient Methods under Interpolation-like Conditions

Authors: Tesi Xiao, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: We analyze stochastic conditional gradient methods for constrained optimization problems arising in over-parametrized machine learning. We show that one could leverage the interpolation-like conditions satisfied by such models to obtain improved oracle complexities. Specifically, when the objective function is convex, we show that the conditional gradient method requires $\mathcal{O}(ε^{-2})$ calls to the stochastic gradient oracle to find an $ε$-optimal solution. Furthermore, by including a gradient sliding step, we show that the number of calls reduces to $\mathcal{O}(ε^{-1.5})$.

Submitted 26 January, 2022; v1 submitted 15 June, 2020; originally announced June 2020.

arXiv:1907.13616 [pdf, ps, other]

Multi-Point Bandit Algorithms for Nonstationary Online Nonconvex Optimization

Authors: Abhishek Roy, Krishnakumar Balasubramanian, Saeed Ghadimi, Prasant Mohapatra

Abstract: Bandit algorithms have been predominantly analyzed in the convex setting with function-value based stationary regret as the performance measure. In this paper, motivated by online reinforcement learning problems, we propose and analyze bandit algorithms for both general and structured nonconvex problems with nonstationary (or dynamic) regret as the performance measure, in both stochastic and non-stochastic settings. First, for general nonconvex functions, we consider nonstationary versions of first-order and second-order stationary solutions as a regret measure, motivated by similar performance measures for offline nonconvex optimization. In the case of second-order stationary solution based regret, we propose and analyze online and bandit versions of the cubic regularized Newton's method. The bandit version is based on estimating the Hessian matrices in the bandit setting, based on second-order Gaussian Stein's identity. Our nonstationary regret bounds in terms of second-order stationary solutions have interesting consequences for avoiding saddle points in the bandit setting. Next, for weakly quasi convex functions and monotone weakly submodular functions we consider nonstationary regret measures in terms of function-values; such structured classes of nonconvex functions enable one to consider regret measure defined in terms of function values, similar to convex functions. For this case of function-value, and first-order stationary solution based regret measures, we provide regret bounds in both the low- and high-dimensional settings, for some scenarios.

Submitted 11 September, 2019; v1 submitted 31 July, 2019; originally announced July 2019.

arXiv:1902.01373 [pdf, ps, other]

Stochastic Zeroth-order Discretizations of Langevin Diffusions for Bayesian Inference

Authors: Abhishek Roy, Lingqing Shen, Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: Discretizations of Langevin diffusions provide a powerful method for sampling and Bayesian inference. However, such discretizations require evaluation of the gradient of the potential function. In several real-world scenarios, obtaining gradient evaluations might either be computationally expensive, or simply impossible. In this work, we propose and analyze stochastic zeroth-order sampling algorithms for discretizing overdamped and underdamped Langevin diffusions. Our approach is based on estimating the gradients, based on Gaussian Stein's identities, widely used in the stochastic optimization literature. We provide a comprehensive sample complexity analysis -- number of noisy function evaluations to be made to obtain an $ε$-approximate sample in Wasserstein distance -- of stochastic zeroth-order discretizations of both overdamped and underdamped Langevin diffusions, under various noise models. We also propose a variable selection technique based on zeroth-order gradient estimates and establish its theoretical guarantees. Our theoretical contributions extend the practical applicability of sampling algorithms to the noisy black-box and high-dimensional settings.

Submitted 17 January, 2021; v1 submitted 4 February, 2019; originally announced February 2019.

arXiv:1809.06474 [pdf, ps, other]

Zeroth-order Nonconvex Stochastic Optimization: Handling Constraints, High-Dimensionality and Saddle-Points

Authors: Krishnakumar Balasubramanian, Saeed Ghadimi

Abstract: In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization, with a focus on addressing constrained optimization, high-dimensional setting and saddle-point avoiding. To handle constrained optimization, we first propose generalizations of the conditional gradient algorithm achieving rates similar to the standard stochastic gradient algorithm using only zeroth-order information. To facilitate zeroth-order optimization in high-dimensions, we explore the advantages of structural sparsity assumptions. Specifically, (i) we highlight an implicit regularization phenomenon where the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand by just varying the step-size and (ii) propose a truncated stochastic gradient algorithm with zeroth-order information, whose rate of convergence depends only poly-logarithmically on the dimensionality. We next focus on avoiding saddle-points in non-convex setting. Towards that, we interpret the Gaussian smoothing technique for estimating gradient based on zeroth-order information as an instantiation of first-order Stein's identity. Based on this, we provide a novel linear-(in dimension) time estimator of the Hessian matrix of a function using only zeroth-order information, which is based on second-order Stein's identity. We then provide an algorithm for avoiding saddle-points, which is based on a zeroth-order cubic regularization Newton's method and discuss its convergence rates.

Submitted 13 January, 2019; v1 submitted 17 September, 2018; originally announced September 2018.
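
The abstract above refers to the Gaussian smoothing technique for estimating a gradient from function values only. A minimal sketch of the standard forward-difference form of that estimator follows, assuming a noiseless function oracle: it averages $\frac{f(x + \nu u) - f(x)}{\nu}\,u$ over draws $u \sim N(0, I)$. The function name `gaussian_smoothing_gradient` and the parameters `nu`, `num_samples`, and `rng` are illustrative assumptions and are not taken from the paper.

```python
import random


def gaussian_smoothing_gradient(f, x, nu, num_samples, rng):
    """Estimate the gradient of f at x using only function evaluations.

    Averages ((f(x + nu*u) - f(x)) / nu) * u over Gaussian directions u.
    The smoothing radius nu and the sample count are tuning choices.
    """
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        shifted = [xi + nu * ui for xi, ui in zip(x, u)]
        scale = (f(shifted) - fx) / nu
        for i in range(d):
            grad[i] += scale * u[i] / num_samples
    return grad


if __name__ == "__main__":
    def quadratic(x):
        return 0.5 * sum(xi * xi for xi in x)  # true gradient is x itself

    estimate = gaussian_smoothing_gradient(
        quadratic, [1.0, -2.0], nu=1e-3, num_samples=20000, rng=random.Random(0)
    )
    print(estimate)  # a noisy approximation of [1.0, -2.0]
```

The estimate is noisy for any finite sample count, so the printed values only approximate the true gradient.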

arXiv:1508.07384 [pdf, ps, other]

Generalized Uniformly Optimal Methods for Nonlinear Programming

Authors: Saeed Ghadimi, Guanghui Lan, Hongchao Zhang

Abstract: In this paper, we present a generic framework to extend existing uniformly optimal convex programming algorithms to solve more general nonlinear, possibly nonconvex, optimization problems. The basic idea is to incorporate a local search step (gradient descent or Quasi-Newton iteration) into these uniformly optimal convex programming methods, and then enforce a monotone decreasing property of the function values computed along the trajectory. Algorithms of these types will then achieve the best known complexity for nonconvex problems, and the optimal complexity for convex ones without requiring any problem parameters. As a consequence, we can have a unified treatment for a general class of nonlinear programming problems regardless of their convexity and smoothness level. In particular, we show that the accelerated gradient and level methods, both originally designed for solving convex optimization problems only, can be used for solving both convex and nonconvex problems uniformly. In a similar vein, we show that some well-studied techniques for nonlinear programming, e.g., Quasi-Newton iteration, can be embedded into optimal convex optimization algorithms to possibly further enhance their numerical performance. Our theoretical and algorithmic developments are complemented by some promising numerical results obtained for solving a few important nonconvex and nonlinear data analysis problems in the literature.

Submitted 12 September, 2015; v1 submitted 28 August, 2015; originally announced August 2015.

arXiv:1309.5549 [pdf, ps, other]

Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming

Authors: Saeed Ghadimi, Guanghui Lan

Abstract: In this paper, we introduce a new stochastic approximation (SA) type algorithm, namely the randomized stochastic gradient (RSG) method, for solving an important class of nonlinear (possibly nonconvex) stochastic programming (SP) problems. We establish the complexity of this method for computing an approximate stationary point of a nonlinear programming problem. We also show that this method possesses a nearly optimal rate of convergence if the problem is convex. We discuss a variant of the algorithm which consists of applying a post-optimization phase to evaluate a short list of solutions generated by several independent runs of the RSG method, and show that such modification allows to improve significantly the large-deviation properties of the algorithm. These methods are then specialized for solving a class of simulation-based optimization problems in which only stochastic zeroth-order information is available.

Submitted 21 September, 2013; originally announced September 2013.
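
The abstract above does not spell out the randomized stochastic gradient method, so the following is only a rough sketch of one common reading of the idea: run stochastic gradient steps with a constant step size and return the iterate at a randomly chosen index. The uniform choice of that index, the constant step size, and the names `randomized_sgd`, `stoch_grad`, `step_size`, and `num_iters` are assumptions made for illustration, not details confirmed by the listing.

```python
import random


def randomized_sgd(stoch_grad, x0, step_size, num_iters, rng):
    """Run SGD and return the iterate at a uniformly random index.

    stoch_grad(x, rng) is a user-supplied (hypothetical) stochastic
    gradient oracle. Stopping at the sampled index is equivalent to
    running all num_iters steps and then selecting that iterate.
    """
    stop_at = rng.randrange(num_iters)  # index R drawn from {0, ..., num_iters - 1}
    x = list(x0)
    for _ in range(stop_at):
        g = stoch_grad(x, rng)
        x = [xi - step_size * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    def noisy_quadratic_grad(x, rng):
        # Gradient of 0.5 * ||x||^2 plus zero-mean Gaussian noise.
        return [xi + rng.gauss(0.0, 0.1) for xi in x]

    print(randomized_sgd(noisy_quadratic_grad, [4.0, -2.0], 0.1, 100, random.Random(0)))
```

The post-optimization variant mentioned in the abstract could be imitated by calling `randomized_sgd` several times with independent generators and keeping the candidate with the smallest estimated gradient norm, again as an illustration only.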

Showing 1–12 of 12 results for author: Ghadimi, S