Search | arXiv e-print repository

A line search framework with restarting for noisy optimization problems

Authors: Albert S. Berahas, Michael J. O'Neill, Clément W. Royer

Abstract: Nonlinear optimization methods are typically iterative and make use of gradient information to determine a direction of improvement and function information to effectively check for progress. When this information is corrupted by noise, designing a convergent and practical algorithmic process becomes challenging, as care must be taken to avoid taking bad steps due to erroneous information. For this reason, simple gradient-based schemes have been quite popular, despite being outperformed by more advanced techniques in the noiseless setting. In this paper, we propose a general algorithmic framework based on line search that is endowed with iteration and evaluation complexity guarantees even in a noisy setting. These guarantees are obtained as a result of a restarting condition that monitors desirable properties for the steps taken at each iteration and can be checked even in the presence of noise. Experiments using a nonlinear conjugate gradient variant and a quasi-Newton variant illustrate that restarting can be performed without compromising practical efficiency and robustness.

Submitted 3 June, 2025; originally announced June 2025.

arXiv:2505.19382 [pdf, ps, other]

Retrospective Approximation Sequential Quadratic Programming for Stochastic Optimization with General Deterministic Nonlinear Constraints

Authors: Albert S. Berahas, Raghu Bollapragada, Shagun Gupta

Abstract: In this paper, we propose a framework based on the Retrospective Approximation (RA) paradigm to solve optimization problems with a stochastic objective function and general nonlinear deterministic constraints. This framework sequentially constructs increasingly accurate approximations of the true problems which are solved to a specified accuracy via a deterministic solver, thereby decoupling the uncertainty from the optimization. Such frameworks retain the advantages of deterministic optimization methods, such as fast convergence, while achieving the optimal performance of stochastic methods without the need to redesign algorithmic components. For problems with general nonlinear equality constraints, we present a framework that can employ any deterministic solver and analyze its theoretical work complexity. We then present an instance of the framework that employs a deterministic Sequential Quadratic Programming (SQP) method and that achieves optimal complexity in terms of gradient evaluations and linear system solves for this class of problems. For problems with general nonlinear constraints, we present an RA-based algorithm that employs an SQP method with robust subproblems. Finally, we demonstrate the empirical performance of the proposed framework on multi-class logistic regression problems and benchmark instances from the CUTEst test set, comparing its results to established methods from the literature.

Submitted 25 May, 2025; originally announced May 2025.

Comments: 63 pages, 9 figures

MSC Class: 49M05; 49M37; 65K05; 90C06; 90C25; 90C30; 90C35

arXiv:2503.06702 [pdf, other]

Optimistic Noise-Aware Sequential Quadratic Programming for Equality Constrained Optimization with Rank-Deficient Jacobians

Authors: Albert S. Berahas, Jiahao Shi, Baoyu Zhou

Abstract: We propose and analyze a sequential quadratic programming algorithm for minimizing a noisy nonlinear smooth function subject to noisy nonlinear smooth equality constraints. The algorithm uses a step decomposition strategy and, as a result, is robust to potential rank-deficiency in the constraints, allows for two different step size strategies, and has an early stopping mechanism. Under the linear independence constraint qualification, convergence is established to a neighborhood of a first-order stationary point, where the radius of the neighborhood is proportional to the noise levels in the objective function and constraints. Moreover, in the rank-deficient setting, the merit parameter may converge to zero, and convergence to a neighborhood of an infeasible stationary point is established. Numerical experiments demonstrate the efficiency and robustness of the proposed method.

Submitted 9 March, 2025; originally announced March 2025.

arXiv:2411.10378 [pdf, other]

Exploiting Negative Curvature in Conjunction with Adaptive Sampling: Theoretical Results and a Practical Algorithm

Authors: Albert S. Berahas, Raghu Bollapragada, Wanping Dong

Abstract: In this paper, we propose algorithms that exploit negative curvature for solving noisy nonlinear nonconvex unconstrained optimization problems. We consider both deterministic and stochastic inexact settings, and develop two-step algorithms that combine directions of negative curvature and descent directions to update the iterates. Under reasonable assumptions, we prove second-order convergence results and derive complexity guarantees for both settings. To tackle large-scale problems, we develop a practical variant that utilizes the conjugate gradient method with negative curvature detection and early stopping to compute a step, a simple adaptive step size scheme, and a strategy for selecting the sample sizes of the gradient and Hessian approximations as the optimization progresses. Numerical results on two machine learning problems showcase the efficacy and efficiency of the practical method.

Submitted 15 November, 2024; originally announced November 2024.

Comments: 39 pages, 6 figures

arXiv:2406.11144 [pdf, other]

Modified Line Search Sequential Quadratic Methods for Equality-Constrained Optimization with Unified Global and Local Convergence Guarantees

Authors: Albert S. Berahas, Raghu Bollapragada, Jiahao Shi

Abstract: In this paper, we propose a method that has foundations in the line search sequential quadratic programming paradigm for solving general nonlinear equality constrained optimization problems. The method employs a carefully designed modified line search strategy that utilizes second-order information of both the objective and constraint functions, as required, to mitigate the Maratos effect. Contrary to classical line search sequential quadratic programming methods, our proposed method is endowed with global convergence and local superlinear convergence guarantees. Moreover, we extend the method and analysis to the setting in which the constraint functions are deterministic but the objective function is stochastic or can be represented as a finite-sum. We also design and implement a practical inexact matrix-free variant of the method. Finally, numerical results illustrate the efficiency and efficacy of the method.

Submitted 26 July, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

arXiv:2404.14758 [pdf, other]

Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

Authors: Sachin Garg, Albert S. Berahas, Michał Dereziński

Abstract: We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton ($\texttt{Mb-SVRN}$), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n\gg \alpha^2\kappa$, where $\kappa$ is the condition number and $\alpha$ is the Hessian approximation factor, then $\texttt{Mb-SVRN}$ achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long as $b$ is in the range between $1$ and $b_{\max}=O(n/(\alpha\log n))$. Only after the mini-batch size increases past this critical point $b_{\max}$ does the method begin to transition into a standard Newton-type algorithm, which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks showing that, after tuning the step size, the convergence rate of $\texttt{Mb-SVRN}$ remains fast for a wide range of mini-batch sizes, and the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $\alpha$ aligns with our theoretical predictions.

Submitted 23 April, 2024; originally announced April 2024.

MSC Class: 65K05; 90C06; 90C30

arXiv:2312.06814 [pdf, other]

A Flexible Gradient Tracking Algorithmic Framework for Decentralized Optimization

Authors: Albert S. Berahas, Raghu Bollapragada, Shagun Gupta

Abstract: In decentralized optimization over networks, each node in the network has a portion of the global objective function and the aim is to collectively optimize this function. Gradient tracking methods have emerged as a popular alternative for solving such problems due to their strong theoretical guarantees and robust empirical performance. These methods perform two operations (steps) at each iteration: (1) compute local gradients at each node, and (2) communicate local information with neighboring nodes. The complexity of these two steps can vary significantly across applications. In this work, we present a flexible gradient tracking algorithmic framework designed to balance the composition of communication and computation steps over the optimization process using a randomized scheme. The proposed framework is general, unifies gradient tracking methods, and recovers classical gradient tracking methods as special cases. We establish convergence guarantees in expectation and illustrate how the complexity of communication and computation steps can be balanced using the provided flexibility. Finally, we illustrate the performance of the proposed methods on quadratic and logistic regression problems, and compare against popular algorithms from the literature.

Submitted 11 December, 2023; originally announced December 2023.

Comments: 36 pages, 7 figures, 1 table

MSC Class: 49M05; 49M37; 65K05; 90C06; 90C25; 90C30; 90C35

arXiv:2311.08615 [pdf, other]

Non-Uniform Smoothness for Gradient Descent

Authors: Albert S. Berahas, Lindon Roberts, Fred Roosta

Abstract: The analysis of gradient descent-type methods typically relies on the Lipschitz continuity of the objective gradient. This generally requires an expensive hyperparameter tuning process to appropriately calibrate a stepsize for a given problem. In this work we introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradients smoothness condition and is applicable to any twice-differentiable function. We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method and give global and local convergence results. We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima, and thus improve over the lower bound on rates achievable by general (accelerated) first-order methods.

Submitted 14 November, 2023; originally announced November 2023.

MSC Class: 65K05; 90C30

arXiv:2309.02626 [pdf, other]

Adaptive Consensus: A network pruning approach for decentralized optimization

Authors: Suhail M. Shah, Albert S. Berahas, Raghu Bollapragada

Abstract: We consider network-based decentralized optimization problems, where each node in the network possesses a local function and the objective is to collectively attain a consensus solution that minimizes the sum of all the local functions. A major challenge in decentralized optimization is the reliance on communication which remains a considerable bottleneck in many applications. To address this challenge, we propose an adaptive randomized communication-efficient algorithmic framework that reduces the volume of communication by periodically tracking the disagreement error and judiciously selecting the most influential and effective edges at each node for communication. Within this framework, we present two algorithms: Adaptive Consensus (AC) to solve the consensus problem and Adaptive Consensus based Gradient Tracking (AC-GT) to solve smooth strongly convex decentralized optimization problems. We establish strong theoretical convergence guarantees for the proposed algorithms and quantify their performance in terms of various algorithmic parameters under standard assumptions. Finally, numerical experiments showcase the effectiveness of the framework in significantly reducing the information exchange required to achieve a consensus solution.

Submitted 5 September, 2023; originally announced September 2023.

Comments: 35 pages, 3 figures

arXiv:2303.14289 [pdf, other]

Balancing Communication and Computation in Gradient Tracking Algorithms for Decentralized Optimization

Authors: Albert S. Berahas, Raghu Bollapragada, Shagun Gupta

Abstract: Gradient tracking methods have emerged as one of the most popular approaches for solving decentralized optimization problems over networks. In this setting, each node in the network has a portion of the global objective function, and the goal is to collectively optimize this function. At every iteration, gradient tracking methods perform two operations (steps): $(1)$ compute local gradients, and $(2)$ communicate information with local neighbors in the network. The complexity of these two steps varies across different applications. In this paper, we present a framework that unifies gradient tracking methods and is endowed with flexibility with respect to the number of communication and computation steps. We establish unified theoretical convergence results for the algorithmic framework with any composition of communication and computation steps, and quantify the improvements achieved as a result of this flexibility. The framework recovers the results of popular gradient tracking methods as special cases, and allows for a direct comparison of these methods. Finally, we illustrate the performance of the proposed methods on quadratic functions and binary classification problems.

Submitted 24 November, 2023; v1 submitted 24 March, 2023; originally announced March 2023.

Comments: 37 pages, 4 figures, 1 table
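
Illustrative note on the entry above: the two operations named in the abstract (compute local gradients, communicate with neighbors) can be sketched with a textbook gradient tracking recursion. This is a minimal sketch, not the framework proposed in the paper. The three-node network, the mixing matrix `W`, the scalar quadratics `f_i(x) = 0.5 * a[i] * (x - b[i])**2`, and the step size `alpha` are all assumptions chosen for illustration.

```python
# Textbook gradient tracking on a toy 3-node network (all values are assumed).
# Node i holds f_i(x) = 0.5 * a[i] * (x - b[i])**2; the goal is to minimize sum_i f_i.

def grad(i, x, a, b):
    """Local gradient of f_i at x (computation step)."""
    return a[i] * (x - b[i])

def mix(W, v):
    """One round of neighbor averaging with mixing matrix W (communication step)."""
    n = len(v)
    return [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

def gradient_tracking(a, b, W, alpha=0.05, iters=500):
    n = len(a)
    x = [0.0] * n                                # local copies of the decision variable
    g = [grad(i, x[i], a, b) for i in range(n)]  # current local gradients
    y = g[:]                                     # trackers of the average gradient
    for _ in range(iters):
        Wx = mix(W, x)
        x_new = [Wx[i] - alpha * y[i] for i in range(n)]
        g_new = [grad(i, x_new[i], a, b) for i in range(n)]
        Wy = mix(W, y)
        # Tracker update: mix, then add the change in the local gradient.
        y = [Wy[i] + g_new[i] - g[i] for i in range(n)]
        x, g = x_new, g_new
    return x

a = [1.0, 2.0, 3.0]
b = [1.0, -1.0, 2.0]
# Doubly stochastic mixing matrix for a fully connected 3-node graph (assumed).
W = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]

x_final = gradient_tracking(a, b, W)
# Minimizer of the sum of the quadratics, available in closed form here.
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(a)
print(x_final, x_star)
```

Each iteration here performs exactly one communication step and one computation step; the abstract describes a framework in which the number of each can vary.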

arXiv:2301.00477 [pdf, other]

A Sequential Quadratic Programming Method with High Probability Complexity Bounds for Nonlinear Equality Constrained Stochastic Optimization

Authors: Albert S. Berahas, Miaolan Xie, Baoyu Zhou

Abstract: A step-search sequential quadratic programming method is proposed for solving nonlinear equality constrained stochastic optimization problems. It is assumed that constraint function values and derivatives are available, but only stochastic approximations of the objective function and its associated derivatives can be computed via inexact probabilistic zeroth- and first-order oracles. Under reasonable assumptions, a high-probability bound on the iteration complexity of the algorithm to approximate first-order stationarity is derived. Numerical results on standard nonlinear optimization test problems illustrate the advantages and limitations of our proposed method.

Submitted 5 October, 2024; v1 submitted 1 January, 2023; originally announced January 2023.

Comments: 29 pages, 2 figures

arXiv:2210.02418 [pdf, other]

Gradient Descent in the Absence of Global Lipschitz Continuity of the Gradients

Authors: Vivak Patel, Albert S. Berahas

Abstract: Gradient descent (GD) is a collection of continuous optimization methods that have achieved immeasurable success in practice. Owing to data science applications, GD with diminishing step sizes has become a prominent variant. While this variant of GD has been well-studied in the literature for objectives with globally Lipschitz continuous gradients or by requiring bounded iterates, objectives from data science problems do not satisfy such assumptions. Thus, in this work, we provide a novel global convergence analysis of GD with diminishing step sizes for differentiable nonconvex functions whose gradients are only locally Lipschitz continuous. Through our analysis, we generalize what is known about gradient descent with diminishing step sizes including interesting topological facts; and we elucidate the varied behaviors that can occur in the previously overlooked divergence regime. Thus, we provide the most general global convergence analysis of GD with diminishing step sizes under realistic conditions for data science problems.

Submitted 24 June, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: 32 pages, 1 figure, 1 table

MSC Class: 90C26; 68T99; 68W40

arXiv:2206.00712 [pdf, other]

An Adaptive Sampling Sequential Quadratic Programming Method for Equality Constrained Stochastic Optimization

Authors: Albert S. Berahas, Raghu Bollapragada, Baoyu Zhou

Abstract: This paper presents a methodology for using varying sample sizes in sequential quadratic programming (SQP) methods for solving equality constrained stochastic optimization problems. The first part of the paper deals with the delicate issue of dynamic sample selection in the evaluation of the gradient in conjunction with inexact solutions to the SQP subproblems. Under reasonable assumptions on the quality of the employed gradient approximations and the accuracy of the solutions to the SQP subproblems, we establish global convergence results for the proposed method. Motivated by these results, the second part of the paper describes a practical adaptive inexact stochastic sequential quadratic programming (PAIS-SQP) method. We propose criteria for controlling the sample size and the accuracy in the solutions of the SQP subproblems based on estimates of the variance in the stochastic gradient approximations obtained as the optimization progresses. Finally, we demonstrate the performance of the practical method on a subset of the CUTE problems and constrained classification tasks.

Submitted 21 March, 2023; v1 submitted 1 June, 2022; originally announced June 2022.

Comments: 58 pages, 9 figures, 1 table

arXiv:2205.03667 [pdf, ps, other]

First- and Second-Order High Probability Complexity Bounds for Trust-Region Methods with Noisy Oracles

Authors: Liyuan Cao, Albert S. Berahas, Katya Scheinberg

Abstract: In this paper, we present convergence guarantees for a modified trust-region method designed for minimizing objective functions whose value, gradient, and Hessian estimates are computed with noise. These estimates are produced by generic stochastic oracles, which are not assumed to be unbiased or consistent. We introduce these oracles and show that they are more general and have more relaxed assumptions than the stochastic oracles used in prior literature on stochastic trust-region methods. Our method utilizes a relaxed step acceptance criterion and a cautious trust-region radius updating strategy which allows us to derive exponentially decaying tail bounds on the iteration complexity for convergence to points that satisfy approximate first- and second-order optimality conditions. Finally, we present two sets of numerical results. We first explore the tightness of our theoretical results on an example with adversarial zeroth- and first-order oracles. We then investigate the performance of the modified trust-region algorithm on standard noisy derivative-free optimization problems.

Submitted 1 July, 2023; v1 submitted 7 May, 2022; originally announced May 2022.

Comments: 42 pages, 5 figures

arXiv:2204.04161 [pdf, other]

Accelerating Stochastic Sequential Quadratic Programming for Equality Constrained Optimization using Predictive Variance Reduction

Authors: Albert S. Berahas, Jiahao Shi, Zihong Yi, Baoyu Zhou

Abstract: In this paper, we propose a stochastic method for solving equality constrained optimization problems that utilizes predictive variance reduction. Specifically, we develop a method based on the sequential quadratic programming paradigm that employs variance reduction in the gradient approximations. Under reasonable assumptions, we prove that a measure of first-order stationarity evaluated at the iterates generated by our proposed algorithm converges to zero in expectation from arbitrary starting points, for both constant and adaptive step size strategies. Finally, we demonstrate the practical performance of our proposed algorithm on constrained binary classification problems that arise in machine learning.

Submitted 24 March, 2023; v1 submitted 8 April, 2022; originally announced April 2022.

Comments: 42 pages, 5 figures, 4 tables

arXiv:2107.11908 [pdf, other]

Full-low evaluation methods for derivative-free optimization

Authors: Albert S. Berahas, Oumaima Sohab, Luis Nunes Vicente

Abstract: We propose a new class of rigorous methods for derivative-free optimization with the aim of delivering efficient and robust numerical performance for functions of all types, from smooth to non-smooth, and under different noise regimes. To this end, we have developed Full-Low Evaluation methods, organized around two main types of iterations. The first iteration type is expensive in function evaluations, but exhibits good performance in the smooth and non-noisy cases. For the theory, we consider a line search based on an approximate gradient, backtracking until a sufficient decrease condition is satisfied. In practice, the gradient was approximated via finite differences, and the direction was calculated by a quasi-Newton step (BFGS). The second iteration type is cheap in function evaluations, yet more robust in the presence of noise or non-smoothness. For the theory, we consider direct search, and in practice we use probabilistic direct search with one random direction and its negative. A switch condition from Full-Eval to Low-Eval iterations was developed based on the values of the line-search and direct-search stepsizes. If enough Full-Eval steps are taken, we derive a complexity result of gradient-descent type. Under failure of Full-Eval, the Low-Eval iterations become the drivers of convergence yielding non-smooth convergence. Full-Low Evaluation methods are shown to be efficient and robust in practice across problems with different levels of smoothness and noise.

Submitted 28 October, 2022; v1 submitted 25 July, 2021; originally announced July 2021.
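
Illustrative note on the entry above: the abstract describes a line search on an approximate gradient that backtracks until a sufficient decrease condition holds, with the gradient approximated by finite differences. The sketch below shows only that ingredient, under stated assumptions. It uses a steepest descent direction in place of the BFGS direction mentioned in the abstract and omits the Low-Eval direct-search iterations and the switch condition. The test function `f`, the constants `h`, `c`, `shrink`, and `max_backtracks`, and all function names are hypothetical.

```python
# Backtracking (Armijo) line search along a forward-difference gradient.
# Simplified sketch: steepest descent direction instead of BFGS; all constants assumed.

def fd_gradient(f, x, h=1e-6):
    """Forward-difference approximation of the gradient of f at x."""
    fx = f(x)
    g = []
    for i in range(len(x)):
        xp = list(x)
        xp[i] += h
        g.append((f(xp) - fx) / h)
    return g

def backtracking_step(f, x, c=1e-4, shrink=0.5, max_backtracks=30):
    """Take one step; return x unchanged if no step gives sufficient decrease."""
    g = fd_gradient(f, x)
    d = [-gi for gi in g]                        # descent direction
    slope = sum(gi * di for gi, di in zip(g, d))  # approximate directional derivative
    fx = f(x)
    t = 1.0
    for _ in range(max_backtracks):
        trial = [xi + t * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * t * slope:       # sufficient decrease condition
            return trial
        t *= shrink                              # backtrack
    return list(x)

def f(x):
    # Smooth, noise-free test function with minimizer (1, -0.5) (assumed example).
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

x = [3.0, 2.0]
for _ in range(50):
    x = backtracking_step(f, x)
print(x)
```

The cap on backtracks matters near the minimizer, where the finite-difference error can dominate the true gradient and no step length satisfies the decrease test.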

arXiv:2106.13015 [pdf, other]

A Stochastic Sequential Quadratic Optimization Algorithm for Nonlinear Equality Constrained Optimization with Rank-Deficient Jacobians

Authors: Albert S. Berahas, Frank E. Curtis, Michael J. O'Neill, Daniel P. Robinson

Abstract: A sequential quadratic optimization algorithm is proposed for solving smooth nonlinear equality constrained optimization problems in which the objective function is defined by an expectation of a stochastic function. The algorithmic structure of the proposed method is based on a step decomposition strategy that is known in the literature to be widely effective in practice, wherein each search direction is computed as the sum of a normal step (toward linearized feasibility) and a tangential step (toward objective decrease in the null space of the constraint Jacobian). However, the proposed method is unique from others in the literature in that it both allows the use of stochastic objective gradient estimates and possesses convergence guarantees even in the setting in which the constraint Jacobians may be rank deficient. The results of numerical experiments demonstrate that the algorithm offers superior performance when compared to popular alternatives.

Submitted 16 March, 2023; v1 submitted 24 June, 2021; originally announced June 2021.

Report number: Lehigh ISE Technical Report 21T-013-R1

arXiv:2006.03949 [pdf, other]

SONIA: A Symmetric Blockwise Truncated Optimization Algorithm

Authors: Majid Jahani, Mohammadreza Nazari, Rachael Tappenden, Albert S. Berahas, Martin Takáč

Abstract: This work presents a new algorithm for empirical risk minimization. The algorithm bridges the gap between first- and second-order methods by computing a search direction that uses a second-order-type update in one subspace, coupled with a scaled steepest descent step in the orthogonal complement. To this end, partial curvature information is incorporated to help with ill-conditioning, while simultaneously allowing the algorithm to scale to the large problem dimensions often encountered in machine learning applications. Theoretical results are presented to confirm that the algorithm converges to a stationary point in both the strongly convex and nonconvex cases. A stochastic variant of the algorithm is also presented, along with corresponding theoretical guarantees. Numerical results confirm the strengths of the new approach on standard machine learning problems.

Submitted 6 June, 2020; originally announced June 2020.

Comments: 38 pages, 74 figures

arXiv:2006.01892 [pdf, other]

Finite Difference Neural Networks: Fast Prediction of Partial Differential Equations

Authors: Zheng Shi, Nur Sila Gulgec, Albert S. Berahas, Shamim N. Pakzad, Martin Takáč

Abstract: Discovering the underlying behavior of complex systems is an important topic in many science and engineering disciplines. In this paper, we propose a novel neural network framework, finite difference neural networks (FDNet), to learn partial differential equations from data. Specifically, our proposed finite difference inspired network is designed to learn the underlying governing partial differential equations from trajectory data, and to iteratively estimate the future dynamical behavior using only a few trainable parameters. We illustrate the performance (predictive power) of our framework on the heat equation, with and without noise and/or forcing, and compare our results to the Forward Euler method. Moreover, we show the advantages of using a Hessian-Free Trust Region method to train the network.

Submitted 2 June, 2020; originally announced June 2020.

Comments: 38 pages, 48 figures

arXiv:2006.01665 [pdf, other]

doi 10.1109/TSP.2021.3094906

On the Convergence of Nested Decentralized Gradient Methods with Multiple Consensus and Gradient Steps

Authors: Albert S. Berahas, Raghu Bollapragada, Ermin Wei

Abstract: In this paper, we consider minimizing a sum of local convex objective functions in a distributed setting, where the cost of communication and/or computation can be expensive. We extend and generalize the analysis for a class of nested gradient-based distributed algorithms (NEAR-DGD; Berahas, Bollapragada, Keskar and Wei, 2018) to account for multiple gradient steps at every iteration. We show the effect of performing multiple gradient steps on the rate of convergence and on the size of the neighborhood of convergence, and prove R-Linear convergence to the exact solution with a fixed number of gradient steps and increasing number of consensus steps. We test the performance of the generalized method on quadratic functions and show the effect of multiple consensus and gradient steps in terms of iterations, number of gradient evaluations, number of communications and cost.

Submitted 7 July, 2021; v1 submitted 31 May, 2020; originally announced June 2020.

Comments: 12 pages, 4 figures. arXiv admin note: text overlap with arXiv:1903.08149

arXiv:1910.04055 [pdf, other]

Global Convergence Rate Analysis of a Generic Line Search Algorithm with Noise

Authors: Albert S. Berahas, Liyuan Cao, Katya Scheinberg

Abstract: In this paper, we develop convergence analysis of a modified line search method for objective functions whose value is computed with noise and whose gradient estimates are inexact and possibly random. The noise is assumed to be bounded in absolute value without any additional assumptions. We extend the framework based on stochastic methods from [Cartis and Scheinberg, 2018], which was developed to provide analysis of a standard line search method with exact function values and random gradients, to the case of noisy functions. We introduce two alternative conditions on the gradient which, when satisfied with some sufficiently large probability at each iteration, guarantee convergence properties of the line search method. We derive expected complexity bounds to reach a near optimal neighborhood for convex, strongly convex and nonconvex functions. The exact dependence of the convergence neighborhood on the noise is specified.

Submitted 3 March, 2021; v1 submitted 7 October, 2019; originally announced October 2019.

Comments: 30 pages. arXiv admin note: text overlap with arXiv:1905.01332
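
To make the idea of a line search that tolerates bounded function noise concrete, here is a minimal sketch. It is not taken from the paper: it assumes one common modification, in which the Armijo sufficient decrease test is relaxed by twice an assumed noise bound `eps_f`. The names `relaxed_armijo_step` and `noisy_quadratic`, the constants, and the test function are all hypothetical choices made for illustration.

```python
import math


def noisy_quadratic(x, eps_f=1e-3):
    """Hypothetical test objective: ||x||^2 plus a deterministic
    perturbation whose absolute value is bounded by eps_f."""
    return sum(xi * xi for xi in x) + eps_f * math.sin(1000.0 * x[0])


def relaxed_armijo_step(f, x, g, d, eps_f, alpha0=1.0, c=1e-4, tau=0.5,
                        max_backtracks=30):
    """Backtracking line search with an Armijo test relaxed by 2 * eps_f.

    f     : noisy objective, called on a list of floats
    g     : (possibly inexact) gradient estimate at x
    d     : search direction
    eps_f : assumed bound on the absolute noise in f
    Returns (alpha, new_point); alpha is 0.0 if no step was accepted.
    """
    fx = f(x)
    slope = sum(gi * di for gi, di in zip(g, d))  # directional derivative estimate
    alpha = alpha0
    for _ in range(max_backtracks):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        # The slack 2 * eps_f keeps noise of size eps_f in each of the two
        # function values from rejecting a step that truly decreases f.
        if f(trial) <= fx + c * alpha * slope + 2.0 * eps_f:
            return alpha, trial
        alpha *= tau
    return 0.0, list(x)


x0 = [1.0, -2.0]
g0 = [2.0, -4.0]                 # exact gradient of ||x||^2 at x0
d0 = [-gi for gi in g0]          # steepest descent direction
alpha, x1 = relaxed_armijo_step(noisy_quadratic, x0, g0, d0, eps_f=1e-3)
print(alpha, x1)
```

The price of the slack is that a step which increases the true function by up to roughly `4 * eps_f` can also be accepted, which is one informal way to see why guarantees in this setting are stated for a neighborhood of the solution whose size depends on the noise.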

arXiv:1905.13096 [pdf, other]

Scaling Up Quasi-Newton Algorithms: Communication Efficient Distributed SR1

Authors: Majid Jahani, Mohammadreza Nazari, Sergey Rusakov, Albert S. Berahas, Martin Takáč

Abstract: In this paper, we present a scalable distributed implementation of the Sampled Limited-memory Symmetric Rank-1 (S-LSR1) algorithm. First, we show that a naive distributed implementation of S-LSR1 requires multiple rounds of expensive communications at every iteration and thus is inefficient. We then propose DS-LSR1, a communication-efficient variant that: (i) drastically reduces the amount of data communicated at every iteration, (ii) has favorable work-load balancing across nodes, and (iii) is matrix-free and inverse-free. The proposed method scales well in terms of both the dimension of the problem and the number of data points. Finally, we illustrate the empirical performance of DS-LSR1 on a standard neural network training task.

Submitted 13 May, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

Comments: 24 pages, 14 figures, 6 tables

arXiv:1905.13043 [pdf, other]

Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization

Authors: Albert S Berahas, Liyuan Cao, Krzysztof Choromanski, Katya Scheinberg

Abstract: In this paper, we consider derivative free optimization problems, where the objective function is smooth but is computed with some amount of noise, the function evaluations are expensive and no derivative information is available. We are motivated by policy optimization problems in reinforcement learning that have recently become popular [Choromanski et al. 2018; Fazel et al. 2018; Salimans et al. 2016], and that can be formulated as derivative free optimization problems with the aforementioned characteristics. In each of these works some approximation of the gradient is constructed and a (stochastic) gradient method is applied. In [Salimans et al. 2016] the gradient information is aggregated along Gaussian directions, while in [Choromanski et al. 2018] it is computed along orthogonal directions. We provide a convergence rate analysis for a first-order line search method, similar to the ones used in the literature, and derive the conditions on the gradient approximations that ensure this convergence. We then demonstrate via rigorous analysis of the variance and by numerical comparisons on reinforcement learning tasks that the Gaussian sampling method used in [Salimans et al. 2016] is significantly inferior to the orthogonal sampling used in [Choromanski et al. 2018] as well as more general interpolation methods.

Submitted 2 June, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

Comments: 14 pages, 2 figures. arXiv admin note: text overlap with arXiv:1905.01332

arXiv:1905.01332 [pdf, other]

A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization

Authors: Albert S. Berahas, Liyuan Cao, Krzysztof Choromanski, Katya Scheinberg

Abstract: In this paper, we analyze several methods for approximating gradients of noisy functions using only function values. These methods include finite differences, linear interpolation, Gaussian smoothing and smoothing on a sphere. The methods differ in the number of functions sampled, the choice of the sample points, and the way in which the gradient approximations are derived. For each method, we derive bounds on the number of samples and the sampling radius which guarantee favorable convergence properties for a line search or fixed step size descent method. To this end, we use the results in [Berahas et al., 2019] and show how each method can satisfy the sufficient conditions, possibly only with some sufficiently large probability at each iteration, as happens to be the case with Gaussian smoothing and smoothing on a sphere. Finally, we present numerical results evaluating the quality of the gradient approximations as well as their performance in conjunction with a line search derivative-free optimization algorithm.

Submitted 25 March, 2021; v1 submitted 3 May, 2019; originally announced May 2019.

Comments: 42 pages, 7 figures, 4 tables
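
Of the gradient approximations named in the abstract above, forward finite differences are the simplest to write down. The following is a minimal sketch of the textbook scheme, not code from the paper; the function name `forward_difference_gradient`, the test function, and the choice of sampling radius `h` are assumptions made for illustration.

```python
def forward_difference_gradient(f, x, h):
    """Approximate the gradient of f at x using forward differences.

    Uses len(x) + 1 function evaluations: one at x and one at each
    point x + h * e_i, where e_i is the i-th coordinate vector.
    """
    fx = f(x)
    grad = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += h              # sample point x + h * e_i
        grad.append((f(shifted) - fx) / h)
    return grad


def smooth_example(x):
    """Hypothetical noise-free test function with gradient (2*x0, 3)."""
    return x[0] ** 2 + 3.0 * x[1]


g = forward_difference_gradient(smooth_example, [1.0, 2.0], h=1e-6)
print(g)  # close to [2.0, 3.0]
```

When the function values carry noise of size up to some `eps_f`, each difference quotient picks up an additional error of order `eps_f / h`, so `h` cannot be taken arbitrarily small. This is the informal reason why bounds on the sampling radius matter in the noisy setting.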

arXiv:1903.08149 [pdf, other]

Nested Distributed Gradient Methods with Adaptive Quantized Communication

Authors: Albert S. Berahas, Charikleia Iakovidou, Ermin Wei

Abstract: In this paper, we consider minimizing a sum of local convex objective functions in a distributed setting, where communication can be costly. We propose and analyze a class of nested distributed gradient methods with adaptive quantized communication (NEAR-DGD+Q). We show the effect of performing multiple quantized communication steps on the rate of convergence and on the size of the neighborhood of convergence, and prove R-Linear convergence to the exact solution with increasing number of consensus steps and adaptive quantization. We test the performance of the method, as well as some practical variants, on quadratic functions, and show the effects of multiple quantized communication steps in terms of iterations/gradient evaluations, communication and cost.

Submitted 26 August, 2019; v1 submitted 18 March, 2019; originally announced March 2019.

Comments: 9 pages, 2 figures. arXiv admin note: text overlap with arXiv:1709.02999

arXiv:1903.03471 [pdf, other]

Limited-Memory BFGS with Displacement Aggregation

Authors: Albert S. Berahas, Frank E. Curtis, Baoyu Zhou

Abstract: A displacement aggregation strategy is proposed for the curvature pairs stored in a limited-memory BFGS (a.k.a. L-BFGS) method such that the resulting (inverse) Hessian approximations are equal to those that would be derived from a full-memory BFGS method. This means that, if a sufficiently large number of pairs are stored, then an optimization algorithm employing the limited-memory method can achieve the same theoretical convergence properties as when full-memory (inverse) Hessian approximations are stored and employed, such as a local superlinear rate of convergence under assumptions that are common for attaining such guarantees. To the best of our knowledge, this is the first work in which a local superlinear convergence rate guarantee is offered by a quasi-Newton scheme that does not either store all curvature pairs throughout the entire run of the optimization algorithm or store an explicit (inverse) Hessian approximation. Numerical results are presented to show that displacement aggregation within an adaptive L-BFGS scheme can lead to better performance than standard L-BFGS.

Submitted 25 August, 2020; v1 submitted 8 March, 2019; originally announced March 2019.

Report number: Lehigh University ISE/COR@L Technical Report 19T-001

arXiv:1901.09997 [pdf, other]

Quasi-Newton Methods for Machine Learning: Forget the Past, Just Sample

Authors: Albert S. Berahas, Majid Jahani, Peter Richtárik, Martin Takáč

Abstract: We present two sampled quasi-Newton methods (sampled LBFGS and sampled LSR1) for solving empirical risk minimization problems that arise in machine learning. Contrary to the classical variants of these methods that sequentially build Hessian or inverse Hessian approximations as the optimization progresses, our proposed methods sample points randomly around the current iterate at every iteration to produce these approximations. As a result, the approximations constructed make use of more reliable (recent and local) information, and do not depend on past iterate information that could be significantly stale. Our proposed algorithms are efficient in terms of accessed data points (epochs) and have enough concurrency to take advantage of parallel/distributed computing environments. We provide convergence guarantees for our proposed methods. Numerical tests on a toy classification problem as well as on popular benchmarking binary classification and neural network training tasks reveal that the methods outperform their classical variants.

Submitted 27 July, 2021; v1 submitted 28 January, 2019; originally announced January 2019.

Comments: 50 pages, 33 figures

arXiv:1803.10173 [pdf, other]

Derivative-Free Optimization of Noisy Functions via Quasi-Newton Methods

Authors: Albert S. Berahas, Richard H. Byrd, Jorge Nocedal

Abstract: This paper presents a finite difference quasi-Newton method for the minimization of noisy functions. The method takes advantage of the scalability and power of BFGS updating, and employs an adaptive procedure for choosing the differencing interval $h$ based on the noise estimation techniques of Hamming (2012) and Moré and Wild (2011). This noise estimation procedure and the selection of $h$ are inexpensive but not always accurate, and to prevent failures the algorithm incorporates a recovery mechanism that takes appropriate action in the case when the line search procedure is unable to produce an acceptable point. A novel convergence analysis is presented that considers the effect of a noisy line search procedure. Numerical experiments comparing the method to a function interpolating trust region method are presented.

Submitted 8 January, 2019; v1 submitted 27 March, 2018; originally announced March 2018.

Comments: 26 pages, 9 figures

arXiv:1709.02999 [pdf, other]

Balancing Communication and Computation in Distributed Optimization

Authors: Albert S. Berahas, Raghu Bollapragada, Nitish Shirish Keskar, Ermin Wei

Abstract: Methods for distributed optimization have received significant attention in recent years owing to their wide applicability in various domains. A distributed optimization method typically consists of two key components: communication and computation. More specifically, at every iteration (or every several iterations) of a distributed algorithm, each node in the network requires some form of information exchange with its neighboring nodes (communication) and the computation step related to a (sub)-gradient (computation). The standard way of judging an algorithm via only the number of iterations overlooks the complexity associated with each iteration. Moreover, various applications deploying distributed methods may prefer a different composition of communication and computation. Motivated by this discrepancy, in this work we propose an adaptive cost framework which adjusts the cost measure depending on the features of various applications. We present a flexible algorithmic framework, where communication and computation steps are explicitly decomposed to enable algorithm customization for various applications. We apply this framework to the well-known distributed gradient descent (DGD) method, and show that the resulting customized algorithms, which we call DGD$^t$, NEAR-DGD$^t$ and NEAR-DGD$^+$, compare favorably to their base algorithms, both theoretically and empirically. The proposed NEAR-DGD$^+$ algorithm is an exact first-order method where the communication and computation steps are nested, and when the number of communication steps is adaptively increased, the method converges to the optimal solution. We test the performance and illustrate the flexibility of the methods, as well as practical variants, on quadratic functions and classification problems that arise in machine learning, in terms of iterations, gradient evaluations, communications and the proposed cost framework.

Submitted 31 May, 2018; v1 submitted 9 September, 2017; originally announced September 2017.

Comments: 16 pages, 4 figures. Accepted to IEEE Transactions on Automatic Control

arXiv:1707.08552 [pdf, other]

A Robust Multi-Batch L-BFGS Method for Machine Learning

Authors: Albert S. Berahas, Martin Takáč

Abstract: This paper describes an implementation of the L-BFGS method designed to deal with two adversarial situations. The first occurs in distributed computing environments where some of the computational nodes devoted to the evaluation of the function and gradient are unable to return results on time. A similar challenge occurs in a multi-batch approach in which the data points used to compute function and gradients are purposely changed at each iteration to accelerate the learning process. Difficulties arise because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the updating process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, studies the convergence properties for both convex and nonconvex functions, and illustrates the behavior of the algorithm in a distributed computing platform on binary classification logistic regression and neural network training problems that arise in machine learning.

Submitted 27 August, 2019; v1 submitted 26 July, 2017; originally announced July 2017.

Comments: 50 pages, 33 figures. Extension of NIPS 2016 paper: arXiv:1605.06049

arXiv:1705.06211 [pdf, other]

An Investigation of Newton-Sketch and Subsampled Newton Methods

Authors: Albert S. Berahas, Raghu Bollapragada, Jorge Nocedal

Abstract: Sketching, a dimensionality reduction technique, has received much attention in the statistics community. In this paper, we study sketching in the context of Newton's method for solving finite-sum optimization problems in which the number of variables and data points are both large. We study two forms of sketching that perform dimensionality reduction in data space: Hessian subsampling and randomized Hadamard transformations. Each has its own advantages, and their relative tradeoffs have not been investigated in the optimization literature. Our study focuses on practical versions of the two methods in which the resulting linear systems of equations are solved approximately, at every iteration, using an iterative solver. The advantages of using the conjugate gradient method vs. a stochastic gradient iteration are revealed through a set of numerical experiments, and a complexity analysis of the Hessian subsampling method is presented.

Submitted 30 May, 2019; v1 submitted 17 May, 2017; originally announced May 2017.

Comments: 36 pages, 22 figures

arXiv:1605.06049 [pdf, other]

A Multi-Batch L-BFGS Method for Machine Learning

Authors: Albert S. Berahas, Jorge Nocedal, Martin Takáč

Abstract: The question of how to parallelize the stochastic gradient descent (SGD) method has received much attention in the literature. In this paper, we focus instead on batch methods that use a sizeable fraction of the training set at each iteration to facilitate parallelism, and that employ second-order information. In order to improve the learning process, we follow a multi-batch approach in which the batch changes at each iteration. This can cause difficulties because L-BFGS employs gradient differences to update the Hessian approximations, and when these gradients are computed using different data points the process can be unstable. This paper shows how to perform stable quasi-Newton updating in the multi-batch setting, illustrates the behavior of the algorithm in a distributed computing platform, and studies its convergence properties for both the convex and nonconvex cases.

Submitted 23 October, 2016; v1 submitted 19 May, 2016; originally announced May 2016.

Comments: NIPS 2016. 31 pages, 22 figures

arXiv:1511.01169 [pdf, ps, other]

adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

Authors: Nitish Shirish Keskar, Albert S. Berahas

Abstract: Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on several pattern recognition problems. However, the training of RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information and have cheap per-iteration complexity, or attempt to gain significant curvature information at the cost of increased per-iteration cost. The former set includes diagonally-scaled first-order methods such as ADAGRAD and ADAM, while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present adaQN, a stochastic quasi-Newton algorithm for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method uses a novel L-BFGS scaling initialization scheme and is judicious in storing and retaining L-BFGS curvature pairs. We present numerical experiments on two language modeling tasks and show that adaQN is competitive with popular RNN training algorithms.

Submitted 23 February, 2016; v1 submitted 3 November, 2015; originally announced November 2015.

Showing 1–33 of 33 results for author: Berahas, A S