-
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
Authors:
Hao-Jun Michael Shi,
Tsung-Hsien Lee,
Shintaro Iwasaki,
Jose Gallego-Posada,
Zhijing Li,
Kaushik Rangadurai,
Dheevatsa Mudigere,
Michael Rabbat
Abstract:
Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step wall-clock time compared against standard diagonal-scaling-based adaptive gradient methods. We validate our implementation by performing an ablation study on training ImageNet ResNet50, demonstrating Shampoo's superiority over standard training recipes with minimal hyperparameter tuning.
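As a rough illustration of the Kronecker-factored preconditioner this abstract describes, the sketch below applies the basic Shampoo update to a single matrix-shaped parameter in NumPy. It is not the paper's distributed PyTorch implementation: the function names, the toy objective, the learning rate, and the eigenvalue floor are all assumptions made for the example, and the blocking, DTensor distribution, and AllGather steps are omitted.

```python
import numpy as np


def inverse_root(mat, p, floor=1e-12):
    """Return mat^(-1/p) for a symmetric positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, floor)  # guard against zero eigenvalues (assumed floor)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def shampoo_step(W, G, L, R, lr=0.1):
    """One basic Shampoo update for a matrix parameter W with gradient G."""
    L = L + G @ G.T  # left factor statistics, shape (m, m)
    R = R + G.T @ G  # right factor statistics, shape (n, n)
    W = W - lr * inverse_root(L, 4) @ G @ inverse_root(R, 4)
    return W, L, R


# Hypothetical toy problem: minimize 0.5 * ||W - T||_F^2, whose gradient is W - T.
T = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
W = np.zeros_like(T)
L = np.zeros((4, 4))
R = np.zeros((3, 3))
for _ in range(50):
    W, L, R = shampoo_step(W, W - T, L, R)
final_loss = 0.5 * np.sum((W - T) ** 2)
```

The two small factors `L` and `R` stand in for a full-matrix AdaGrad preconditioner of size (mn) x (mn), which is what makes the approach affordable per parameter.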
Submitted 12 September, 2023;
originally announced September 2023.
-
Adaptive Finite-Difference Interval Estimation for Noisy Derivative-Free Optimization
Authors:
Hao-Jun Michael Shi,
Yuchen Xie,
Melody Qiming Xuan,
Jorge Nocedal
Abstract:
A common approach for minimizing a smooth nonlinear function is to employ finite-difference approximations to the gradient. While this can be easily performed when no error is present within the function evaluations, when the function is noisy, the optimal choice requires information about the noise level and higher-order derivatives of the function, which is often unavailable. Given the noise level of the function, we propose a bisection search for finding a finite-difference interval for any finite-difference scheme that balances the truncation error, which arises from the error in the Taylor series approximation, and the measurement error, which results from noise in the function evaluation. Our procedure produces reliable estimates of the finite-difference interval at low cost without explicitly approximating higher-order derivatives. We show its numerical reliability and accuracy on a set of test problems. When combined with L-BFGS, we obtain a robust method for minimizing noisy black-box functions, as illustrated on a subset of unconstrained CUTEst problems with synthetically added noise.
Submitted 22 March, 2022; v1 submitted 12 October, 2021;
originally announced October 2021.
-
On the Numerical Performance of Derivative-Free Optimization Methods Based on Finite-Difference Approximations
Authors:
Hao-Jun Michael Shi,
Melody Qiming Xuan,
Figen Oztoprak,
Jorge Nocedal
Abstract:
The goal of this paper is to investigate an approach for derivative-free optimization that has not received sufficient attention in the literature and is yet one of the simplest to implement and parallelize. It consists of computing gradients of a smoothed approximation of the objective function (and constraints), and employing them within established codes. These gradient approximations are calculated by finite differences, with a differencing interval determined by the noise level in the functions and a bound on the second or third derivatives. It is assumed that noise level is known or can be estimated by means of difference tables or sampling. The use of finite differences has been largely dismissed in the derivative-free optimization literature as too expensive in terms of function evaluations and/or as impractical when the objective function contains noise. The test results presented in this paper suggest that such views should be re-examined and that the finite-difference approach has much to be recommended. The tests compared NEWUOA, DFO-LS and COBYLA against the finite-difference approach on three classes of problems: general unconstrained problems, nonlinear least squares, and general nonlinear programs with equality constraints.
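To make the idea of a noise-aware differencing interval concrete, here is a minimal sketch of a forward-difference gradient. It assumes the commonly used interval h = 2 * sqrt(eps_f / L), where eps_f bounds the noise in each function evaluation and L bounds the second derivative. The abstract does not state this formula, so treat it, the function names, and the test function as assumptions for illustration only.

```python
import numpy as np


def forward_difference_gradient(f, x, noise_level, second_derivative_bound):
    """Forward-difference gradient with an interval balancing truncation and noise error."""
    x = np.asarray(x, dtype=float)
    # Assumed interval: minimizes the bound L*h/2 + 2*eps_f/h on the derivative error.
    h = 2.0 * np.sqrt(noise_level / second_derivative_bound)
    f0 = f(x)
    grad = np.empty_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f0) / h
    return grad, h


# Hypothetical noisy objective: a quadratic plus bounded uniform noise.
rng = np.random.default_rng(0)
eps_f = 1e-8


def noisy_quadratic(x):
    return float(np.sum(x ** 2) + rng.uniform(-eps_f, eps_f))


x0 = np.array([1.0, -2.0, 0.5])
grad, h = forward_difference_gradient(noisy_quadratic, x0, eps_f, 2.0)
```

With this choice the error in each gradient component is bounded by 2 * sqrt(eps_f * L), so the estimate degrades gracefully as the noise level grows instead of being swamped by it.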
Submitted 19 February, 2021;
originally announced February 2021.
-
A Noise-Tolerant Quasi-Newton Algorithm for Unconstrained Optimization
Authors:
Hao-Jun Michael Shi,
Yuchen Xie,
Richard Byrd,
Jorge Nocedal
Abstract:
This paper describes an extension of the BFGS and L-BFGS methods for the minimization of a nonlinear function subject to errors. This work is motivated by applications that contain computational noise, employ low-precision arithmetic, or are subject to statistical noise. The classical BFGS and L-BFGS methods can fail in such circumstances because the updating procedure can be corrupted and the line search can behave erratically. The proposed method addresses these difficulties and ensures that the BFGS update is stable by employing a lengthening procedure that spaces out the points at which gradient differences are collected. A new line search, designed to tolerate errors, guarantees that the Armijo-Wolfe conditions are satisfied under most reasonable conditions, and works in conjunction with the lengthening procedure. The proposed methods are shown to enjoy convergence guarantees for strongly convex functions. Detailed implementations of the methods are presented, together with encouraging numerical results.
Submitted 8 September, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Search for a moving target in a competitive environment
Authors:
Benoit Duvocelle,
János Flesch,
Hui Min Shi,
Dries Vermeulen
Abstract:
We consider a discrete-time dynamic search game in which a number of players compete to find an invisible object that is moving according to a time-varying Markov chain. We examine the subgame perfect equilibria of these games. The main result of the paper is that the set of subgame perfect equilibria is exactly the set of greedy strategy profiles, i.e. those strategy profiles in which the players always choose an action that maximizes their probability of immediately finding the object. We discuss several variations and extensions of the model.
Submitted 25 August, 2020; v1 submitted 21 August, 2020;
originally announced August 2020.
-
A Progressive Batching L-BFGS Method for Machine Learning
Authors:
Raghu Bollapragada,
Dheevatsa Mudigere,
Jorge Nocedal,
Hao-Jun Michael Shi,
Ping Tak Peter Tang
Abstract:
The standard L-BFGS method relies on gradient approximations that are not dominated by noise, so that search directions are descent directions, the line search is reliable, and quasi-Newton updating yields useful quadratic models of the objective function. All of this appears to call for a full batch approach, but since small batch sizes give rise to faster algorithms with better generalization properties, L-BFGS is currently not considered an algorithm of choice for large-scale machine learning applications. One need not, however, choose between the two extremes represented by the full batch or highly stochastic regimes, and may instead follow a progressive batching approach in which the sample size increases during the course of the optimization. In this paper, we present a new version of the L-BFGS algorithm that combines three basic components - progressive batching, a stochastic line search, and stable quasi-Newton updating - and that performs well on training logistic regression and deep neural networks. We provide supporting convergence theory for the method.
Submitted 30 May, 2018; v1 submitted 14 February, 2018;
originally announced February 2018.
-
A Primer on Coordinate Descent Algorithms
Authors:
Hao-Jun Michael Shi,
Shenyinying Tu,
Yangyang Xu,
Wotao Yin
Abstract:
This monograph presents a class of algorithms called coordinate descent algorithms for mathematicians, statisticians, and engineers outside the field of optimization. This particular class of algorithms has recently gained popularity due to their effectiveness in solving large-scale optimization problems in machine learning, compressed sensing, image processing, and computational statistics. Coordinate descent algorithms solve optimization problems by successively minimizing along each coordinate or coordinate hyperplane, which is ideal for parallelized and distributed computing. Avoiding detailed technicalities and proofs, this monograph gives relevant theory and examples for practitioners to effectively apply coordinate descent to modern problems in data science and engineering.
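The phrase "successively minimizing along each coordinate" can be sketched on a small strongly convex quadratic. The example below is a generic cyclic coordinate descent loop, not code from the monograph; the function name, the matrix `A`, and the vector `b` are assumptions chosen for the illustration.

```python
import numpy as np


def coordinate_descent_quadratic(A, b, sweeps=50):
    """Cyclic coordinate descent on 0.5 * x^T A x - b^T x for symmetric positive definite A."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(sweeps):
        for i in range(b.size):
            # Exact minimizer along coordinate i with all other coordinates held fixed.
            residual = b[i] - A[i] @ x + A[i, i] * x[i]
            x[i] = residual / A[i, i]
    return x


# Hypothetical 2x2 problem; the minimizer solves A x = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = coordinate_descent_quadratic(A, b)
```

Each inner update touches one row of `A` and one entry of `x`, which is why updates of this kind are cheap and lend themselves to parallel and distributed variants.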
Submitted 12 January, 2017; v1 submitted 30 September, 2016;
originally announced October 2016.
-
Practical Algorithms for Learning Near-Isometric Linear Embeddings
Authors:
Jerry Luo,
Kayla Shapiro,
Hao-Jun Michael Shi,
Qi Yang,
Kan Zhu
Abstract:
We propose two practical non-convex approaches for learning near-isometric, linear embeddings of finite sets of data points. Given a set of training points $\mathcal{X}$, we consider the secant set $S(\mathcal{X})$ that consists of all pairwise difference vectors of $\mathcal{X}$, normalized to lie on the unit sphere. The problem can be formulated as finding a symmetric and positive semi-definite matrix $\boldsymbol{\Psi}$ that preserves the norms of all the vectors in $S(\mathcal{X})$ up to a distortion parameter $\delta$. Motivated by non-negative matrix factorization, we reformulate the problem as a Frobenius norm minimization problem, solve it with the Alternating Direction Method of Multipliers (ADMM), and develop an algorithm, FroMax. Another method solves for a projection matrix $\boldsymbol{\Psi}$ by minimizing the restricted isometry property (RIP) directly over the set of symmetric, positive semi-definite matrices. Applying ADMM and a Moreau decomposition on a proximal mapping, we develop another algorithm, NILE-Pro, for dimensionality reduction. FroMax is shown to converge faster for smaller $\delta$ while NILE-Pro converges faster for larger $\delta$. Both non-convex approaches are then empirically demonstrated to be more computationally efficient than prior convex approaches for a number of applications in machine learning and signal processing.
Submitted 22 April, 2016; v1 submitted 1 January, 2016;
originally announced January 2016.
-
Methods for Quantized Compressed Sensing
Authors:
Hao-Jun Michael Shi,
Mindy Case,
Xiaoyi Gu,
Shenyinying Tu,
Deanna Needell
Abstract:
In this paper, we compare and catalog the performance of various greedy quantized compressed sensing algorithms that reconstruct sparse signals from quantized compressed measurements. We also introduce two new greedy approaches for reconstruction: Quantized Compressed Sampling Matching Pursuit (QCoSaMP) and Adaptive Outlier Pursuit for Quantized Iterative Hard Thresholding (AOP-QIHT). We compare the performance of greedy quantized compressed sensing algorithms for a given bit-depth, sparsity, and noise level.
Submitted 30 December, 2015;
originally announced December 2015.