-
Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
Authors:
Si Yi Meng,
Antonio Orvieto,
Daniel Yiming Cao,
Christopher De Sa
Abstract:
We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly-separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable. In fact, the behaviour can be much more complex -- a sequence of period-doubling bifurcations begins at the critical…
▽ More
We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly-separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable. In fact, the behaviour can be much more complex -- a sequence of period-doubling bifurcations begins at the critical step size $2/λ$, where $λ$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if initialized nearby the solution: but does this suffice globally? In one dimension, we show that a step size less than $1/λ$ suffices for global convergence. However, for all step sizes between $1/λ$ and the critical step size $2/λ$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is actually possible even for step sizes less than $1/λ$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
△ Less
Submitted 4 November, 2024; v1 submitted 7 June, 2024;
originally announced June 2024.
-
Inference for Probabilistic Dependency Graphs
Authors:
Oliver E. Richardson,
Joseph Y. Halpern,
Christopher De Sa
Abstract:
Probabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that o…
▽ More
Probabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that of the graphical models they generalize. The key components are: (1) the observation that, in many cases, the distribution a PDG specifies can be formulated as a convex optimization problem (with exponential cone constraints), (2) a construction that allows us to express these problems compactly for PDGs of boundeed treewidth, (3) contributions to the theory of PDGs that justify the construction, and (4) an appeal to interior point methods that can solve such problems in polynomial time. We verify the correctness and complexity of our approach, and provide an implementation of it. We then evaluate our implementation, and demonstrate that it outperforms baseline approaches. Our code is available at http://github.com/orichardson/pdg-infer-uai.
△ Less
Submitted 9 November, 2023;
originally announced November 2023.
-
Coordinating Distributed Example Orders for Provably Accelerated Training
Authors:
A. Feder Cooper,
Wentao Guo,
Khiem Pham,
Tiancheng Yuan,
Charlie F. Ruan,
Yucheng Lu,
Christopher De Sa
Abstract:
Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: whil…
▽ More
Recent research on online Gradient Balancing (GraB) has revealed that there exist permutation-based example orderings for SGD that are guaranteed to outperform random reshuffling (RR). Whereas RR arbitrarily permutes training examples, GraB leverages stale gradients from prior epochs to order examples -- achieving a provably faster convergence rate than RR. However, GraB is limited by design: while it demonstrates an impressive ability to scale-up training on centralized data, it does not naturally extend to modern distributed ML workloads. We therefore propose Coordinated Distributed GraB (CD-GraB), which uses insights from prior work on kernel thinning to translate the benefits of provably faster permutation-based example ordering to distributed settings. With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
△ Less
Submitted 21 December, 2023; v1 submitted 1 February, 2023;
originally announced February 2023.
-
From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent
Authors:
Satyen Kale,
Jason D. Lee,
Chris De Sa,
Ayush Sekhari,
Karthik Sridharan
Abstract:
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is provi…
▽ More
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges. Our main tool to establish this connection is a general converse Lyapunov like theorem, which implies the existence of a Lyapunov potential under mild assumptions on the rates of convergence of GF. In fact, using these potentials, we show a one-to-one correspondence between rates of convergence of GF and geometrical properties of the underlying objective. When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It turns out that these self-bounding assumptions are in a sense also necessary for GD/SGD to work. Using our framework, we provide a unified analysis for GD/SGD not only for classical settings like convex losses, or objectives that satisfy PL / KL properties, but also for more complex problems including Phase Retrieval and Matrix sq-root, and extending the results in the recent work of Chatterjee 2022.
△ Less
Submitted 12 October, 2022;
originally announced October 2022.
-
Equivariant Manifold Flows
Authors:
Isay Katsman,
Aaron Lou,
Derek Lim,
Qingxuan Jiang,
Ser-Nam Lim,
Christopher De Sa
Abstract:
Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries -- a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning…
▽ More
Tractably modelling distributions over manifolds has long been an important goal in the natural sciences. Recent work has focused on developing general machine learning models to learn such distributions. However, for many applications these distributions must respect manifold symmetries -- a trait which most previous models disregard. In this paper, we lay the theoretical foundations for learning symmetry-invariant distributions on arbitrary manifolds via equivariant manifold flows. We demonstrate the utility of our approach by using it to learn gauge invariant densities over $SU(n)$ in the context of quantum field theory.
△ Less
Submitted 27 January, 2022; v1 submitted 18 July, 2021;
originally announced July 2021.
-
Neural Manifold Ordinary Differential Equations
Authors:
Aaron Lou,
Derek Lim,
Isay Katsman,
Leo Huang,
Qingxuan Jiang,
Ser-Nam Lim,
Christopher De Sa
Abstract:
To better conform to data geometry, recent deep generative modelling techniques adapt Euclidean constructions to non-Euclidean spaces. In this paper, we study normalizing flows on manifolds. Previous work has developed flow models for specific cases; however, these advancements hand craft layers on a manifold-by-manifold basis, restricting generality and inducing cumbersome design constraints. We…
▽ More
To better conform to data geometry, recent deep generative modelling techniques adapt Euclidean constructions to non-Euclidean spaces. In this paper, we study normalizing flows on manifolds. Previous work has developed flow models for specific cases; however, these advancements hand craft layers on a manifold-by-manifold basis, restricting generality and inducing cumbersome design constraints. We overcome these issues by introducing Neural Manifold Ordinary Differential Equations, a manifold generalization of Neural ODEs, which enables the construction of Manifold Continuous Normalizing Flows (MCNFs). MCNFs require only local geometry (therefore generalizing to arbitrary manifolds) and compute probabilities with continuous change of variables (allowing for a simple and expressive flow construction). We find that leveraging continuous manifold dynamics produces a marked improvement for both density estimation and downstream tasks.
△ Less
Submitted 17 June, 2020;
originally announced June 2020.
-
Accelerated Stochastic Power Iteration
Authors:
Christopher De Sa,
Bryan He,
Ioannis Mitliagkas,
Christopher Ré,
Peng Xu
Abstract:
Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires $\mathcal O(1/Δ)$ full-data passes to recover the principal component of a matrix with eigen-gap $Δ$. Lanczos, a significantly more complex method, achieves an accelerated rate of $\mathcal O(1/\sqrtΔ)$ passes. Modern applications, however, motivate m…
▽ More
Principal component analysis (PCA) is one of the most powerful tools in machine learning. The simplest method for PCA, the power iteration, requires $\mathcal O(1/Δ)$ full-data passes to recover the principal component of a matrix with eigen-gap $Δ$. Lanczos, a significantly more complex method, achieves an accelerated rate of $\mathcal O(1/\sqrtΔ)$ passes. Modern applications, however, motivate methods that only ingest a subset of available data, known as the stochastic setting. In the online stochastic setting, simple algorithms like Oja's iteration achieve the optimal sample complexity $\mathcal O(σ^2/Δ^2)$. Unfortunately, they are fully sequential, and also require $\mathcal O(σ^2/Δ^2)$ iterations, far from the $\mathcal O(1/\sqrtΔ)$ rate of Lanczos. We propose a simple variant of the power iteration with an added momentum term, that achieves both the optimal sample and iteration complexity. In the full-pass setting, standard analysis shows that momentum achieves the accelerated rate, $\mathcal O(1/\sqrtΔ)$. We demonstrate empirically that naively applying momentum to a stochastic method, does not result in acceleration. We perform a novel, tight variance analysis that reveals the "breaking-point variance" beyond which this acceleration does not occur. By combining this insight with modern variance reduction techniques, we construct stochastic PCA algorithms, for the online and offline setting, that achieve an accelerated iteration complexity $\mathcal O(1/\sqrtΔ)$. Due to the embarassingly parallel nature of our methods, this acceleration translates directly to wall-clock time if deployed in a parallel environment. Our approach is very general, and applies to many non-convex optimization problems that can now be accelerated using the same technique.
△ Less
Submitted 9 July, 2017;
originally announced July 2017.
-
Taming the Wild: A Unified Analysis of Hogwild!-Style Algorithms
Authors:
Christopher De Sa,
Ce Zhang,
Kunle Olukotun,
Christopher Ré
Abstract:
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifica…
▽ More
Stochastic gradient descent (SGD) is a ubiquitous algorithm for a variety of machine learning problems. Researchers and industry have developed several techniques to optimize SGD's runtime performance, including asynchronous execution and reduced precision. Our main result is a martingale-based analysis that enables us to capture the rich noise models that may arise from such techniques. Specifically, we use our new analysis in three ways: (1) we derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem; (2) we analyze asynchronous SGD algorithms for non-convex matrix problems including matrix completion; and (3) we design and analyze an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic. We show experimentally that our algorithms run efficiently for a variety of problems on modern hardware.
△ Less
Submitted 2 October, 2015; v1 submitted 21 June, 2015;
originally announced June 2015.
-
Global Convergence of Stochastic Gradient Descent for Some Non-convex Matrix Problems
Authors:
Christopher De Sa,
Kunle Olukotun,
Christopher Ré
Abstract:
Stochastic gradient descent (SGD) on a low-rank factorization is commonly employed to speed up matrix problems including matrix completion, subspace tracking, and SDP relaxation. In this paper, we exhibit a step size scheme for SGD on a low-rank least-squares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within…
▽ More
Stochastic gradient descent (SGD) on a low-rank factorization is commonly employed to speed up matrix problems including matrix completion, subspace tracking, and SDP relaxation. In this paper, we exhibit a step size scheme for SGD on a low-rank least-squares problem, and we prove that, under broad sampling conditions, our method converges globally from a random starting point within $O(ε^{-1} n \log n)$ steps with constant probability for constant-rank problems. Our modification of SGD relates it to stochastic power iteration. We also show experiments to illustrate the runtime and convergence of the algorithm.
△ Less
Submitted 10 February, 2015; v1 submitted 4 November, 2014;
originally announced November 2014.