Showing 1–50 of 57 results for author: Gurbuzbalaban, M

  1. arXiv:2502.07977  [pdf, other]

    cs.LG math.OC stat.ML

    RESIST: Resilient Decentralized Learning Using Consensus Gradient Descent

    Authors: Cheng Fang, Rishabh Dixit, Waheed U. Bajwa, Mert Gurbuzbalaban

    Abstract: Empirical risk minimization (ERM) is a cornerstone of modern machine learning (ML), supported by advances in optimization theory that ensure efficient solutions with provable algorithmic convergence rates, which measure the speed at which optimization algorithms approach a solution, and statistical learning rates, which characterize how well the solution generalizes to unseen data. Privacy, memory…

    Submitted 11 February, 2025; originally announced February 2025.

    Comments: preprint of a journal paper; 100 pages and 17 figures

  2. arXiv:2502.00885  [pdf, other]

    stat.ML cs.LG math.OC math.PR

    Algorithmic Stability of Stochastic Gradient Descent with Momentum under Heavy-Tailed Noise

    Authors: Thanh Dang, Melih Barsbey, A K M Rokonuzzaman Sonet, Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu

    Abstract: Understanding the generalization properties of optimization algorithms under heavy-tailed noise has gained growing attention. However, the existing theoretical results mainly focus on stochastic gradient descent (SGD) and the analysis of heavy-tailed optimizers beyond SGD is still missing. In this work, we establish generalization bounds for SGD with momentum (SGDm) under heavy-tailed gradient noi…

    Submitted 2 February, 2025; originally announced February 2025.

    Comments: 64 pages, 2 figures

  3. arXiv:2412.01993  [pdf, other]

    cs.LG math.OC

    Generalized EXTRA stochastic gradient Langevin dynamics

    Authors: Mert Gurbuzbalaban, Mohammad Rafiqul Islam, Xiaoyu Wang, Lingjiong Zhu

    Abstract: Langevin algorithms are popular Markov Chain Monte Carlo methods for Bayesian learning, particularly when the aim is to sample from the posterior distribution of a parametric model, given the input data and the prior distribution over the model parameters. Their stochastic versions such as stochastic gradient Langevin dynamics (SGLD) allow iterative learning based on randomly sampled mini-batches…

    Submitted 2 December, 2024; originally announced December 2024.

  4. arXiv:2405.14130  [pdf, other]

    math.OC

    High-probability complexity guarantees for nonconvex minimax problems

    Authors: Yassine Laguel, Yasa Syed, Necdet Serhat Aybat, Mert Gürbüzbalaban

    Abstract: Stochastic smooth nonconvex minimax problems are prevalent in machine learning, e.g., GAN training, fair classification, and distributionally robust learning. Stochastic gradient descent ascent (GDA)-type methods are popular in practice due to their simplicity and single-loop nature. However, there is a significant gap between the theory and practice regarding high-probability complexity guarantee…

    Submitted 14 November, 2024; v1 submitted 22 May, 2024; originally announced May 2024.

  5. arXiv:2403.07806  [pdf, other]

    math.OC

    A Stochastic GDA Method With Backtracking For Solving Nonconvex (Strongly) Concave Minimax Problems

    Authors: Qiushui Xu, Xuan Zhang, Necdet Serhat Aybat, Mert Gürbüzbalaban

    Abstract: We propose a stochastic GDA (gradient descent ascent) method with backtracking (SGDA-B) to solve nonconvex-(strongly) concave (NCC) minimax problems $\min_x \max_y \sum_{i=1}^N g_i(x_i)+f(x,y)-h(y)$, where $h$ and $g_i$ for $i = 1, \ldots, N$ are closed, convex functions, $f$ is $L$-smooth and $μ$-strongly concave in $y$ for some $μ\geq 0$. We consider two scenarios: (i) the deterministic setting…

    Submitted 12 March, 2024; originally announced March 2024.

  6. arXiv:2403.02051  [pdf, ps, other]

    stat.ML cs.CR cs.LG math.ST

    Privacy of SGD under Gaussian or Heavy-Tailed Noise: Guarantees without Gradient Clipping

    Authors: Umut Şimşekli, Mert Gürbüzbalaban, Sinan Yıldırım, Lingjiong Zhu

    Abstract: The injection of heavy-tailed noise into the iterates of stochastic gradient descent (SGD) has garnered growing interest in recent years due to its theoretical and empirical benefits for optimization and generalization. However, its implications for privacy preservation remain largely unexplored. Aiming to bridge this gap, we provide differential privacy (DP) guarantees for noisy SGD, when the inj…

    Submitted 12 May, 2025; v1 submitted 4 March, 2024; originally announced March 2024.

  7. arXiv:2309.11481  [pdf, other]

    math.OC

    Robustly Stable Accelerated Momentum Methods With A Near-Optimal L2 Gain and $H_\infty$ Performance

    Authors: Mert Gurbuzbalaban

    Abstract: We consider the problem of minimizing a strongly convex smooth function where the gradients are subject to additive worst-case deterministic errors that are square-summable. We study the trade-offs between the convergence rate and robustness to gradient errors when designing the parameters of a first-order algorithm. We focus on a general class of momentum methods (GMM) with constant stepsize and…

    Submitted 20 October, 2023; v1 submitted 20 September, 2023; originally announced September 2023.

  8. arXiv:2307.07030  [pdf, other]

    math.OC cs.LG eess.SY

    Accelerated gradient methods for nonconvex optimization: Escape trajectories from strict saddle points and convergence to local minima

    Authors: Rishabh Dixit, Mert Gurbuzbalaban, Waheed U. Bajwa

    Abstract: This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: 107 pages, 10 figures; pre-print of a journal submission

  9. arXiv:2305.12056  [pdf, ps, other]

    stat.ML cs.LG math.OC

    Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

    Authors: Lingjiong Zhu, Mert Gurbuzbalaban, Anant Raj, Umut Simsekli

    Abstract: Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically requir…

    Submitted 28 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: 49 pages, NeurIPS 2023

  10. arXiv:2304.00444  [pdf, other]

    math.OC

    High Probability and Risk-Averse Guarantees for a Stochastic Accelerated Primal-Dual Method

    Authors: Yassine Laguel, Necdet Serhat Aybat, Mert Gürbüzbalaban

    Abstract: We consider stochastic strongly-convex-strongly-concave (SCSC) saddle point (SP) problems which frequently arise in applications ranging from distributionally robust learning to game theory and fairness in machine learning. We focus on the recently developed stochastic accelerated primal-dual algorithm (SAPD), which admits optimal complexity in several settings as an accelerated algorithm. We prov…

    Submitted 14 July, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

  11. arXiv:2302.05516  [pdf, other]

    stat.ML cs.LG math.OC

    Cyclic and Randomized Stepsizes Invoke Heavier Tails in SGD than Constant Stepsize

    Authors: Mert Gürbüzbalaban, Yuanhan Hu, Umut Şimşekli, Lingjiong Zhu

    Abstract: Cyclic and randomized stepsizes are widely used in deep learning practice and can often outperform standard stepsize choices such as constant stepsize in SGD. Despite their empirical success, not much is currently known about when and why they can theoretically improve the generalization performance. We consider a general class of Markovian stepsizes for learning, which contain i.i.d. random s…

    Submitted 29 August, 2023; v1 submitted 10 February, 2023; originally announced February 2023.

    Comments: To Appear

    Journal ref: Transactions of Machine Learning Research, 2023

  12. arXiv:2301.11885  [pdf, other]

    stat.ML cs.LG

    Algorithmic Stability of Heavy-Tailed SGD with General Loss Functions

    Authors: Anant Raj, Lingjiong Zhu, Mert Gürbüzbalaban, Umut Şimşekli

    Abstract: Heavy-tail phenomena in stochastic gradient descent (SGD) have been reported in several empirical studies. Experimental evidence in previous works suggests a strong interplay between the heaviness of the tails and generalization behavior of SGD. To address these empirical phenomena theoretically, several works have made strong topological and statistical assumptions to link the generalization error…

    Submitted 30 January, 2023; v1 submitted 27 January, 2023; originally announced January 2023.

    Comments: The first two authors contributed equally to this work

  13. arXiv:2301.06619  [pdf, other]

    math.OC math.ST

    Distributionally Robust Learning with Weakly Convex Losses: Convergence Rates and Finite-Sample Guarantees

    Authors: Landi Zhu, Mert Gürbüzbalaban, Andrzej Ruszczyński

    Abstract: We consider a distributionally robust stochastic optimization problem and formulate it as a stochastic two-level composition optimization problem with the use of the mean-semideviation risk measure. In this setting, we consider a single time-scale algorithm, involving two versions of the inner function value tracking: linearized tracking of a continuously differentiable loss function, and SPIDER…

    Submitted 9 June, 2023; v1 submitted 16 January, 2023; originally announced January 2023.

  14. arXiv:2212.00570  [pdf, other]

    stat.ML cs.LG stat.CO

    Penalized Overdamped and Underdamped Langevin Monte Carlo Algorithms for Constrained Sampling

    Authors: Mert Gürbüzbalaban, Yuanhan Hu, Lingjiong Zhu

    Abstract: We consider the constrained sampling problem where the goal is to sample from a target distribution $π(x)\propto e^{-f(x)}$ when $x$ is constrained to lie on a convex body $\mathcal{C}$. Motivated by penalty methods from continuous optimization, we propose penalized Langevin Dynamics (PLD) and penalized underdamped Langevin Monte Carlo (PULMC) methods that convert the constrained sampling problem…

    Submitted 14 April, 2024; v1 submitted 29 November, 2022; originally announced December 2022.

    Journal ref: Journal of Machine Learning Research 2024, Vol. 25, No. 263, 1-67

  15. arXiv:2206.01274  [pdf, other]

    stat.ML cs.LG

    Algorithmic Stability of Heavy-Tailed Stochastic Gradient Descent on Least Squares

    Authors: Anant Raj, Melih Barsbey, Mert Gürbüzbalaban, Lingjiong Zhu, Umut Şimşekli

    Abstract: Recent studies have shown that heavy tails can emerge in stochastic optimization and that the heaviness of the tails has links to the generalization error. While these studies have shed light on interesting aspects of the generalization behavior in modern settings, they relied on strong topological and statistical regularity assumptions, which are hard to verify in practice. Furthermore, it has b…

    Submitted 13 February, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

    Comments: 50 pages

  16. arXiv:2205.15084  [pdf, other]

    math.OC

    SAPD+: An Accelerated Stochastic Method for Nonconvex-Concave Minimax Problems

    Authors: Xuan Zhang, Necdet Serhat Aybat, Mert Gürbüzbalaban

    Abstract: We propose a new stochastic method SAPD+ for solving nonconvex-concave minimax problems of the form $\min\max\mathcal{L}(x,y)=f(x)+Φ(x,y)-g(y)$, where $f,g$ are closed convex and $Φ(x,y)$ is a smooth function that is weakly convex in $x$, (strongly) concave in $y$. Let $δ^2$ denote the variance bound for the unbiased stochastic oracle used within SAPD+ to estimate $\nablaΦ$. When $δ>0$, for both s…

    Submitted 13 October, 2024; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: The complexity bound for SAPD+ with variance reduction is corrected in Theorem 4 and the related discussion in Remark 8 is also updated

    Journal ref: Advances in Neural Information Processing Systems, 35, pp.21668-21681 (2022)

  17. arXiv:2205.06689  [pdf, other]

    stat.ML cs.LG math.OC

    Heavy-Tail Phenomenon in Decentralized SGD

    Authors: Mert Gurbuzbalaban, Yuanhan Hu, Umut Simsekli, Kun Yuan, Lingjiong Zhu

    Abstract: Recent theoretical studies have shown that heavy tails can emerge in stochastic optimization due to 'multiplicative noise', even under surprisingly simple settings, such as linear regression with Gaussian data. While these studies have uncovered several interesting phenomena, they consider conventional stochastic optimization problems, which exclude decentralized settings that naturally arise in m…

    Submitted 16 May, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

    Journal ref: IISE Transactions 2025, Vol. 57, No. 7, 788-802

  18. arXiv:2204.11292  [pdf, other]

    math.OC

    Entropic Risk-Averse Generalized Momentum Methods

    Authors: Bugra Can, Mert Gürbüzbalaban

    Abstract: In the context of first-order algorithms subject to random gradient noise, we study the trade-offs between the convergence rate (which quantifies how fast the initial conditions are forgotten) and the "risk" of suboptimality, i.e. deviations from the expected suboptimality. We focus on a general class of momentum methods (GMM) which recover popular methods such as gradient descent (GD), accelerate…

    Submitted 10 March, 2025; v1 submitted 24 April, 2022; originally announced April 2022.

    MSC Class: 62L20; 90C15; 90C17; 90C25; 90C30; 93C05; 93C10

  19. arXiv:2202.09688  [pdf, other]

    math.OC cs.LG

    A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm

    Authors: Bugra Can, Mert Gurbuzbalaban, Necdet Serhat Aybat

    Abstract: In this work, we consider strongly convex strongly concave (SCSC) saddle point (SP) problems $\min_{x\in\mathbb{R}^{d_x}}\max_{y\in\mathbb{R}^{d_y}}f(x,y)$ where $f$ is $L$-smooth, $f(.,y)$ is $μ$-strongly convex for every $y$, and $f(x,.)$ is $μ$-strongly concave for every $x$. Such problems arise frequently in machine learning in the context of robust empirical risk minimization (ERM), e.g.…

    Submitted 19 February, 2022; originally announced February 2022.

  20. arXiv:2111.12743  [pdf, other]

    math.OC

    Robust Accelerated Primal-Dual Methods for Computing Saddle Points

    Authors: Xuan Zhang, Necdet Serhat Aybat, Mert Gürbüzbalaban

    Abstract: We consider strongly-convex-strongly-concave saddle point problems assuming we have access to unbiased stochastic estimates of the gradients. We propose a stochastic accelerated primal-dual (SAPD) algorithm and show that the SAPD sequence, generated using constant primal-dual step sizes, linearly converges to a neighborhood of the unique saddle point. Interpreting the size of the neighborhood as a mea…

    Submitted 1 September, 2024; v1 submitted 24 November, 2021; originally announced November 2021.

    Comments: Final version of the manuscript. Typos corrected

    Journal ref: SIAM Journal on Optimization, 34(1), pp.1097-1130 (2024)

  21. arXiv:2108.09365  [pdf, other]

    math.OC cs.DC

    L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method

    Authors: Bugra Can, Saeed Soori, Maryam Mehri Dehnavi, Mert Gürbüzbalaban

    Abstract: This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i.e., in every iteration the master node and workers…

    Submitted 4 September, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

    MSC Class: 68W15 (Primary)

  22. arXiv:2106.04881  [pdf, other]

    stat.ML cs.LG

    Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

    Authors: Alexander Camuto, George Deligiannidis, Murat A. Erdogdu, Mert Gürbüzbalaban, Umut Şimşekli, Lingjiong Zhu

    Abstract: Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generaliz…

    Submitted 9 June, 2021; originally announced June 2021.

    Comments: 34 pages including Supplement, 4 Figures

  23. arXiv:2106.03947  [pdf, other]

    cs.LG

    TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

    Authors: Saeed Soori, Bugra Can, Baourun Mu, Mert Gürbüzbalaban, Maryam Mehri Dehnavi

    Abstract: This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with…

    Submitted 3 March, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

  24. arXiv:2102.10346  [pdf, other]

    math.OC stat.ML

    Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

    Authors: Hongjian Wang, Mert Gürbüzbalaban, Lingjiong Zhu, Umut Şimşekli, Murat A. Erdogdu

    Abstract: Recent studies have provided both empirical and theoretical evidence illustrating that heavy tails can emerge in stochastic gradient descent (SGD) in various scenarios. Such heavy tails potentially result in iterates with diverging variance, which hinders the use of conventional convergence analysis techniques that rely on the existence of the second-order moments. In this paper, we provide conver…

    Submitted 20 February, 2021; originally announced February 2021.

  25. arXiv:2102.07006  [pdf, other]

    stat.ML cs.LG

    Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections

    Authors: Alexander Camuto, Xiaoyu Wang, Lingjiong Zhu, Chris Holmes, Mert Gürbüzbalaban, Umut Şimşekli

    Abstract: Gaussian noise injections (GNIs) are a family of simple and widely-used regularisation methods for training neural networks, where one injects additive or multiplicative Gaussian noise to the network activations at every iteration of the optimisation algorithm, which is typically chosen as stochastic gradient descent (SGD). In this paper we focus on the so-called 'implicit effect' of GNIs, which i…

    Submitted 10 June, 2021; v1 submitted 13 February, 2021; originally announced February 2021.

    Comments: Main paper of 12 pages, followed by appendix

  26. arXiv:2101.02625  [pdf, other]

    math.OC cs.LG eess.SY math.DS

    Boundary Conditions for Linear Exit Time Gradient Trajectories Around Saddle Points: Analysis and Algorithm

    Authors: Rishabh Dixit, Mert Gurbuzbalaban, Waheed U. Bajwa

    Abstract: Gradient-related first-order methods have become the workhorse of large-scale numerical optimization problems. Many of these problems involve nonconvex objective functions with multiple saddle points, which necessitates an understanding of the behavior of discrete trajectories of first-order methods within the geometrical landscape of these functions. This paper concerns convergence of first-order…

    Submitted 9 March, 2022; v1 submitted 7 January, 2021; originally announced January 2021.

    Comments: 69 pages; 10 figures; extensive revision of the earlier version, including fewer assumptions, more comparisons with prior art, and new theoretical results

  27. arXiv:2008.01989  [pdf, ps, other]

    cs.LG cs.CR math.OC stat.ML

    Differentially Private Accelerated Optimization Algorithms

    Authors: Nurdan Kuru, Ş. İlker Birbil, Mert Gurbuzbalaban, Sinan Yildirim

    Abstract: We present two classes of differentially private optimization algorithms derived from the well-known accelerated first-order methods. The first algorithm is inspired by Polyak's heavy ball method and employs a smoothing approach to decrease the accumulated noise on the gradient steps required for differential privacy. The second class of algorithms is based on Nesterov's accelerated gradient meth…

    Submitted 5 August, 2020; originally announced August 2020.

    Comments: 28 pages, 4 figures

    MSC Class: 68P27; 90C30; 90C25

    Journal ref: SIAM Journal on Optimization 2022 32:2, 795-821

  28. arXiv:2007.00590  [pdf, other]

    stat.ML cs.LG math.OC

    Decentralized Stochastic Gradient Langevin Dynamics and Hamiltonian Monte Carlo

    Authors: Mert Gürbüzbalaban, Xuefeng Gao, Yuanhan Hu, Lingjiong Zhu

    Abstract: Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Hamiltonian Monte Carlo (SGHMC) are two popular Markov Chain Monte Carlo (MCMC) algorithms for Bayesian inference that can scale to large datasets, allowing one to sample from the posterior distribution of the parameters of a statistical model given the input data and the prior distribution over the model parameters. However, these a…

    Submitted 26 August, 2021; v1 submitted 1 July, 2020; originally announced July 2020.

    MSC Class: Primary: 68W15; 62F15; 65C05; 62D05; 62L20; secondary: 60J20; 90C15

  29. arXiv:2006.06733  [pdf, other]

    math.OC cs.LG

    IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

    Authors: Yossi Arjevani, Joan Bruna, Bugra Can, Mert Gürbüzbalaban, Stefanie Jegelka, Hongzhou Lin

    Abstract: We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA arXiv:140…

    Submitted 11 June, 2020; originally announced June 2020.

  30. arXiv:2006.04873  [pdf, other]

    math.OC cs.LG math.ST

    A Stochastic Subgradient Method for Distributionally Robust Non-Convex Learning

    Authors: Mert Gürbüzbalaban, Andrzej Ruszczyński, Landi Zhu

    Abstract: We consider a distributionally robust formulation of stochastic optimization problems arising in statistical learning, where robustness is with respect to uncertainty in the underlying data distribution. Our formulation builds on risk-averse optimization techniques and the theory of coherent risk measures. It uses semi-deviation risk for quantifying uncertainty, allowing us to compute solutions th…

    Submitted 7 June, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

  31. arXiv:2006.04740  [pdf, other]

    math.OC cs.LG math.ST

    The Heavy-Tail Phenomenon in SGD

    Authors: Mert Gurbuzbalaban, Umut Şimşekli, Lingjiong Zhu

    Abstract: In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the s…

    Submitted 14 June, 2021; v1 submitted 8 June, 2020; originally announced June 2020.

    Journal ref: Published as a conference paper at International Conference on Machine Learning (ICML) 2021

  32. arXiv:2006.01106  [pdf, other]

    math.OC cs.LG eess.SY

    Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points

    Authors: Rishabh Dixit, Mert Gurbuzbalaban, Waheed U. Bajwa

    Abstract: This paper considers the problem of understanding the exit time for trajectories of gradient-related first-order methods from saddle neighborhoods under some initial boundary conditions. Given the 'flat' geometry around saddle points, first-order methods can struggle to escape these regions in a fast manner due to the small magnitudes of gradients encountered. In particular, while it is known that…

    Submitted 6 October, 2023; v1 submitted 1 June, 2020; originally announced June 2020.

    Comments: 70 pages; pre-print of the journal paper published in Information and Inference: A Journal of the IMA, 2023

    MSC Class: 90C26; 15Axx; 41A58; 65Hxx

    Journal ref: Information and Inference: A Journal of the IMA, vol. 12, no. 2, pp. 714-786, Jun. 2023

  33. arXiv:2005.11878  [pdf, other]

    cs.LG math.OC math.ST stat.ML

    Fractional moment-preserving initialization schemes for training deep neural networks

    Authors: Mert Gurbuzbalaban, Yuanhan Hu

    Abstract: A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly for preserving the variance of pre-activations. On the other hand, several studies show that during the training process, the distribution of stochastic gradients can be heavy-tailed especially for small batch sizes. In this case, weights and therefore pre-activations can be modeled wi…

    Submitted 13 February, 2021; v1 submitted 24 May, 2020; originally announced May 2020.

  34. arXiv:2004.02823  [pdf, other]

    math.OC stat.ML

    Non-Convex Optimization via Non-Reversible Stochastic Gradient Langevin Dynamics

    Authors: Yuanhan Hu, Xiaoyu Wang, Xuefeng Gao, Mert Gurbuzbalaban, Lingjiong Zhu

    Abstract: Stochastic Gradient Langevin Dynamics (SGLD) is a powerful algorithm for optimizing a non-convex objective, where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. SGLD is based on the overdamped Langevin diffusion which is reversible in time. By adding an anti-symmetric matrix to the drift term of the overdamped La…

    Submitted 2 June, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

    Comments: 45 pages

  35. arXiv:2002.05685  [pdf, other]

    stat.ML cs.LG

    Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

    Authors: Umut Şimşekli, Lingjiong Zhu, Yee Whye Teh, Mert Gürbüzbalaban

    Abstract: Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study…

    Submitted 4 November, 2020; v1 submitted 13 February, 2020; originally announced February 2020.

    Comments: 20 pages, Published at International Conference on Machine Learning 2020

  36. arXiv:1912.00018  [pdf, other]

    stat.ML cs.LG math.CA

    On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

    Authors: Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, Levent Sagun

    Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Ga…

    Submitted 29 November, 2019; originally announced December 2019.

    Comments: 32 pages. arXiv admin note: substantial text overlap with arXiv:1901.06053

  37. arXiv:1910.08701  [pdf, other]

    math.OC cs.LG stat.ML

    Robust Distributed Accelerated Stochastic Gradient Methods for Multi-Agent Networks

    Authors: Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar, Umut Simsekli, Lingjiong Zhu

    Abstract: We study the distributed stochastic gradient (D-SG) method and its accelerated variant (D-ASG) for solving decentralized strongly convex stochastic optimization problems where the objective function is distributed over several computational units, lying on a fixed but arbitrary connected communication graph, subject to local communication constraints where noisy estimates of the gradients are availabl…

    Submitted 4 October, 2021; v1 submitted 19 October, 2019; originally announced October 2019.

  38. arXiv:1907.13110  [pdf, other]

    math.OC

    Randomized Gossiping with Effective Resistance Weights: Performance Guarantees and Applications

    Authors: Bugra Can, Saeed Soori, Necdet Serhat Aybat, Maryam Mehri Dehnavi, Mert Gurbuzbalaban

    Abstract: The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced when a unit current is injected at one node and extracted from the other, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many applications, e.g., solving linear systems, Markov Chains, and continuous-time…

    Submitted 16 October, 2021; v1 submitted 29 July, 2019; originally announced July 2019.

    MSC Class: 46N10; 47N10

  39. arXiv:1906.09069  [pdf, other]

    stat.ML cs.LG

    First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

    Authors: Thanh Huy Nguyen, Umut Şimşekli, Mert Gürbüzbalaban, Gaël Richard

    Abstract: Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $α$-stable distributions, a family…

    Submitted 21 June, 2019; originally announced June 2019.

  40. arXiv:1906.00506  [pdf, ps, other]

    math.OC

    DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate

    Authors: Saeed Soori, Konstantin Mischenko, Aryan Mokhtari, Maryam Mehri Dehnavi, Mert Gurbuzbalaban

    Abstract: In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficie…

    Submitted 10 June, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

  41. arXiv:1901.08022  [pdf, other]

    math.OC cs.LG stat.ML

    A Universally Optimal Multistage Accelerated Stochastic Gradient Method

    Authors: Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar

    Abstract: We study the problem of minimizing a strongly convex, smooth function when we have noisy estimates of its gradient. We propose a novel multistage accelerated algorithm that is universally optimal in the sense that it achieves the optimal rate both in the deterministic and stochastic case and operates without knowledge of noise characteristics. The algorithm consists of stages that use a stochastic…

    Submitted 27 October, 2019; v1 submitted 23 January, 2019; originally announced January 2019.

    Comments: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

  42. arXiv:1901.07445  [pdf, other]

    stat.ML cs.LG math.OC

    Accelerated Linear Convergence of Stochastic Momentum Methods in Wasserstein Distances

    Authors: Bugra Can, Mert Gurbuzbalaban, Lingjiong Zhu

    Abstract: Momentum methods such as Polyak's heavy ball (HB) method, Nesterov's accelerated gradient (AG) as well as accelerated projected gradient (APG) method have been commonly used in machine learning practice, but their performance is quite sensitive to noise in the gradients. We study these methods under a first-order stochastic oracle model where noisy estimates of the gradients are available. For str…

    Submitted 16 May, 2019; v1 submitted 22 January, 2019; originally announced January 2019.

    Comments: 72 pages

    Journal ref: International Conference on Machine Learning 2019, 891-901

  43. arXiv:1901.06053  [pdf, other]

    cs.LG stat.ML

    A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks

    Authors: Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban

    Abstract: The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussiani…

    Submitted 17 January, 2019; originally announced January 2019.

  44. arXiv:1812.07725  [pdf, other]

    math.OC cs.LG math.NA math.PR stat.ML

    Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization

    Authors: Xuefeng Gao, Mert Gurbuzbalaban, Lingjiong Zhu

    Abstract: Langevin dynamics (LD) has been proven to be a powerful technique for optimizing a non-convex objective as an efficient algorithm to find local minima while eventually visiting a global minimum on longer time-scales. LD is based on the first-order Langevin diffusion which is reversible in time. We study two variants that are based on non-reversible Langevin diffusions: the underdamped Langevin dyn…

    Submitted 2 October, 2020; v1 submitted 18 December, 2018; originally announced December 2018.

    MSC Class: 65K05; 90C26; 90C30; 82C31; 65C30

  45. arXiv:1809.04618  [pdf, other]

    math.OC cs.LG

    Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration

    Authors: Xuefeng Gao, Mert Gürbüzbalaban, Lingjiong Zhu

    Abstract: Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is a variant of stochastic gradient with momentum where a controlled and properly scaled Gaussian noise is added to the stochastic gradients to steer the iterates towards a global minimum. Many works reported its empirical success in practice for solving stochastic non-convex optimization problems, in particular it has been observed to outperform…

    Submitted 17 November, 2020; v1 submitted 12 September, 2018; originally announced September 2018.

  46. arXiv:1805.10579  [pdf, other]

    math.OC cs.LG stat.ML

    Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions

    Authors: Necdet Serhat Aybat, Alireza Fallah, Mert Gurbuzbalaban, Asuman Ozdaglar

    Abstract: We study the trade-offs between convergence rate and robustness to gradient errors in designing a first-order algorithm. We focus on gradient descent (GD) and accelerated gradient (AG) methods for minimizing strongly convex functions when the gradient has random errors in the form of additive white noise. With gradient errors, the function values of the iterates need not converge to the optimal va…

    Submitted 5 November, 2019; v1 submitted 27 May, 2018; originally announced May 2018.

    Comments: To appear in SIAM Journal on Optimization (SIOPT)

  47. arXiv:1803.08200  [pdf, ps, other]

    math.OC

    Randomness and Permutations in Coordinate Descent Methods

    Authors: Mert Gurbuzbalaban, Asuman Ozdaglar, Nuri Denizcan Vanli, Stephen J. Wright

    Abstract: We consider coordinate descent (CD) methods with exact line search on convex quadratic problems. Our main focus is to study the performance of the CD method that uses random permutations in each epoch and compare it to the performance of the CD methods that use deterministic orders and random sampling with replacement. We focus on a class of convex quadratic problems with a diagonally dominant Hess…

    Submitted 21 March, 2018; originally announced March 2018.

  48. arXiv:1710.08883  [pdf, other]

    cs.DC cs.LG math.NA math.OC

    Avoiding Communication in Proximal Methods for Convex Optimization Problems

    Authors: Saeed Soori, Aditya Devarakonda, James Demmel, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

    Abstract: The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration which reduces its performance on modern distributed architectures. The comm…

    Submitted 24 October, 2017; originally announced October 2017.

  49. arXiv:1708.07190  [pdf, other]

    math.OC

    Decentralized Computation of Effective Resistances and Acceleration of Consensus Algorithms

    Authors: Necdet Serhat Aybat, Mert Gurbuzbalaban

    Abstract: The effective resistance between a pair of nodes in a weighted undirected graph is defined as the potential difference induced between them when a unit current is injected at the first node and extracted at the second node, treating edge weights as the conductance values of edges. The effective resistance is a key quantity of interest in many applications and fields including solving linear system…

    Submitted 23 August, 2017; originally announced August 2017.

  50. arXiv:1702.02486  [pdf, other]

    math.OC

    Approximating the Real Structured Stability Radius with Frobenius Norm Bounded Perturbations

    Authors: Nicola Guglielmi, Mert Gurbuzbalaban, Tim Mitchell, Michael Overton

    Abstract: We propose a fast method to approximate the real stability radius of a linear dynamical system with output feedback, where the perturbations are restricted to be real valued and bounded with respect to the Frobenius norm. Our work builds on a number of scalable algorithms that have been proposed in recent years, ranging from methods that approximate the complex or real pseudospectral abscissa and…

    Submitted 8 February, 2017; originally announced February 2017.