Showing 1–50 of 63 results for author: Lee, J D

  1. arXiv:2502.05075  [pdf, other]

    cs.LG math.NA stat.ML

    Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

    Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei

    Abstract: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyz…

    Submitted 22 May, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

    Comments: ICML 2025

  2. arXiv:2412.06388  [pdf, other]

    cs.RO math.OC

    Sparse Identification of Nonlinear Dynamics-based Model Predictive Control for Multirotor Collision Avoidance

    Authors: Jayden Dongwoo Lee, Youngjae Kim, Yoonseong Kim, Hyochoong Bang

    Abstract: This paper proposes a data-driven model predictive control for multirotor collision avoidance considering uncertainty and an unknown model from a payload. To address this challenge, sparse identification of nonlinear dynamics (SINDy) is used to obtain the governing equation of the multirotor system. SINDy can discover the equations of target systems from little data, assuming that few functions h…

    Submitted 9 December, 2024; originally announced December 2024.

  3. arXiv:2411.17668  [pdf, other]

    cs.LG eess.SY math.OC stat.ML

    Anytime Acceleration of Gradient Descent

    Authors: Zihan Zhang, Jason D. Lee, Simon S. Du, Yuxin Chen

    Abstract: This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.119})$ for any stopping time $T$, where the stepsize schedule is predetermined without prior knowledge of the stopping time. This res…

    Submitted 8 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: v2: We improve the convergence rate from $O(T^{-1.03})$ to $O(T^{-1.119})$ through more precise computations

  4. arXiv:2411.17201  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks

    Authors: Hengyu Fu, Zihao Wang, Eshaan Nichani, Jason D. Lee

    Abstract: In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of \textit{multiple nonlinear features} using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \bp$, where $\bp:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents mul…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: 78 pages, 4 figures

  5. arXiv:2410.24206  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Understanding Optimization in Deep Learning with Central Flows

    Authors: Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason D. Lee

    Abstract: Optimization in deep learning remains poorly understood, even in the simple setting of deterministic (i.e. full-batch) training. A key difficulty is that much of an optimizer's behavior is implicitly determined by complex oscillatory dynamics, referred to as the "edge of stability." The main contribution of this paper is to show that an optimizer's implicit behavior can be explicitly captured by a…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: first two authors contributed equally; author order determined by coin flip

  6. arXiv:2406.19617  [pdf, ps, other]

    cs.LG cs.IT math.OC

    Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity

    Authors: Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

    Abstract: Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning. In this work, we consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm only has access to noisy evaluations of the objective function it queries. We provide the first tight characterization for the rate of the mi…

    Submitted 27 June, 2024; originally announced June 2024.

  7. arXiv:2406.08466  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 29 October, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  8. arXiv:2403.03183  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Well Can Transformers Emulate In-context Newton's Method?

    Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

    Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second-order ones for the case of linear regression. In this work, we study whether Transformers can perform higher orde…

    Submitted 5 March, 2024; originally announced March 2024.

  9. arXiv:2402.11867  [pdf, other]

    cs.LG math.OC

    LoRA Training in the NTK Regime has No Spurious Local Minima

    Authors: Uijeong Jang, Jason D. Lee, Ernest K. Ryu

    Abstract: Low-rank adaptation (LoRA) has become the standard approach for parameter-efficient fine-tuning of large language models (LLMs), but our theoretical understanding of LoRA has been limited. In this work, we theoretically analyze LoRA fine-tuning in the neural tangent kernel (NTK) regime with $N$ data points, showing: (i) full fine-tuning (without LoRA) admits a low-rank solution of rank…

    Submitted 28 May, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: 23 pages

  10. arXiv:2312.00854  [pdf, other]

    physics.med-ph cs.AI cs.LG math.NA stat.CO

    A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis

    Authors: John D. Lee, Jakob Richter, Martin R. Pfaller, Jason M. Szafron, Karthik Menon, Andrea Zanoni, Michael R. Ma, Jeffrey A. Feinstein, Jacqueline Kreutzer, Alison L. Marsden, Daniele E. Schiavazzi

    Abstract: The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an…

    Submitted 1 December, 2023; originally announced December 2023.

  11. arXiv:2305.18505  [pdf, ps, other]

    cs.LG cs.AI math.ST stat.ML

    Provable Reward-Agnostic Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

    Abstract: Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In t…

    Submitted 17 April, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 Spotlight

  12. arXiv:2305.17608  [pdf, other]

    cs.LG cs.AI cs.CL math.OC stat.ML

    Reward Collapse in Aligning Large Language Models

    Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

    Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results i…

    Submitted 27 May, 2023; originally announced May 2023.

  13. arXiv:2305.14816  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Provable Offline Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

    Abstract: In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offl…

    Submitted 29 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: The first two authors contribute equally

  14. arXiv:2305.10282  [pdf, ps, other]

    cs.LG cs.IT math.ST stat.ML

    Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

    Authors: Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen

    Abstract: This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic e…

    Submitted 17 May, 2023; originally announced May 2023.

  15. arXiv:2303.03095  [pdf, other]

    cs.GT cs.LG math.OC

    Can We Find Nash Equilibria at a Linear Rate in Markov Games?

    Authors: Zhuoqing Song, Jason D. Lee, Zhuoran Yang

    Abstract: We study decentralized learning in two-player zero-sum discounted Markov games where the goal is to design a policy optimization algorithm for either agent satisfying two properties. First, the player does not need to know the policy of the opponent to update its policy. Second, when both players adopt the algorithm, their joint policy converges to a Nash equilibrium of the game. To this end, we c…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  16. arXiv:2301.11500  [pdf, other]

    cs.LG math.OC stat.ML

    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing

    Authors: Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

    Abstract: It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the gr…

    Submitted 26 January, 2023; originally announced January 2023.

  17. arXiv:2210.06705  [pdf, ps, other]

    cs.LG cs.AI math.OC

    From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

    Authors: Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan

    Abstract: Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is provi…

    Submitted 12 October, 2022; originally announced October 2022.

  18. arXiv:2209.15594  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

    Authors: Alex Damian, Eshaan Nichani, Jason D. Lee

    Abstract: Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen e…

    Submitted 10 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: ICLR 2023, first two authors contributed equally

  19. arXiv:2206.12020  [pdf, ps, other]

    cs.LG math.ST stat.ME stat.ML

    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

    Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

    Abstract: We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as we…

    Submitted 23 June, 2022; originally announced June 2022.

  20. arXiv:2107.02377  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    A Short Note on the Relationship of Information Gain and Eluder Dimension

    Authors: Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic…

    Submitted 6 July, 2021; originally announced July 2021.

  21. arXiv:2106.06530  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Label Noise SGD Provably Prefers Flat Global Minimizers

    Authors: Alex Damian, Tengyu Ma, Jason D. Lee

    Abstract: In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to…

    Submitted 4 December, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: 57 pages, 5 figures, NeurIPS 2021

  22. arXiv:2105.11066  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

    Authors: Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

    Abstract: Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operationa…

    Submitted 10 January, 2023; v1 submitted 23 May, 2021; originally announced May 2021.

  23. arXiv:2103.10897  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Bilinear Classes: A Structural Framework for Provable Generalization in RL

    Authors: Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, Ruosong Wang

    Abstract: This work introduces Bilinear Classes, a new structural framework, which permits generalization in reinforcement learning in a wide variety of settings through the use of function approximation. The framework incorporates nearly all existing models in which a polynomial sample complexity is achievable, and, notably, also includes new models, such as the Linear $Q^*/V^*$ model in which both the opti…

    Submitted 11 July, 2021; v1 submitted 19 March, 2021; originally announced March 2021.

    Comments: Expanded extension section to include generalized linear bellman complete and changed related work

  24. arXiv:2102.08903  [pdf, ps, other]

    cs.LG cs.GT math.OC stat.ML

    Provably Efficient Policy Optimization for Two-Player Zero-Sum Markov Games

    Authors: Yulai Zhao, Yuandong Tian, Jason D. Lee, Simon S. Du

    Abstract: Policy-based methods with function approximation are widely used for solving two-player zero-sum games with large state and/or action spaces. However, it remains elusive how to obtain optimization and statistical guarantees for such algorithms. We present a new policy optimization algorithm with function approximation and prove that under standard regularity conditions on the Markov game and the f…

    Submitted 26 February, 2022; v1 submitted 17 February, 2021; originally announced February 2021.

    Comments: AISTATS 2022

  25. arXiv:2007.01452  [pdf, other]

    stat.ML cs.LG math.OC

    Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

    Authors: Cong Fang, Jason D. Lee, Pengkun Yang, Tong Zhang

    Abstract: This paper proposes a new mean-field framework for over-parameterized deep neural networks (DNNs), which can be used to analyze neural network training. In this framework, a DNN is represented by probability measures and functions over its features (that is, the function values of the hidden units over the training data) in the continuous limit, instead of the neural network parameters as most exi…

    Submitted 2 July, 2020; originally announced July 2020.

  26. arXiv:2006.09486  [pdf, other]

    cs.LG math.OC stat.ML

    Convergence of Meta-Learning with Task-Specific Adaptation over Partial Parameters

    Authors: Kaiyi Ji, Jason D. Lee, Yingbin Liang, H. Vincent Poor

    Abstract: Although model-agnostic meta-learning (MAML) is a very successful algorithm in meta-learning practice, it can have high computational cost because it updates all model parameters over both the inner loop of task-specific adaptation and the outer-loop of meta initialization training. A more efficient algorithm ANIL (which refers to almost no inner loop) was proposed recently by Raghu et al. 2019, w…

    Submitted 22 October, 2020; v1 submitted 16 June, 2020; originally announced June 2020.

    Comments: Accepted by NeurIPS 2020

  27. arXiv:2002.09434  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Few-Shot Learning via Learning the Representation, Provably

    Authors: Simon S. Du, Wei Hu, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: This paper studies few-shot learning via representation learning, where one uses $T$ source tasks with $n_1$ data per task to learn a representation in order to reduce the sample complexity of a target task for which there is only $n_2 (\ll n_1)$ data. Specifically, we focus on the setting where there exists a good \emph{common representation} between source and target, and our goal is to understa…

    Submitted 30 March, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

    Comments: ICLR 2021

  28. arXiv:2002.07125  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Agnostic Q-learning with Function Approximation in Deterministic Systems: Tight Bounds on Approximation Error and Sample Complexity

    Authors: Simon S. Du, Jason D. Lee, Gaurav Mahajan, Ruosong Wang

    Abstract: The current paper studies the problem of agnostic $Q$-learning with function approximation in deterministic systems where the optimal $Q$-function is approximable by a function in the class $\mathcal{F}$ with approximation error $δ\ge 0$. We propose a novel recursion-based algorithm and show that if $δ= O\left(ρ/\sqrt{\dim_E}\right)$, then one can find the optimal policy using…

    Submitted 17 February, 2020; originally announced February 2020.

  29. arXiv:1910.01619  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks

    Authors: Yu Bai, Jason D. Lee

    Abstract: Recent theoretical work has established connections between over-parametrized neural networks and linearized models governed by the Neural Tangent Kernels (NTKs). NTK theory leads to concrete convergence and generalization results, yet the empirical performance of neural networks is observed to exceed their linearized models, suggesting insufficiency of this theory. Towards closing this gap, we…

    Submitted 14 February, 2020; v1 submitted 3 October, 2019; originally announced October 2019.

    Comments: Published at ICLR 2020

  30. arXiv:1907.11687  [pdf, other]

    math.OC cs.IT

    Incremental Methods for Weakly Convex Optimization

    Authors: Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Jason D. Lee

    Abstract: Incremental methods are widely utilized for solving finite-sum optimization problems in machine learning and signal processing. In this paper, we study a family of incremental methods -- including incremental subgradient, incremental proximal point, and incremental prox-linear methods -- for solving weakly convex optimization problems. Such a problem class covers many nonsmooth nonconvex instances…

    Submitted 23 December, 2022; v1 submitted 26 July, 2019; originally announced July 2019.

    Comments: 26 pages

    MSC Class: 68Q25; 65K10; 90C90; 90C26; 90C06

  31. arXiv:1905.10027  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    Neural Temporal-Difference and Q-Learning Provably Converge to Global Optima

    Authors: Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang

    Abstract: Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that ne…

    Submitted 15 April, 2020; v1 submitted 24 May, 2019; originally announced May 2019.

  32. arXiv:1902.08297  [pdf, other]

    math.OC cs.LG stat.ML

    Solving a Class of Non-Convex Min-Max Games Using Iterative First Order Methods

    Authors: Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, Meisam Razaviyayn

    Abstract: Recent applications that arise in machine learning have sparked significant interest in solving min-max saddle point games. This problem has been extensively studied in the convex-concave regime for which a global equilibrium solution can be computed efficiently. In this paper, we study the problem in the non-convex regime and show that an $\varepsilon$-first-order stationary point of the game can b…

    Submitted 30 October, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

  33. arXiv:1812.02878  [pdf, ps, other]

    math.OC cs.GT cs.LG

    Solving Non-Convex Non-Concave Min-Max Games Under Polyak-Łojasiewicz Condition

    Authors: Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee

    Abstract: In this short note, we consider the problem of solving a min-max zero-sum game. This problem has been extensively studied in the convex-concave regime where the global solution can be computed efficiently. Recently, there have also been developments for finding the first order stationary points of the game when one of the players' objectives is concave or (weakly) concave. This work focuses on the…

    Submitted 6 December, 2018; originally announced December 2018.

  34. arXiv:1811.03804  [pdf, ps, other]

    cs.LG cs.AI cs.CV math.OC stat.ML

    Gradient Descent Finds Global Minima of Deep Neural Networks

    Authors: Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

    Abstract: Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architectur…

    Submitted 28 May, 2019; v1 submitted 9 November, 2018; originally announced November 2018.

    Comments: ICML 2019

  35. arXiv:1810.02024  [pdf, other]

    math.OC

    Convergence to Second-Order Stationarity for Constrained Non-Convex Optimization

    Authors: Maher Nouiehed, Jason D. Lee, Meisam Razaviyayn

    Abstract: We consider the problem of finding an approximate second-order stationary point of a constrained non-convex optimization problem. We first show that, unlike the gradient descent method for unconstrained optimization, the vanilla projected gradient descent algorithm may converge to a strict saddle point even when there is only a single linear constraint. We then provide a hardness result by showing…

    Submitted 2 June, 2020; v1 submitted 3 October, 2018; originally announced October 2018.

  36. arXiv:1809.08530  [pdf, ps, other]

    math.OC cs.LG stat.ML

    Provably Correct Automatic Subdifferentiation for Qualified Programs

    Authors: Sham Kakade, Jason D. Lee

    Abstract: The Cheap Gradient Principle (Griewank 2008) --- the computational cost of computing the gradient of a scalar-valued function is nearly the same (often within a factor of $5$) as that of simply computing the function itself --- is of central importance in optimization; it allows us to quickly obtain (high dimensional) gradients of scalar loss functions which are subsequently used in black box grad…

    Submitted 14 January, 2019; v1 submitted 23 September, 2018; originally announced September 2018.

  37. arXiv:1806.00900  [pdf, other]

    cs.LG math.OC stat.ML

    Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced

    Authors: Simon S. Du, Wei Hu, Jason D. Lee

    Abstract: We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We rigorously prove that gradient flow (i.e. gradient descent with infinitesimal step size) effectively enforces the differences between squared norms across differen…

    Submitted 31 October, 2018; v1 submitted 3 June, 2018; originally announced June 2018.

    Comments: In NIPS 2018

  38. arXiv:1804.07795  [pdf, other]

    math.OC cs.LG

    Stochastic subgradient method converges on tame functions

    Authors: Damek Davis, Dmitriy Drusvyatskiy, Sham Kakade, Jason D. Lee

    Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In part…

    Submitted 25 May, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

    Comments: 32 pages, 1 figure

    MSC Class: 65K05; 65K10; 90C15; 90C30

  39. arXiv:1803.01206  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    On the Power of Over-parametrization in Neural Networks with Quadratic Activation

    Authors: Simon S. Du, Jason D. Lee

    Abstract: We provide new theoretical insights on why over-parametrization is effective in learning neural networks. For a $k$ hidden node shallow network with quadratic activation and $n$ training data points, we show that as long as $k \ge \sqrt{2n}$, over-parametrization enables local search algorithms to find a \emph{globally} optimal solution for general smooth and convex loss functions. Further, despite th…

    Submitted 14 June, 2018; v1 submitted 3 March, 2018; originally announced March 2018.

    Comments: Accepted by ICML 2018

  40. arXiv:1802.08941  [pdf, ps, other]

    math.OC cs.IT

    Gradient Primal-Dual Algorithm Converges to Second-Order Stationary Solutions for Nonconvex Distributed Optimization

    Authors: Mingyi Hong, Jason D. Lee, Meisam Razaviyayn

    Abstract: In this work, we study two first-order primal-dual based algorithms, the Gradient Primal-Dual Algorithm (GPDA) and the Gradient Alternating Direction Method of Multipliers (GADMM), for solving a class of linearly constrained non-convex optimization problems. We show that with random initialization of the primal and dual variables, both algorithms are able to compute second-order stationary solutio…

    Submitted 24 February, 2018; originally announced February 2018.

  41. arXiv:1802.08249  [pdf, other]

    cs.LG math.OC stat.ML

    On the Convergence and Robustness of Training GANs with Regularized Optimal Transport

    Authors: Maziar Sanjabi, Jimmy Ba, Meisam Razaviyayn, Jason D. Lee

    Abstract: Generative Adversarial Networks (GANs) are one of the most practical methods for learning data distributions. A popular GAN formulation is based on the use of Wasserstein distance as a metric between probability distributions. Unfortunately, minimizing the Wasserstein distance between the data distribution and the generative model distribution is a computationally challenging problem as its object…

    Submitted 22 May, 2018; v1 submitted 21 February, 2018; originally announced February 2018.

  42. arXiv:1712.07179  [pdf, ps, other]

    math.NT math.CO

    A note on Linnik's Theorem on quadratic non-residues

    Authors: Paul Balister, Béla Bollobás, Jonathan D. Lee, Robert Morris, Oliver Riordan

    Abstract: We present a short, self-contained, and purely combinatorial proof of Linnik's theorem: for any $\varepsilon > 0$ there exists a constant $C_\varepsilon$ such that for any $N$, there are at most $C_\varepsilon$ primes $p \leqslant N$ such that the least positive quadratic non-residue modulo $p$ exceeds $N^\varepsilon$.

    Submitted 19 December, 2017; originally announced December 2017.

    Comments: 6 pages

  43. arXiv:1712.00779  [pdf, other]

    cs.LG cs.AI cs.CV math.OC stat.ML

    Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

    Authors: Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, Aarti Singh

    Abstract: We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_jσ(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs from a teacher network of the…

    Submitted 14 June, 2018; v1 submitted 3 December, 2017; originally announced December 2017.

    Comments: Accepted by ICML 2018

  44. arXiv:1711.00501  [pdf, ps, other]

    cs.LG cs.DS math.OC stat.ML

    Learning One-hidden-layer Neural Networks with Landscape Design

    Authors: Rong Ge, Jason D. Lee, Tengyu Ma

    Abstract: We consider the problem of learning a one-hidden-layer neural network: we assume the input $x\in \mathbb{R}^d$ is from Gaussian distribution and the label $y = a^\top \sigma(Bx) + \xi$, where $a$ is a nonnegative vector in $\mathbb{R}^m$ with $m\le d$, $B\in \mathbb{R}^{m\times d}$ is a full-rank weight matrix, and $\xi$ is a noise vector. We first give an analytic formula for the population risk of the st…

    Submitted 2 November, 2017; v1 submitted 1 November, 2017; originally announced November 2017.

  45. arXiv:1710.07406  [pdf, ps, other]

    stat.ML cs.LG math.OC

    First-order Methods Almost Always Avoid Saddle Points

    Authors: Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, Benjamin Recht

    Abstract: We establish that first-order methods avoid saddle points for almost all initializations. Our results apply to a wide variety of first-order methods, including gradient descent, block coordinate descent, mirror descent and variants thereof. The connecting thread is that such algorithms can be studied from a dynamical systems perspective in which appropriate instantiations of the Stable Manifold Th…

    Submitted 19 October, 2017; originally announced October 2017.

  46. arXiv:1709.06129  [pdf, other]

    cs.LG cs.AI cs.CV math.OC stat.ML

    When is a Convolutional Filter Easy To Learn?

    Authors: Simon S. Du, Jason D. Lee, Yuandong Tian

    Abstract: We analyze the convergence of (stochastic) gradient descent algorithm for learning a convolutional filter with Rectified Linear Unit (ReLU) activation function. Our analysis does not rely on any specific form of the input distribution and our proofs only use the definition of ReLU, in contrast with previous works that are restricted to standard Gaussian input. We show that (stochastic) gradient de…

    Submitted 28 February, 2018; v1 submitted 18 September, 2017; originally announced September 2017.

    Comments: Published as a conference paper at ICLR 2018

  47. arXiv:1708.08552  [pdf, other]

    cs.LG math.NA stat.ML

    An inexact subsampled proximal Newton-type method for large-scale machine learning

    Authors: Xuanqing Liu, Cho-Jui Hsieh, Jason D. Lee, Yuekai Sun

    Abstract: We propose a fast proximal Newton-type algorithm for minimizing regularized finite sums that returns an $\epsilon$-suboptimal point in $\tilde{\mathcal{O}}(d(n + \sqrt{\kappa d})\log(\frac{1}{\epsilon}))$ FLOPS, where $n$ is number of samples, $d$ is feature dimension, and $\kappa$ is the condition number. As long as $n > d$, the proposed method is more efficient than state-of-the-art accelerated stochastic first-order meth…

    Submitted 28 August, 2017; originally announced August 2017.

  48. arXiv:1707.04926  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Theoretical insights into the optimization landscape of over-parameterized shallow neural networks

    Authors: Mahdi Soltanolkotabi, Adel Javanmard, Jason D. Lee

    Abstract: In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime where the number of observations are fewer than the number of parameters in the model. We show that with quadratic activations the optimization landscape of training such shallow neural networks has certain favorable character…

    Submitted 23 August, 2022; v1 submitted 16 July, 2017; originally announced July 2017.

    Comments: A mistake in the argument of Proposition 7.1 in the previous version of this manuscript was fixed

  49. arXiv:1705.10412  [pdf, other]

    math.OC cs.LG stat.ML

    Gradient Descent Can Take Exponential Time to Escape Saddle Points

    Authors: Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh

    Abstract: Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not sl…

    Submitted 5 November, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

    Comments: Accepted by NIPS 2017

  50. arXiv:1704.07971  [pdf, other]

    math.ST cs.LG stat.AP stat.ME stat.ML

    A Flexible Framework for Hypothesis Testing in High-dimensions

    Authors: Adel Javanmard, Jason D. Lee

    Abstract: Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples ($p> n$). In order to make informative inference, we assume that the model is approximately sparse, that is the effect of covariates on the response can be well approximated by conditioning…

    Submitted 21 September, 2019; v1 submitted 26 April, 2017; originally announced April 2017.

    Comments: 45 pages