Showing 1–50 of 109 results for author: Lee, J D

  1. arXiv:2504.19983  [pdf, other]

    cs.LG stat.ML

    Emergence and scaling laws in SGD learning of shallow neural networks

    Authors: Yunwei Ren, Eshaan Nichani, Denny Wu, Jason D. Lee

    Abstract: We study the complexity of online stochastic gradient descent (SGD) for learning a two-layer neural network with $P$ neurons on isotropic Gaussian data: $f_*(\boldsymbol{x}) = \sum_{p=1}^P a_p\cdot σ(\langle\boldsymbol{x},\boldsymbol{v}_p^*\rangle)$, $\boldsymbol{x} \sim \mathcal{N}(0,\boldsymbol{I}_d)$, where the activation $σ:\mathbb{R}\to\mathbb{R}$ is an even function with information exponent…

    Submitted 28 April, 2025; originally announced April 2025.

    Comments: 100 pages

  2. arXiv:2503.15477  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    What Makes a Reward Model a Good Teacher? An Optimization Perspective

    Authors: Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora

    Abstract: The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. While this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model…

    Submitted 19 March, 2025; originally announced March 2025.

    Comments: Code available at https://github.com/princeton-pli/what-makes-good-rm

  3. arXiv:2502.05075  [pdf, other]

    cs.LG math.NA stat.ML

    Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension

    Authors: Yijun Dong, Yicheng Li, Yunai Li, Jason D. Lee, Qi Lei

    Abstract: Weak-to-strong (W2S) generalization is a type of finetuning (FT) where a strong (large) student model is trained on pseudo-labels generated by a weak teacher. Surprisingly, W2S FT often outperforms the weak teacher. We seek to understand this phenomenon through the observation that FT often occurs in intrinsically low-dimensional spaces. Leveraging the low intrinsic dimensionality of FT, we analyz…

    Submitted 22 May, 2025; v1 submitted 7 February, 2025; originally announced February 2025.

    Comments: ICML 2025

  4. arXiv:2412.06538  [pdf, other]

    cs.LG cs.CL cs.IT stat.ML

    Understanding Factual Recall in Transformers via Associative Memories

    Authors: Eshaan Nichani, Jason D. Lee, Alberto Bietti

    Abstract: Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the s…

    Submitted 9 December, 2024; originally announced December 2024.

  5. arXiv:2411.17668  [pdf, other]

    cs.LG eess.SY math.OC stat.ML

    Anytime Acceleration of Gradient Descent

    Authors: Zihan Zhang, Jason D. Lee, Simon S. Du, Yuxin Chen

    Abstract: This work investigates stepsize-based acceleration of gradient descent with {\em anytime} convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O(T^{-1.119})$ for any stopping time $T$, where the stepsize schedule is predetermined without prior knowledge of the stopping time. This res…

    Submitted 8 December, 2024; v1 submitted 26 November, 2024; originally announced November 2024.

    Comments: v2: We improve the convergence rate from $O(T^{-1.03})$ to $O(T^{-1.119})$ through more precise computations

  6. arXiv:2411.17201  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Learning Hierarchical Polynomials of Multiple Nonlinear Features with Three-Layer Networks

    Authors: Hengyu Fu, Zihao Wang, Eshaan Nichani, Jason D. Lee

    Abstract: In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of \textit{multiple nonlinear features} using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \mathbf{p}$, where $\mathbf{p}:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents mul…

    Submitted 26 November, 2024; originally announced November 2024.

    Comments: 78 pages, 4 figures

  7. arXiv:2410.24206  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    Understanding Optimization in Deep Learning with Central Flows

    Authors: Jeremy M. Cohen, Alex Damian, Ameet Talwalkar, Zico Kolter, Jason D. Lee

    Abstract: Optimization in deep learning remains poorly understood, even in the simple setting of deterministic (i.e. full-batch) training. A key difficulty is that much of an optimizer's behavior is implicitly determined by complex oscillatory dynamics, referred to as the "edge of stability." The main contribution of this paper is to show that an optimizer's implicit behavior can be explicitly captured by a…

    Submitted 31 October, 2024; originally announced October 2024.

    Comments: first two authors contributed equally; author order determined by coin flip

  8. arXiv:2410.09678  [pdf, ps, other]

    cs.LG stat.ML

    Learning Orthogonal Multi-Index Models: A Fine-Grained Information Exponent Analysis

    Authors: Yunwei Ren, Jason D. Lee

    Abstract: The information exponent (Ben Arous et al. [2021]) -- which is equivalent to the lowest degree in the Hermite expansion of the link function for Gaussian single-index models -- has played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowes…

    Submitted 12 October, 2024; originally announced October 2024.

  9. arXiv:2406.08466  [pdf, other]

    cs.LG cs.AI math.ST stat.ML

    Scaling Laws in Linear Regression: Compute, Parameters, and Data

    Authors: Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

    Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, wh…

    Submitted 29 October, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

  10. arXiv:2406.06893  [pdf, other]

    stat.ML cs.IT cs.LG

    Transformers Provably Learn Sparse Token Selection While Fully-Connected Nets Cannot

    Authors: Zixuan Wang, Stanley Wei, Daniel Hsu, Jason D. Lee

    Abstract: The transformer architecture has prevailed in various deep learning settings due to its exceptional capabilities to select and compose structural information. Motivated by these capabilities, Sanford et al. proposed the sparse token selection task, in which transformers excel while fully-connected networks (FCNs) fail in the worst case. Building upon that, we strengthen the FCN lower bound to an a…

    Submitted 10 June, 2024; originally announced June 2024.

  11. arXiv:2406.01581  [pdf, other]

    cs.LG stat.ML

    Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit

    Authors: Jason D. Lee, Kazusato Oko, Taiji Suzuki, Denny Wu

    Abstract: We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \textstyle σ_*\left(\langle\boldsymbol{x},\boldsymbol{θ}\rangle\right)$ under isotropic Gaussian data in $\mathbb{R}^d$, where the unknown link function $σ_*:\mathbb{R}\to\mathbb{R}$ has information exponent $p$ (defined as the lowest degree in the Hermite expansion). Prior works showed that gra…

    Submitted 22 December, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

    Comments: NeurIPS 2024

  12. arXiv:2403.05529  [pdf, other]

    cs.LG stat.ML

    Computational-Statistical Gaps in Gaussian Single-Index Models

    Authors: Alex Damian, Loucas Pillaud-Vivien, Jason D. Lee, Joan Bruna

    Abstract: Single-Index Models are high-dimensional regression problems with planted structure, whereby labels depend on an unknown one-dimensional projection of the input via a generic, non-linear, and potentially non-deterministic transformation. As such, they encompass a broad class of statistical inference tasks, and provide a rich template to study statistical and computational trade-offs in the high-di…

    Submitted 12 March, 2024; v1 submitted 8 March, 2024; originally announced March 2024.

    Comments: 61 pages

  13. arXiv:2403.03183  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    How Well Can Transformers Emulate In-context Newton's Method?

    Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee

    Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning and even second order ones for the case of linear regression. In this work, we study whether Transformers can perform higher orde…

    Submitted 5 March, 2024; originally announced March 2024.

  14. arXiv:2402.14735  [pdf, other]

    cs.LG cs.IT stat.ML

    How Transformers Learn Causal Structure with Gradient Descent

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structur…

    Submitted 13 August, 2024; v1 submitted 22 February, 2024; originally announced February 2024.

    Comments: v2: ICML 2024 camera ready

  15. arXiv:2312.07930  [pdf, other]

    cs.LG cs.CL cs.CR cs.IT stat.ML

    Towards Optimal Statistical Watermarking

    Authors: Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao

    Abstract: We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error. We characterize the…

    Submitted 6 February, 2024; v1 submitted 13 December, 2023; originally announced December 2023.

  16. arXiv:2312.05134  [pdf, other]

    cs.LG stat.ML

    Optimal Multi-Distribution Learning

    Authors: Zihan Zhang, Wenhao Zhan, Yuxin Chen, Simon S. Du, Jason D. Lee

    Abstract: Multi-distribution learning (MDL), which seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions, has emerged as a unified framework in response to the evolving demand for robustness, fairness, multi-group collaboration, etc. Achieving data-efficient MDL necessitates adaptive sampling, also called on-demand sampling, throughout the learning process.…

    Submitted 23 May, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

  17. arXiv:2312.00854  [pdf, other]

    physics.med-ph cs.AI cs.LG math.NA stat.CO

    A Probabilistic Neural Twin for Treatment Planning in Peripheral Pulmonary Artery Stenosis

    Authors: John D. Lee, Jakob Richter, Martin R. Pfaller, Jason M. Szafron, Karthik Menon, Andrea Zanoni, Michael R. Ma, Jeffrey A. Feinstein, Jacqueline Kreutzer, Alison L. Marsden, Daniele E. Schiavazzi

    Abstract: The substantial computational cost of high-fidelity models in numerical hemodynamics has, so far, relegated their use mainly to offline treatment planning. New breakthroughs in data-driven architectures and optimization techniques for fast surrogate modeling provide an exciting opportunity to overcome these limitations, enabling the use of such technology for time-critical decisions. We discuss an…

    Submitted 1 December, 2023; originally announced December 2023.

  18. arXiv:2311.13774  [pdf, other]

    cs.LG stat.ML

    Learning Hierarchical Polynomials with Three-Layer Neural Networks

    Authors: Zihao Wang, Eshaan Nichani, Jason D. Lee

    Abstract: We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index mod…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 57 pages

  19. arXiv:2311.11965  [pdf, other]

    cs.LG stat.ML

    Provably Efficient CVaR RL in Low-rank MDPs

    Authors: Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee

    Abstract: We study risk-sensitive Reinforcement Learning (RL), where we aim to maximize the Conditional Value at Risk (CVaR) with a fixed risk tolerance $τ$. Prior theoretical work studying risk-sensitive RL focuses on the tabular Markov Decision Processes (MDPs) setting. To extend CVaR RL to settings where state space is large, function approximation must be deployed. We study CVaR RL in low-rank MDPs with…

    Submitted 20 November, 2023; originally announced November 2023.

    Comments: The first three authors contribute equally and are ordered randomly

  20. arXiv:2306.12383  [pdf, ps, other]

    cs.LG stat.ML

    Sample Complexity for Quadratic Bandits: Hessian Dependent Bounds and Optimal Algorithms

    Authors: Qian Yu, Yining Wang, Baihe Huang, Qi Lei, Jason D. Lee

    Abstract: In stochastic zeroth-order optimization, a problem of practical relevance is understanding how to fully exploit the local geometry of the underlying objective function. We consider a fundamental setting in which the objective function is quadratic, and provide the first tight characterization of the optimal Hessian-dependent sample complexity. Our contribution is twofold. First, from an informatio…

    Submitted 25 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

  21. arXiv:2305.18505  [pdf, ps, other]

    cs.LG cs.AI math.ST stat.ML

    Provable Reward-Agnostic Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

    Abstract: Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories, rather than explicit reward signals. While PbRL has demonstrated practical success in fine-tuning language models, existing theoretical work focuses on regret minimization and fails to capture most of the practical frameworks. In t…

    Submitted 17 April, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

    Comments: ICLR 2024 Spotlight

  22. arXiv:2305.17608  [pdf, other]

    cs.LG cs.AI cs.CL math.OC stat.ML

    Reward Collapse in Aligning Large Language Models

    Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

    Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results i…

    Submitted 27 May, 2023; originally announced May 2023.

  23. arXiv:2305.14816  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Provable Offline Preference-Based Reinforcement Learning

    Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

    Abstract: In this paper, we investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback where feedback is available in the form of preference between trajectory pairs rather than explicit rewards. Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offl…

    Submitted 29 September, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: The first two authors contribute equally

  24. arXiv:2305.11788  [pdf, other]

    cs.LG stat.ML

    Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability

    Authors: Jingfeng Wu, Vladimir Braverman, Jason D. Lee

    Abstract: Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen, et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS…

    Submitted 15 October, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023 camera ready version

  25. arXiv:2305.10633  [pdf, other]

    cs.LG cs.IT stat.ML

    Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models

    Authors: Alex Damian, Eshaan Nichani, Rong Ge, Jason D. Lee

    Abstract: We focus on the task of learning a single index model $σ(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $σ$, which is defined as the index of the first nonzero Hermite coefficient of $σ$. Ben Arous et al. (2021) showe…

    Submitted 17 May, 2023; originally announced May 2023.

  26. arXiv:2305.10282  [pdf, ps, other]

    cs.LG cs.IT math.ST stat.ML

    Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

    Authors: Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen

    Abstract: This paper studies tabular reinforcement learning (RL) in the hybrid setting, which assumes access to both an offline dataset and online interactions with the unknown environment. A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset and enable effective policy fine-tuning. Leveraging recent advances in reward-agnostic e…

    Submitted 17 May, 2023; originally announced May 2023.

  27. arXiv:2305.06986  [pdf, other]

    cs.LG stat.ML

    Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuning. However, this feature learning process remains poorly understood from a theoretical per…

    Submitted 1 April, 2025; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: v3: Improved sample complexity and width dependence (see comment on page 1)

  28. arXiv:2305.04819  [pdf, other]

    cs.LG cs.GT cs.MA stat.ML

    Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning

    Authors: Yulai Zhao, Zhuoran Yang, Zhaoran Wang, Jason D. Lee

    Abstract: Policy optimization methods with function approximation are widely used in multi-agent reinforcement learning. However, it remains elusive how to design such algorithms with statistical guarantees. Leveraging a multi-agent performance difference lemma that characterizes the landscape of multi-agent policy optimization, we find that the localized action value function serves as an ideal descent dir…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: ICML 2023

  29. arXiv:2302.04753  [pdf, other]

    cs.LG stat.ML

    Efficient displacement convex optimization with particle gradient descent

    Authors: Hadi Daneshmand, Jason D. Lee, Chi Jin

    Abstract: Particle gradient descent, which uses particles to represent a probability measure and performs gradient descent on particles in parallel, is widely used to optimize functions of probability measures. This paper considers particle gradient descent with a finite number of particles and establishes its theoretical guarantees to optimize functions that are \emph{displacement convex} in measures. Conc…

    Submitted 9 February, 2023; originally announced February 2023.

  30. arXiv:2302.02392  [pdf, ps, other]

    cs.LG stat.ML

    Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

    Authors: Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

    Abstract: In offline reinforcement learning (RL) we have no opportunity to explore so we must make assumptions that the data is sufficient to guide picking a good policy, taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage…

    Submitted 13 November, 2023; v1 submitted 5 February, 2023; originally announced February 2023.

    Comments: The original title of this paper was "Refined Value-Based Offline RL under Realizability and Partial Coverage," but it was later changed. This paper has been accepted for NeurIPS 2023

  31. arXiv:2301.11500  [pdf, other]

    cs.LG math.OC stat.ML

    Understanding Incremental Learning of Gradient Descent: A Fine-grained Analysis of Matrix Sensing

    Authors: Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S. Du, Jason D. Lee

    Abstract: It is believed that Gradient Descent (GD) induces an implicit bias towards good generalization in training machine learning models. This paper provides a fine-grained analysis of the dynamics of GD for the matrix sensing problem, whose goal is to recover a low-rank ground-truth matrix from near-isotropic linear measurements. It is shown that GD with small initialization behaves similarly to the gr…

    Submitted 26 January, 2023; originally announced January 2023.

  32. arXiv:2212.03714  [pdf, other]

    cs.LG cs.CR stat.ML

    Reconstructing Training Data from Model Gradient, Provably

    Authors: Zihan Wang, Jason D. Lee, Qi Lei

    Abstract: Understanding when and how much a model gradient leaks information about the training sample is an important question in privacy. In this paper, we present a surprising result: even without training or memorizing the data, we can fully reconstruct the training samples from a single gradient query at a randomly chosen parameter value. We prove the identifiability of the training data under mild con…

    Submitted 10 June, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

  33. arXiv:2209.15594  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

    Authors: Alex Damian, Eshaan Nichani, Jason D. Lee

    Abstract: Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen e…

    Submitted 10 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: ICLR 2023, first two authors contributed equally

  34. arXiv:2206.15144  [pdf, other]

    cs.LG cs.IT stat.ML

    Neural Networks can Learn Representations with Gradient Descent

    Authors: Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

    Abstract: Significant theoretical work has established that in specific regimes, neural networks trained by gradient descent behave like kernel methods. However, in practice, it is known that neural networks strongly outperform their associated kernels. In this work, we explain this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be…

    Submitted 30 June, 2022; originally announced June 2022.

    Comments: COLT 2022

  35. arXiv:2206.12081  [pdf, other]

    cs.LG stat.ME stat.ML

    Computationally Efficient PAC RL in POMDPs with Latent Determinism and Conditional Embeddings

    Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

    Abstract: We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) where the state space and observation space are large or even continuous. Particularly, we consider Hilbert space embeddings of POMDP where the feature of latent states and the feature of observations admit a conditional Hilbert space embedding of the observation emis…

    Submitted 24 June, 2022; originally announced June 2022.

  36. arXiv:2206.12020  [pdf, ps, other]

    cs.LG math.ST stat.ME stat.ML

    Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

    Authors: Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun

    Abstract: We study Reinforcement Learning for partially observable dynamical systems using function approximation. We propose a new \textit{Partially Observable Bilinear Actor-Critic framework}, that is general enough to include models such as observable tabular Partially Observable Markov Decision Processes (POMDPs), observable Linear-Quadratic-Gaussian (LQG), Predictive State Representations (PSRs), as we…

    Submitted 23 June, 2022; originally announced June 2022.

  37. arXiv:2206.03688  [pdf, other]

    cs.LG stat.ML

    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

    Authors: Eshaan Nichani, Yu Bai, Jason D. Lee

    Abstract: A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning…

    Submitted 26 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: v2: NeurIPS 2022 camera ready version

  38. arXiv:2206.01588  [pdf, ps, other]

    cs.LG stat.ML

    Decentralized Optimistic Hyperpolicy Mirror Descent: Provably No-Regret Learning in Markov Games

    Authors: Wenhao Zhan, Jason D. Lee, Zhuoran Yang

    Abstract: We study decentralized policy learning in Markov games where we control a single agent to play with nonstationary and possibly adversarial opponents. Our goal is to develop a no-regret online learning algorithm that (i) takes actions based on the local information observed by the agent and (ii) is able to find the best policy in hindsight. For such a problem, the nonstationary state transitions du…

    Submitted 3 June, 2022; originally announced June 2022.

  39. arXiv:2205.09072  [pdf, ps, other]

    cs.LG stat.ML

    On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias

    Authors: Itay Safran, Gal Vardi, Jason D. Lee

    Abstract: We study the dynamics and implicit bias of gradient flow (GF) on univariate ReLU neural networks with a single hidden layer in a binary classification setting. We show that when the labels are determined by the sign of a target network with $r$ neurons, with high probability over the initialization of the network and the sampling of the dataset, GF converges in direction (suitably defined) to a ne…

    Submitted 2 February, 2023; v1 submitted 18 May, 2022; originally announced May 2022.

  40. arXiv:2203.15664  [pdf, other]

    cs.LG stat.ML

    Nearly Minimax Algorithms for Linear Bandits with Shared Representation

    Authors: Jiaqi Yang, Qi Lei, Jason D. Lee, Simon S. Du

    Abstract: We give novel algorithms for multi-task and lifelong linear bandits with shared representation. Specifically, we consider the setting where we play $M$ linear bandits with dimension $d$, each for $T$ rounds, and these $M$ bandit tasks share a common $k(\ll d)$ dimensional linear representation. For both the multi-task setting where we play the tasks concurrently, and the lifelong setting where we…

    Submitted 29 March, 2022; originally announced March 2022.

    Comments: 19 pages, 3 figures

  41. arXiv:2202.04634  [pdf, ps, other]

    cs.LG stat.ML

    Offline Reinforcement Learning with Realizability and Single-policy Concentrability

    Authors: Wenhao Zhan, Baihe Huang, Audrey Huang, Nan Jiang, Jason D. Lee

    Abstract: Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As a…

    Submitted 27 June, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

  42. arXiv:2110.09507  [pdf, other]

    cs.LG stat.ML

    Provable Hierarchy-Based Meta-Reinforcement Learning

    Authors: Kurtland Chua, Qi Lei, Jason D. Lee

    Abstract: Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing work either assumes access to expert-constructed hierarchies, or uses hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure duri…

    Submitted 18 October, 2021; originally announced October 2021.

  43. arXiv:2107.14702  [pdf, ps, other]

    cs.GT cs.LG stat.ML

    Towards General Function Approximation in Zero-Sum Markov Games

    Authors: Baihe Huang, Jason D. Lee, Zhaoran Wang, Zhuoran Yang

    Abstract: This paper considers two-player zero-sum finite-horizon Markov games with simultaneous moves. The study focuses on the challenging settings where the value function or the model is parameterized by general function classes. Provably efficient algorithms for both decoupled and coordinated settings are developed. In the decoupled setting where the agent controls a single player and plays against…

    Submitted 30 October, 2021; v1 submitted 30 July, 2021; originally announced July 2021.

  44. arXiv:2107.06466  [pdf, other]

    cs.LG stat.ML

    Going Beyond Linear RL: Sample Efficient Neural Function Approximation

    Authors: Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

    Abstract: Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with…

    Submitted 25 December, 2021; v1 submitted 13 July, 2021; originally announced July 2021.

  45. arXiv:2107.04518  [pdf, ps, other]

    cs.LG stat.ML

    Optimal Gradient-based Algorithms for Non-concave Bandit Optimization

    Authors: Baihe Huang, Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei, Runzhe Wang, Jiaqi Yang

    Abstract: Bandit problems with linear or concave reward have been extensively studied, but relatively few works have studied bandits with non-concave reward. This work considers a large family of bandit problems where the unknown underlying reward function is non-concave, including the low-rank generalized linear bandit problems and two-layer neural network with polynomial activation bandit problem. For the…

    Submitted 9 July, 2021; originally announced July 2021.

  46. arXiv:2107.02377  [pdf, ps, other]

    cs.LG cs.AI math.OC stat.ML

    A Short Note on the Relationship of Information Gain and Eluder Dimension

    Authors: Kaixuan Huang, Sham M. Kakade, Jason D. Lee, Qi Lei

    Abstract: Eluder dimension and information gain are two widely used complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic…

    Submitted 6 July, 2021; originally announced July 2021.

  47. arXiv:2106.12108  [pdf, other]

    cs.LG stat.ML

    Near-Optimal Linear Regression under Distribution Shift

    Authors: Qi Lei, Wei Hu, Jason D. Lee

    Abstract: Transfer learning is essential when sufficient data comes from the source domain, with scarce labeled data from the target domain. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different transfer learning settings including covariate shift and model shift. We also consider when data are generated from either lin…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: ICML 2021

  48. arXiv:2106.06530  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Label Noise SGD Provably Prefers Flat Global Minimizers

    Authors: Alex Damian, Tengyu Ma, Jason D. Lee

    Abstract: In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to…

    Submitted 4 December, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: 57 pages, 5 figures, NeurIPS 2021

  49. arXiv:2105.11066  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

    Authors: Wenhao Zhan, Shicong Cen, Baihe Huang, Yuxin Chen, Jason D. Lee, Yuejie Chi

    Abstract: Policy optimization, which finds the desired policy by maximizing value functions via optimization techniques, lies at the heart of reinforcement learning (RL). In addition to value maximization, other practical considerations arise as well, including the need of encouraging exploration, and that of ensuring certain structural properties of the learned policy due to safety, resource and operationa…

    Submitted 10 January, 2023; v1 submitted 23 May, 2021; originally announced May 2021.

  50. arXiv:2105.02221  [pdf, other]

    cs.LG stat.ML

    How Fine-Tuning Allows for Effective Meta-Learning

    Authors: Kurtland Chua, Qi Lei, Jason D. Lee

    Abstract: Representation learning has been widely studied in the context of meta-learning, enabling rapid learning of new tasks through shared representations. Recent works such as MAML have explored using fine-tuning-based metrics, which measure the ease with which fine-tuning can achieve good performance, as proxies for obtaining representations. We present a theoretical framework for analyzing representati…

    Submitted 5 May, 2021; originally announced May 2021.