-
Rethinking Invariance in In-context Learning
Authors:
Lizhe Fang,
Yifei Wang,
Khashayar Gatmiry,
Lei Fang,
Yisen Wang
Abstract:
In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.
Submitted 8 May, 2025;
originally announced May 2025.
-
On the Role of Depth and Looping for In-Context Learning with Task Diversity
Authors:
Khashayar Gatmiry,
Nikunj Saunshi,
Sashank J. Reddi,
Stefanie Jegelka,
Sanjiv Kumar
Abstract:
The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers' abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn multiple tasks in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers ranging over $[1, κ]$, and highlight the importance of depth in this setting. More specifically, (a) we show theoretical lower bounds of $\log(κ)$ (or $\sqrt{κ}$) linear attention layers in the unrestricted (or restricted) attention setting and, (b) we show that multilayer Transformers can indeed solve such tasks with a number of layers that matches the lower bounds. However, we show that this expressivity of multilayer Transformers comes at the price of robustness. In particular, multilayer Transformers are not robust to even distributional shifts as small as $O(e^{-L})$ in Wasserstein distance, where $L$ is the depth of the network. We then demonstrate that Looped Transformers -- a special class of multilayer Transformers with weight-sharing -- not only exhibit similar expressive power but are also provably robust under mild assumptions. Besides out-of-distribution generalization, we also show that Looped Transformers are the only models that exhibit a monotonic behavior of loss with respect to depth.
Submitted 28 October, 2024;
originally announced October 2024.
-
Computing Optimal Regularizers for Online Linear Optimization
Authors:
Khashayar Gatmiry,
Jon Schneider,
Stefanie Jegelka
Abstract:
Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret, but the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL which achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011.
Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $α$-strongly-convex with respect to a given norm is NP-hard.
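For orientation, the following is a minimal sketch of the FTRL template the abstract refers to, using the textbook quadratic regularizer over the unit Euclidean ball. It is not the regularizer-computing algorithm of the paper; the function name `ftrl_l2_ball` and the step-size parameter `eta` are hypothetical names chosen for illustration.

```python
import math

def ftrl_l2_ball(loss_vectors, eta):
    """FTRL with regularizer R(x) = ||x||^2 / (2*eta) over the unit Euclidean ball.

    Illustrative only: the quadratic regularizer is an assumption, not the
    optimized regularizer discussed in the abstract.
    """
    d = len(loss_vectors[0])
    cum = [0.0] * d                  # running sum of the loss vectors seen so far
    plays, total_loss = [], 0.0
    for g in loss_vectors:
        # argmin over the ball of <cum, x> + ||x||^2 / (2*eta) is the
        # Euclidean projection of -eta * cum onto the ball.
        x = [-eta * c for c in cum]
        norm = math.sqrt(sum(v * v for v in x))
        if norm > 1.0:
            x = [v / norm for v in x]
        plays.append(x)
        total_loss += sum(gi * xi for gi, xi in zip(g, x))   # linear loss <g, x>
        cum = [c + gi for c, gi in zip(cum, g)]              # loss is revealed after playing
    return plays, total_loss
```

Calling `ftrl_l2_ball([[1.0, 0.0]] * 4, eta=0.5)` plays the origin first and then moves toward the direction opposing the accumulated loss, saturating at the boundary of the ball.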
Submitted 22 October, 2024;
originally announced October 2024.
-
Simplicity Bias via Global Convergence of Sharpness Minimization
Authors:
Khashayar Gatmiry,
Zhiyuan Li,
Sashank J. Reddi,
Stefanie Jegelka
Abstract:
The remarkable generalization ability of neural networks is usually attributed to the implicit bias of SGD, which often yields models with lower complexity using simpler (e.g. linear) and low-rank features. Recent works have provided empirical and theoretical evidence for the bias of particular variants of SGD (such as label noise SGD) toward flatter regions of the loss landscape. Despite the folklore intuition that flat solutions are 'simple', the connection with the simplicity of the final trained model (e.g. low-rank) is not well understood. In this work, we take a step toward bridging this gap by studying the simplicity structure that arises from minimizers of the sharpness for a class of two-layer neural networks. We show that, for any high dimensional training data and certain activations, with small enough step size, label noise SGD always converges to a network that replicates a single linear feature across all neurons, thereby implying a simple rank-one feature matrix. To obtain this result, our main technical contribution is to show that label noise SGD always minimizes the sharpness on the manifold of models with zero loss for two-layer networks. Along the way, we discover a novel property -- a local geodesic convexity -- of the trace of the Hessian of the loss at approximate stationary points on the manifold of zero loss, which links sharpness to the geometry of the manifold. This tool may be of independent interest.
Submitted 21 October, 2024;
originally announced October 2024.
-
Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
Authors:
Khashayar Gatmiry,
Nikunj Saunshi,
Sashank J. Reddi,
Stefanie Jegelka,
Sanjiv Kumar
Abstract:
The remarkable capability of Transformers to do reasoning and few-shot learning, without any fine-tuning, is widely conjectured to stem from their ability to implicitly simulate multi-step algorithms -- such as gradient descent -- with their weights in a single forward pass. Recently, there has been progress in understanding this complex phenomenon from an expressivity point of view, by demonstrating that Transformers can express such multi-step algorithms. However, our knowledge about the more fundamental aspect of its learnability, beyond single-layer models, is very limited. In particular, can training Transformers enable convergence to algorithmic solutions? In this work we resolve this for in-context linear regression with linear looped Transformers -- a multi-layer model with weight sharing that is conjectured to have an inductive bias to learn fixed-point iterative algorithms. More specifically, for this setting we show that the global minimizer of the population training loss implements multi-step preconditioned gradient descent, with a preconditioner that adapts to the data distribution. Furthermore, we show fast convergence for gradient flow on the regression loss, despite the non-convexity of the landscape, by proving a novel gradient dominance condition. To our knowledge, this is the first theoretical analysis for multi-layer Transformers in this setting. We further validate our theoretical findings through synthetic experiments.
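As a rough illustration of what "multi-step preconditioned gradient descent" means for linear regression, the sketch below applies the same update repeatedly, mirroring the weight sharing of a looped model. The preconditioner `P` is supplied by hand here; how the trained Transformer arrives at its preconditioner is the subject of the paper and is not reproduced. The names `preconditioned_gd` and `matvec` are hypothetical.

```python
def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def preconditioned_gd(X, y, P, steps):
    """Run `steps` iterations of w <- w - P * grad, where grad is the gradient
    of the least-squares loss (1/2n) * ||X w - y||^2.

    P is a fixed d x d preconditioner chosen by the caller (an assumption of
    this sketch).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):           # the identical update is reused at every step
        resid = [sum(xi * wi for xi, wi in zip(x, w)) - yi for x, yi in zip(X, y)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(d)]
        step = matvec(P, grad)
        w = [wi - si for wi, si in zip(w, step)]
    return w
```

When `P` equals the inverse of the empirical covariance `X^T X / n`, a single step recovers the least-squares solution; a scaled identity needs many more steps on ill-conditioned data, which is the gap a data-adapted preconditioner closes.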
Submitted 10 October, 2024;
originally announced October 2024.
-
What does guidance do? A fine-grained analysis in a simple setting
Authors:
Muthu Chidambaram,
Khashayar Gatmiry,
Sitan Chen,
Holden Lee,
Jianfeng Lu
Abstract:
The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution.
Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations.
In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment.
Submitted 19 September, 2024;
originally announced September 2024.
-
Adversarial Online Learning with Temporal Feedback Graphs
Authors:
Khashayar Gatmiry,
Jon Schneider
Abstract:
We study a variant of prediction with expert advice where the learner's action at round $t$ is only allowed to depend on losses on a specific subset of the rounds (where the structure of which rounds' losses are visible at time $t$ is provided by a directed "feedback graph" known to the learner). We present a novel learning algorithm for this setting based on a strategy of partitioning the losses across sub-cliques of this graph. We complement this with a lower bound that is tight in many practical settings, and which we conjecture to be within a constant factor of optimal. For the important class of transitive feedback graphs, we prove that this algorithm is efficiently implementable and obtains the optimal regret bound (up to a universal constant).
Submitted 29 June, 2024;
originally announced July 2024.
-
Learning Mixtures of Gaussians Using Diffusion Models
Authors:
Khashayar Gatmiry,
Jonathan Kelner,
Holden Lee
Abstract:
We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.
Submitted 4 March, 2025; v1 submitted 29 April, 2024;
originally announced April 2024.
-
EM for Mixture of Linear Regression with Clustered Data
Authors:
Amirhossein Reisizadeh,
Khashayar Gatmiry,
Asuman Ozdaglar
Abstract:
Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings, however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating $d$-dimensional parameters of a two-component mixture of linear regressions problem where each of $m$ nodes generates $n$ samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from $m$ batches of dependent samples each containing $n$ measurements. Discarding the clustered structure in the mixture model, EM is known to require $O(\log(mn/d))$ iterations to reach the statistical accuracy of $O(\sqrt{d/(mn)})$. In contrast, we show that if initialized properly, EM on the structured data requires only $O(1)$ iterations to reach the same statistical accuracy, as long as $m$ grows as $e^{o(n)}$. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.
Submitted 22 August, 2023;
originally announced August 2023.
-
A Unified Approach to Controlling Implicit Regularization via Mirror Descent
Authors:
Haoyuan Sun,
Khashayar Gatmiry,
Kwangjun Ahn,
Navid Azizan
Abstract:
Inspired by the remarkable success of large neural networks, there has been significant interest in understanding the generalization performance of over-parameterized models. Substantial efforts have been invested in characterizing how optimization algorithms impact generalization through their "preferred" solutions, a phenomenon commonly referred to as implicit regularization. In particular, it has been argued that gradient descent (GD) induces an implicit $\ell_2$-norm regularization in regression and classification problems. However, the implicit regularization of different algorithms is confined to either a specific geometry or a particular class of learning problems, indicating a gap in a general approach for controlling the implicit regularization. To address this, we present a unified approach using mirror descent (MD), a notable generalization of GD, to control implicit regularization in both regression and classification settings. More specifically, we show that MD with the general class of homogeneous potential functions converges in direction to a generalized maximum-margin solution for linear classification problems, thereby answering a long-standing question in the classification setting. Further, we show that MD can be implemented efficiently and enjoys fast convergence under suitable conditions. Through comprehensive experiments, we demonstrate that MD is a versatile method to produce learned models with different regularizers, which in turn have different generalization performances.
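To make the mirror descent update concrete, here is a minimal sketch using one example of a homogeneous potential, $ψ(w) = \frac{1}{p}\sum_i |w_i|^p$ with $p > 1$. The choice of this particular potential and the names `mirror_map`, `inverse_mirror_map`, and `md_step` are assumptions for illustration, not the paper's implementation.

```python
import math

def mirror_map(w, p):
    """Gradient of psi(w) = (1/p) * sum |w_i|^p, applied coordinate-wise."""
    return [math.copysign(abs(v) ** (p - 1), v) for v in w]

def inverse_mirror_map(z, p):
    """Inverse of mirror_map: recovers w from z = grad psi(w)."""
    return [math.copysign(abs(v) ** (1.0 / (p - 1)), v) for v in z]

def md_step(w, grad, eta, p):
    """One mirror descent step: grad psi(w_next) = grad psi(w) - eta * grad."""
    z = mirror_map(w, p)                              # move to the dual space
    z = [zi - eta * gi for zi, gi in zip(z, grad)]    # gradient step in the dual
    return inverse_mirror_map(z, p)                   # map back to the primal
```

With `p = 2` the mirror map is the identity and `md_step` reduces to an ordinary gradient descent step; other values of `p` change the geometry of the update and hence which solution the iterates are biased toward.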
Submitted 11 January, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
The Inductive Bias of Flatness Regularization for Deep Matrix Factorization
Authors:
Khashayar Gatmiry,
Zhiyuan Li,
Ching-Yao Chuang,
Sashank Reddi,
Tengyu Ma,
Stefanie Jegelka
Abstract:
Recent works on over-parameterized neural networks have shown that the stochasticity in optimizers has the implicit regularization effect of minimizing the sharpness of the loss function (in particular, the trace of its Hessian) over the family of zero-loss solutions. More explicit forms of flatness regularization also empirically improve the generalization performance. However, it remains unclear why and when flatness regularization leads to better generalization. This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in an important setting: learning deep linear networks from linear measurements, also known as deep matrix factorization. We show that for all depths greater than one, with the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters (i.e., the product of all layer matrices), which in turn leads to better generalization. We empirically verify our theoretical findings on synthetic datasets.
Submitted 22 June, 2023;
originally announced June 2023.
-
Projection-Free Online Convex Optimization via Efficient Newton Iterations
Authors:
Khashayar Gatmiry,
Zakaria Mhammedi
Abstract:
This paper presents new projection-free algorithms for Online Convex Optimization (OCO) over a convex domain $\mathcal{K} \subset \mathbb{R}^d$. Classical OCO algorithms (such as Online Gradient Descent) typically need to perform Euclidean projections onto the convex set $\mathcal{K}$ to ensure feasibility of their iterates. Alternative algorithms, such as those based on the Frank-Wolfe method, swap potentially-expensive Euclidean projections onto $\mathcal{K}$ for linear optimization over $\mathcal{K}$. However, such algorithms have a sub-optimal regret in OCO compared to projection-based algorithms. In this paper, we look at a third type of algorithm that outputs approximate Newton iterates using a self-concordant barrier for the set of interest. The use of a self-concordant barrier automatically ensures feasibility without the need for projections. However, the computation of the Newton iterates requires a matrix inverse, which can still be expensive. As our main contribution, we show how the stability of the Newton iterates can be leveraged to compute the inverse Hessian in only a vanishing fraction of the rounds, leading to a new efficient projection-free OCO algorithm with a state-of-the-art regret bound.
Submitted 19 June, 2023;
originally announced June 2023.
-
When does Metropolized Hamiltonian Monte Carlo provably outperform Metropolis-adjusted Langevin algorithm?
Authors:
Yuansi Chen,
Khashayar Gatmiry
Abstract:
We analyze the mixing time of Metropolized Hamiltonian Monte Carlo (HMC) with the leapfrog integrator to sample from a distribution on $\mathbb{R}^d$ whose log-density is smooth, has Lipschitz Hessian in Frobenius norm and satisfies isoperimetry. We bound the gradient complexity to reach $ε$ error in total variation distance from a warm start by $\tilde O(d^{1/4}\text{polylog}(1/ε))$ and demonstrate the benefit of choosing the number of leapfrog steps to be larger than 1. To surpass previous analysis on Metropolis-adjusted Langevin algorithm (MALA) that has $\tilde{O}(d^{1/2}\text{polylog}(1/ε))$ dimension dependency in Wu et al. (2022), we reveal a key feature in our proof that the joint distribution of the location and velocity variables of the discretization of the continuous HMC dynamics stays approximately invariant. This key feature, when shown via induction over the number of leapfrog steps, enables us to obtain estimates on moments of various quantities that appear in the acceptance rate control of Metropolized HMC. Moreover, to deal with another bottleneck on the HMC proposal distribution overlap control in the literature, we provide a new approach to upper bound the Kullback-Leibler divergence between push-forwards of the Gaussian distribution through HMC dynamics initialized at two different points. Notably, our analysis does not require log-concavity or independence of the marginals, and only relies on an isoperimetric inequality. To illustrate the applicability of our result, several examples of natural functions that fall into our framework are discussed.
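For readers unfamiliar with the leapfrog integrator mentioned above, the following one-dimensional sketch shows the standard scheme for the Hamiltonian $H(q, p) = U(q) + p^2/2$. It covers only the discretized dynamics, not the Metropolis accept/reject step or the paper's analysis; the function name `leapfrog` and its parameters are hypothetical.

```python
def leapfrog(q, p, grad_U, step, n_steps):
    """Simulate Hamiltonian dynamics for H(q, p) = U(q) + p^2 / 2.

    q, p    : initial position and momentum (floats, one dimension)
    grad_U  : callable returning dU/dq
    step    : integrator step size
    n_steps : number of leapfrog steps
    """
    p = p - 0.5 * step * grad_U(q)        # initial half step for the momentum
    for i in range(n_steps):
        q = q + step * p                  # full step for the position
        if i < n_steps - 1:
            p = p - step * grad_U(q)      # full momentum step between position steps
    p = p - 0.5 * step * grad_U(q)        # final half step for the momentum
    return q, p
```

Two properties make this integrator suitable as a Metropolis proposal: it is time-reversible (integrating again from the endpoint with the momentum negated returns to the start) and it nearly conserves the Hamiltonian for small step sizes.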
Submitted 8 June, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
A Simple Proof of the Mixing of Metropolis-Adjusted Langevin Algorithm under Smoothness and Isoperimetry
Authors:
Yuansi Chen,
Khashayar Gatmiry
Abstract:
We study the mixing time of Metropolis-Adjusted Langevin algorithm (MALA) for sampling a target density on $\mathbb{R}^d$. We assume that the target density satisfies $ψ_μ$-isoperimetry and that the operator norm and trace of its Hessian are bounded by $L$ and $Υ$ respectively. Our main result establishes that, from a warm start, to achieve $ε$-total variation distance to the target density, MALA mixes in $O\left(\frac{(LΥ)^{\frac12}}{ψ_μ^2} \log\left(\frac{1}ε\right)\right)$ iterations. Notably, this result holds beyond the log-concave sampling setting and the mixing time depends on only $Υ$ rather than its upper bound $L d$. In the $m$-strongly logconcave and $L$-log-smooth sampling setting, our bound recovers the previous minimax mixing bound of MALA~\cite{wu2021minimax}.
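As background for the algorithm being analyzed, here is a minimal one-dimensional sketch of MALA: a Langevin proposal followed by a Metropolis-Hastings correction. The step size `h`, the function name `mala`, and the use of Python's `random.Random` as the noise source are choices made for this illustration; the abstract does not prescribe an implementation.

```python
import math
import random

def mala(log_p, grad_log_p, x0, h, n_steps, rng):
    """Metropolis-adjusted Langevin algorithm in one dimension.

    log_p, grad_log_p : log-density (up to a constant) and its derivative
    h                 : step size of the Langevin proposal
    rng               : a random.Random instance
    Returns the list of samples and the empirical acceptance rate.
    """
    def log_q(to, frm):
        # Log-density (up to a constant) of proposing `to` from `frm`:
        # Gaussian with mean frm + h * grad_log_p(frm) and variance 2h.
        mean = frm + h * grad_log_p(frm)
        return -(to - mean) ** 2 / (4.0 * h)

    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        y = x + h * grad_log_p(x) + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
        log_alpha = log_p(y) + log_q(x, y) - log_p(x) - log_q(y, x)
        if rng.random() < math.exp(min(0.0, log_alpha)):   # Metropolis-Hastings test
            x, accepted = y, accepted + 1
        samples.append(x)
    return samples, accepted / n_steps
```

For a standard Gaussian target one can call `mala(lambda x: -0.5 * x * x, lambda x: -x, 0.0, 0.5, 20000, random.Random(0))`; the sample mean and variance should then be close to 0 and 1.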
Submitted 8 June, 2023; v1 submitted 8 April, 2023;
originally announced April 2023.
-
Sampling with Barriers: Faster Mixing via Lewis Weights
Authors:
Khashayar Gatmiry,
Jonathan Kelner,
Santosh S. Vempala
Abstract:
We analyze Riemannian Hamiltonian Monte Carlo (RHMC) for sampling a polytope defined by $m$ inequalities in $\mathbb{R}^n$ endowed with the metric defined by the Hessian of a convex barrier function. The advantage of RHMC over Euclidean methods such as the ball walk, hit-and-run and the Dikin walk is in its ability to take longer steps. However, in all previous work, the mixing rate has a linear dependence on the number of inequalities. We introduce a hybrid of the Lewis weights barrier and the standard logarithmic barrier and prove that the mixing rate for the corresponding RHMC is bounded by $\tilde O(m^{1/3}n^{4/3})$, improving on the previous best bound of $\tilde O(mn^{2/3})$ (based on the log barrier). This continues the general parallels between optimization and sampling, with the latter typically leading to new tools and more refined analysis. To prove our main results, we have to overcome several challenges relating to the smoothness of Hamiltonian curves and the self-concordance properties of the barrier. In the process, we give a general framework for the analysis of Markov chains on Riemannian manifolds, derive new smoothness bounds on Hamiltonian curves, a central topic of comparison geometry, and extend self-concordance to the infinity norm, which gives sharper bounds; these properties appear to be of independent interest.
Submitted 19 April, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Near-Optimal Algorithms for Group Distributionally Robust Optimization and Beyond
Authors:
Tasuku Soma,
Khashayar Gatmiry,
Sharut Gupta,
Stefanie Jegelka
Abstract:
Distributionally robust optimization (DRO) can improve the robustness and fairness of learning methods. In this paper, we devise stochastic algorithms for a class of DRO problems including group DRO, subpopulation fairness, and empirical conditional value at risk (CVaR) optimization. Our new algorithms achieve faster convergence rates than existing algorithms for multiple DRO settings. We also provide a new information-theoretic lower bound that implies our bounds are tight for group DRO. Empirically, too, our algorithms outperform known methods.
Submitted 31 January, 2025; v1 submitted 27 December, 2022;
originally announced December 2022.
-
Bandit Algorithms for Prophet Inequality and Pandora's Box
Authors:
Khashayar Gatmiry,
Thomas Kesselheim,
Sahil Singla,
Yifan Wang
Abstract:
The Prophet Inequality and Pandora's Box problems are fundamental stochastic problems with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A usual assumption in these works is that the probability distributions of the $n$ underlying random variables are given as input to the algorithm. Since in practice these distributions need to be learned, we initiate the study of such stochastic problems in the Multi-Armed Bandits model.
In the Multi-Armed Bandits model we interact with $n$ unknown distributions over $T$ rounds: in round $t$ we play a policy $x^{(t)}$ and receive a partial (bandit) feedback on the performance of $x^{(t)}$. The goal is to minimize the regret, which is the difference over $T$ rounds in the total value of the optimal algorithm that knows the distributions vs. the total value of our algorithm that learns the distributions from the partial feedback. Our main results give near-optimal $\tilde{O}(\mathsf{poly}(n)\sqrt{T})$ total regret algorithms for both Prophet Inequality and Pandora's Box.
Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound that is learnable while playing low-regret Bandit policies.
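To make the regret definition above concrete, here is a minimal sketch for the Prophet Inequality setting, assuming a policy $x^{(t)}$ is a vector of per-index acceptance thresholds. The helper names are hypothetical, and the code measures realized regret on given samples, whereas the abstract refers to regret against the optimal algorithm that knows the distributions.

```python
def stopping_value(values, thresholds):
    """Accept the first value that meets its threshold; return 0 if none does."""
    for v, tau in zip(values, thresholds):
        if v >= tau:
            return v
    return 0.0

def total_regret(rounds, played, benchmark):
    """Sum over rounds of the benchmark policy's value minus the played policy's value."""
    return sum(
        stopping_value(vals, benchmark) - stopping_value(vals, x_t)
        for vals, x_t in zip(rounds, played)
    )
```

In the bandit model the learner would observe only the value it obtained in each round, not the full vector of realizations used here to score the benchmark.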
Submitted 6 December, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Quasi-Newton Steps for Efficient Online Exp-Concave Optimization
Authors:
Zakaria Mhammedi,
Khashayar Gatmiry
Abstract:
The aim of this paper is to design computationally efficient and optimal algorithms for the online and stochastic exp-concave optimization settings. Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee an $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set. However, such algorithms perform so-called generalized projections whenever their iterates step outside the feasible set. Such generalized projections require $Ω(d^3)$ arithmetic operations even for simple sets such as a Euclidean ball, making the total runtime of ONS of order $d^3 T$ after $T$ rounds in the worst case. In this paper, we side-step generalized projections by using a self-concordant barrier as a regularizer to compute the Newton steps. This ensures that the iterates are always within the feasible set without requiring projections. This approach still requires the computation of the inverse of the Hessian of the barrier at every step. However, using the stability properties of the Newton steps, we show that the inverse of the Hessians can be efficiently approximated via Taylor expansions for most rounds, resulting in an $O(d^2 T +d^ω\sqrt{T})$ total computational complexity, where $ω$ is the exponent of matrix multiplication. In the stochastic setting, we show that this translates into an $O(d^3/ε)$ computational complexity for finding an $ε$-suboptimal point, answering an open question by Koren 2013. We first show these new results for the simple case where the feasible set is a Euclidean ball. Then, to move to general convex sets, we use a reduction to Online Convex Optimization over the Euclidean ball. Our final algorithm can be viewed as a more efficient version of ONS.
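For context on why the projections dominate the cost, the sketch below shows the unprojected part of a standard ONS-style step: the matrix $A_t = A_{t-1} + g_t g_t^\top$ can have its inverse maintained in $O(d^2)$ time per round with the Sherman-Morrison formula. This is a textbook illustration under that assumption, not the barrier-based method of the paper, and the function names are hypothetical.

```python
def sherman_morrison_update(A_inv, g):
    """Return (A + g g^T)^{-1} given A^{-1}, for symmetric A, in O(d^2) time."""
    d = len(g)
    # u = A^{-1} g; since A^{-1} is symmetric, g^T A^{-1} equals u^T.
    u = [sum(A_inv[i][j] * g[j] for j in range(d)) for i in range(d)]
    denom = 1.0 + sum(g[i] * u[i] for i in range(d))
    return [[A_inv[i][j] - u[i] * u[j] / denom for j in range(d)] for i in range(d)]

def newton_step(x, A_inv, g, step):
    """Unprojected update x - step * A^{-1} g (no feasibility guarantee)."""
    d = len(x)
    return [x[i] - step * sum(A_inv[i][j] * g[j] for j in range(d)) for i in range(d)]
```

The iterate returned by `newton_step` may leave the feasible set, which is exactly where the generalized projection would be needed.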
Submitted 14 February, 2023; v1 submitted 2 November, 2022;
originally announced November 2022.
-
On the generalization of learning algorithms that do not converge
Authors:
Nisha Chandramoorthy,
Andreas Loukas,
Khashayar Gatmiry,
Stefanie Jegelka
Abstract:
Generalization analyses of deep learning typically assume that the training converges to a fixed point. However, recent results indicate that in practice, the weights of deep neural networks optimized with stochastic gradient descent often oscillate indefinitely. To reduce this discrepancy between theory and practice, this paper focuses on the generalization of neural networks whose training dynamics do not necessarily converge to fixed points. Our main contribution is to propose a notion of statistical algorithmic stability (SAS) that extends classical algorithmic stability to non-convergent algorithms and to study its connection to generalization. This ergodic-theoretic approach leads to new insights when compared to the traditional optimization and learning theory perspectives. We prove that the stability of the time-asymptotic behavior of a learning algorithm relates to its generalization and empirically demonstrate how loss dynamics can provide clues to generalization performance. Our findings provide evidence that networks that "train stably generalize better" even when the training continues indefinitely and the weights do not converge.
Submitted 19 August, 2022; v1 submitted 16 August, 2022;
originally announced August 2022.
-
Convergence of the Riemannian Langevin Algorithm
Authors:
Khashayar Gatmiry,
Santosh S. Vempala
Abstract:
We study the Riemannian Langevin Algorithm for the problem of sampling from a distribution with density $ν$ with respect to the natural measure on a manifold with metric $g$. We assume that the target density satisfies a log-Sobolev inequality with respect to the metric and prove that the manifold generalization of the Unadjusted Langevin Algorithm converges rapidly to $ν$ for Hessian manifolds. This allows us to reduce the problem of sampling non-smooth (constrained) densities in ${\bf R}^n$ to sampling smooth densities over appropriate manifolds, while needing access only to the gradient of the log-density, and this, in turn, to sampling from the natural Brownian motion on the manifold. Our main analytic tools are (1) an extension of self-concordance to manifolds, and (2) a stochastic approach to bounding smoothness on manifolds. A special case of our approach is sampling isoperimetric densities restricted to polytopes by using the metric defined by the logarithmic barrier.
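The metric in the polytope special case can be written down explicitly. For a polytope $\{x : a_i^\top x \le b_i\}$, the logarithmic barrier is $φ(x) = -\sum_i \log(b_i - a_i^\top x)$ and its Hessian is $\sum_i a_i a_i^\top / (b_i - a_i^\top x)^2$. The sketch below only evaluates this matrix at an interior point; the function name is hypothetical and no sampling step is implemented.

```python
def log_barrier_metric(A, b, x):
    """Hessian of the log barrier of {x : A x <= b} at a strictly interior point x."""
    d = len(x)
    G = [[0.0] * d for _ in range(d)]
    for a_i, b_i in zip(A, b):
        slack = b_i - sum(a_ij * x_j for a_ij, x_j in zip(a_i, x))
        if slack <= 0:
            raise ValueError("x must lie strictly inside the polytope")
        for r in range(d):
            for c in range(d):
                G[r][c] += a_i[r] * a_i[c] / slack ** 2
    return G
```

The metric grows without bound as a slack approaches zero, which is what keeps a process driven by it away from the boundary.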
Submitted 22 April, 2022;
originally announced April 2022.
-
Testing Determinantal Point Processes
Authors:
Khashayar Gatmiry,
Maryam Aliakbarpour,
Stefanie Jegelka
Abstract:
Determinantal point processes (DPPs) are popular probabilistic models of diversity. In this paper, we investigate DPPs from a new perspective: property testing of distributions. Given sample access to an unknown distribution $q$ over the subsets of a ground set, we aim to distinguish whether $q$ is a DPP distribution, or $ε$-far from all DPP distributions in $\ell_1$-distance. In this work, we propose the first algorithm for testing DPPs. Furthermore, we establish a matching lower bound on the sample complexity of DPP testing. This lower bound also extends to showing a new hardness result for the problem of testing the more general class of log-submodular distributions.
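For readers unfamiliar with DPPs, the following sketch computes subset probabilities for the standard L-ensemble form, where $P(S) = \det(L_S) / \det(L + I)$ for a positive semidefinite kernel $L$. It illustrates the kind of distribution being tested, assuming that parameterization; it is not the testing algorithm, and the helper names are hypothetical.

```python
from itertools import combinations

def det(M):
    """Determinant by cofactor expansion (tiny matrices only); the 0x0 case is 1."""
    if not M:
        return 1.0
    total = 0.0
    for j, pivot in enumerate(M[0]):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * pivot * det(minor)
    return total

def dpp_probability(L, subset):
    """Probability of `subset` under the L-ensemble DPP with kernel L."""
    n = len(L)
    S = sorted(subset)
    L_S = [[L[i][j] for j in S] for i in S]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return det(L_S) / det(L_plus_I)

def all_subsets(n):
    """Every subset of {0, ..., n-1}, as tuples."""
    return [c for k in range(n + 1) for c in combinations(range(n), k)]
```

With a kernel whose off-diagonal entries are large, the determinant of the joint submatrix shrinks, so similar items are less likely to appear together; this is the sense in which DPPs model diversity.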
Submitted 9 August, 2020;
originally announced August 2020.
-
Non-submodular Function Maximization subject to a Matroid Constraint, with Applications
Authors:
Khashayar Gatmiry,
Manuel Gomez-Rodriguez
Abstract:
The standard greedy algorithm has been recently shown to enjoy approximation guarantees for constrained non-submodular nondecreasing set function maximization. While these recent results allow us to better characterize the empirical success of the greedy algorithm, they are only applicable to simple cardinality constraints. In this paper, we study the problem of maximizing a non-submodular nondecreasing set function subject to a general matroid constraint. We first show that the standard greedy algorithm offers an approximation factor of $\frac{0.4 γ^{2}}{\sqrt{γr} + 1}$, where $γ$ is the submodularity ratio of the function and $r$ is the rank of the matroid. Then, we show that the same greedy algorithm offers a constant approximation factor of $(1 + 1/(1-α))^{-1}$, where $α$ is the generalized curvature of the function. In addition, we demonstrate that these approximation guarantees are applicable to several real-world applications in which the submodularity ratio and the generalized curvature can be bounded. Finally, we show that the greedy algorithm achieves competitive performance in practice using a variety of experiments on synthetic and real-world data.
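A minimal sketch of the standard greedy algorithm under a matroid constraint is given below, assuming the matroid is supplied as an independence oracle. The oracle, the coverage objective, and the item names are hypothetical choices for illustration; the coverage function used here happens to be submodular, whereas the paper's guarantees concern the more general non-submodular case.

```python
def greedy_matroid(ground_set, f, is_independent):
    """Repeatedly add the feasible element with the largest marginal gain."""
    chosen = []
    remaining = list(ground_set)
    while True:
        best, best_gain = None, None
        for e in remaining:
            if not is_independent(chosen + [e]):
                continue  # adding e would violate the matroid constraint
            gain = f(chosen + [e]) - f(chosen)
            if best_gain is None or gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            return chosen  # no feasible element left: chosen is a basis
        chosen.append(best)
        remaining.remove(best)

# Hypothetical example: partition matroid allowing one item per color.
COVERS = {"a": {1, 2}, "b": {2, 3, 4}, "c": {4}, "d": {5, 6}}
COLOR = {"a": "red", "b": "red", "c": "blue", "d": "blue"}

def coverage(items):
    """Number of distinct elements covered by the chosen items."""
    return len(set().union(*(COVERS[i] for i in items))) if items else 0

def one_per_color(items):
    """Independence oracle: no two chosen items share a color."""
    colors = [COLOR[i] for i in items]
    return len(colors) == len(set(colors))

selection = greedy_matroid(["a", "b", "c", "d"], coverage, one_per_color)
```

Each iteration calls the objective once per remaining element, so the oracle cost dominates the running time.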
Submitted 8 October, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Information Theoretic Bounds on Optimal Worst-case Error in Binary Mixture Identification
Authors:
Khashayar Gatmiry,
Seyed Abolfazl Motahari
Abstract:
Identification of latent binary sequences from a pool of noisy observations has a wide range of applications in both statistical learning and population genetics. Each observed sequence is the result of passing one of the latent mother-sequences through a binary symmetric channel, which makes this configuration analogous to a special case of Bernoulli Mixture Models. This paper aims to attain an asymptotically tight upper bound on the error of Maximum Likelihood mixture identification in such problems. The obtained results demonstrate fundamental guarantees on the inference accuracy of the optimal estimator. To this end, we set out to find the closest pair of discrete distributions with respect to the Chernoff Information measure. We provide a novel technique to lower bound the Chernoff Information in an efficient way. We also show that a drastic phase transition occurs at noise level 0.25. Our findings reveal that the identification problem becomes much harder as the noise probability exceeds this threshold.
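The Chernoff Information between two distributions $P$ and $Q$ is $-\min_{λ \in [0,1]} \log \sum_x P(x)^{λ} Q(x)^{1-λ}$. The sketch below approximates it for a single pair of Bernoulli distributions by a grid search over $λ$; the function name and grid resolution are assumptions for this example, and it is not the lower-bounding technique of the paper.

```python
import math

def chernoff_information(p, q, grid=1000):
    """Approximate Chernoff information between Bernoulli(p) and Bernoulli(q).

    Requires 0 < p, q < 1. Uses a grid search over lambda in [0, 1],
    so the result is a lower approximation of the true maximum.
    """
    best = 0.0
    for k in range(grid + 1):
        lam = k / grid
        s = p ** lam * q ** (1 - lam) + (1 - p) ** lam * (1 - q) ** (1 - lam)
        best = max(best, -math.log(s))
    return best
```

For one coordinate where two mother-sequences differ, a binary symmetric channel with flip probability $f$ yields observations distributed as Bernoulli($f$) and Bernoulli($1-f$), so `chernoff_information(f, 1 - f)` measures how distinguishable that coordinate is.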
Submitted 27 November, 2018; v1 submitted 18 November, 2018;
originally announced November 2018.