-
Computational Intractability of Strategizing against Online Learners
Authors:
Angelos Assos,
Yuval Dagan,
Nived Rajaraman
Abstract:
Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases - such as structured auction settings or contract design - no general efficient algorithm is known.
In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $Ω(T)$ hardness bound, significantly strengthening previous work that only showed an additive $Θ(1)$ impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play - an algorithm that is not no-regret - we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.
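The learner in this hardness result runs Multiplicative Weights Update (MWU). As a rough illustration only, the standard exponential-weights form of MWU can be sketched as below. The function name `mwu_play`, the loss-matrix layout, and the step size `eta` are assumptions made for this sketch and are not taken from the paper.

```python
import math


def mwu_play(loss_rows, eta):
    """Run exponential-weights MWU over a fixed sequence of loss vectors.

    loss_rows[t][i] is the loss of action i at round t (assumed to lie in [0, 1]).
    Returns the mixed strategy the learner plays at each round.
    """
    num_actions = len(loss_rows[0])
    weights = [1.0] * num_actions  # uniform start (an assumption of this sketch)
    history = []
    for losses in loss_rows:
        total = sum(weights)
        # The learner commits to its strategy before seeing this round's losses.
        history.append([w / total for w in weights])
        # Multiplicative update: actions with higher loss are down-weighted.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    return history


# Action 1 never incurs loss, so its probability grows round after round.
strategies = mwu_play([[1.0, 0.0]] * 3, eta=1.0)
```

In the setting of the abstract, the loss vectors would be induced by the optimizer's chosen actions, which is what makes the optimizer's planning problem non-trivial.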
Submitted 6 March, 2025;
originally announced March 2025.
-
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Authors:
Amrith Setlur,
Nived Rajaraman,
Sergey Levine,
Aviral Kumar
Abstract:
Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: first, distilling successful search or thinking traces; and second, using verification (e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős, 1945]. This implies a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF methods widening as test-time budget grows. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
Submitted 18 February, 2025; v1 submitted 17 February, 2025;
originally announced February 2025.
-
From Markov to Laplace: How Mamba In-Context Learns Markov Chains
Authors:
Marco Bondaschi,
Nived Rajaraman,
Xiuying Wei,
Kannan Ramchandran,
Razvan Pascanu,
Caglar Gulcehre,
Michael Gastpar,
Ashok Vardhan Makkuva
Abstract:
While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and Selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speed ups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.
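For readers unfamiliar with the estimator named above, the following is a minimal sketch of an in-context add-constant (Laplacian) smoothing estimator for an order-`k` Markov chain. The function name `laplace_next_symbol`, its arguments, and the default `beta=1.0` are assumptions chosen for illustration; the paper may use a different smoothing constant or parametrization.

```python
from collections import Counter


def laplace_next_symbol(seq, k, alphabet, beta=1.0):
    """Estimate P(next symbol | last k symbols) from the sequence itself.

    Counts how often each symbol followed earlier occurrences of the current
    length-k context, then adds beta to every count before normalizing.
    beta must be positive so that unseen contexts still give a valid estimate.
    """
    if k < 1 or len(seq) < k:
        raise ValueError("need k >= 1 and at least k observed symbols")
    context = tuple(seq[-k:])
    counts = Counter()
    for i in range(len(seq) - k):
        if tuple(seq[i:i + k]) == context:
            counts[seq[i + k]] += 1
    total = sum(counts.values())
    denom = total + beta * len(alphabet)
    return {x: (counts[x] + beta) / denom for x in alphabet}


# Order-1 example: after a 0, the sequence has shown a 1 twice and a 0 never.
probs = laplace_next_symbol([0, 1, 0, 1, 0], k=1, alphabet=[0, 1])
```

With `beta=1.0` this is the classical add-one rule, so the example assigns probability 3/4 to the symbol 1 and 1/4 to the symbol 0.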
Submitted 14 February, 2025;
originally announced February 2025.
-
Transformers on Markov Data: Constant Depth Suffices
Authors:
Nived Rajaraman,
Marco Bondaschi,
Kannan Ramchandran,
Michael Gastpar,
Ashok Vardhan Makkuva
Abstract:
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We observe a surprising phenomenon empirically which contradicts previous findings: when trained for sufficiently long, a transformer with a fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for $k^{\text{th}}$-order Markov sources, concurring with our empirical observations. Along the way, we prove that attention-only transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. These results provide more insight into our current understanding of the mechanisms by which transformers learn to capture context, by understanding their behavior on Markov sources.
Submitted 24 July, 2024;
originally announced July 2024.
-
Toward a Theory of Tokenization in LLMs
Authors:
Nived Rajaraman,
Jiantao Jiao,
Kannan Ramchandran
Abstract:
While there has been a large body of research attempting to circumvent tokenization for language modeling (Clark et al., 2022; Xue et al., 2022), the current consensus is that it is a necessary initial step for designing state-of-the-art performant language models. In this paper, we investigate tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes. When trained on data drawn from certain simple $k^{\text{th}}$-order Markov processes for $k > 1$, transformers exhibit a surprising phenomenon - in the absence of tokenization, they empirically fail to learn the right distribution and predict characters according to a unigram model (Makkuva et al., 2024). With the addition of tokenization, however, we empirically observe that transformers break through this barrier and are able to model the probabilities of sequences drawn from the source near-optimally, achieving small cross-entropy loss. With this observation as starting point, we study the end-to-end cross-entropy loss achieved by transformers with and without tokenization. With the appropriate tokenization, we show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $k^{\text{th}}$-order Markov sources near optimally. Our analysis provides a justification for the use of tokenization in practice through studying the behavior of transformers on Markovian data.
Submitted 10 April, 2025; v1 submitted 12 April, 2024;
originally announced April 2024.
-
Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing
Authors:
Nived Rajaraman,
Devvrit,
Aryan Mokhtari,
Kannan Ramchandran
Abstract:
Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters. In fact, several practical studies have shown that if a pruned model is fine-tuned with some gradient-based updates it generalizes well to new samples. Although the above pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, there is very little known about the theory behind this success. In this paper, we address this issue by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem with the ground truth $U_\star \in \mathbb{R}^{d \times r}$ and the overparameterized model $U \in \mathbb{R}^{d \times k}$ with $k \gg r$. We study the approximate local minima of the mean square error, augmented with a smooth version of a group Lasso regularizer, $\sum_{i=1}^k \| U e_i \|_2$. In particular, we provably show that pruning all the columns below a certain explicit $\ell_2$-norm threshold results in a solution $U_{\text{prune}}$ which has the minimum number of columns $r$, yet is close to the ground truth in training loss. Moreover, in the subsequent fine-tuning phase, gradient descent initialized at $U_{\text{prune}}$ converges at a linear rate to its limit. While our analysis provides insights into the role of regularization in pruning, we also show that running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns could have their $\ell_2$ norm comparable to that of the maximum. To the best of our knowledge, our results provide the first rigorous insights on why greedy pruning + fine-tuning leads to smaller models which also generalize well.
Submitted 4 June, 2023; v1 submitted 20 March, 2023;
originally announced March 2023.
-
Statistical Complexity and Optimal Algorithms for Non-linear Ridge Bandits
Authors:
Nived Rajaraman,
Yanjun Han,
Jiantao Jiao,
Kannan Ramchandran
Abstract:
We consider the sequential decision-making problem where the mean outcome is a non-linear function of the chosen action. Compared with the linear model, two curious phenomena arise in non-linear models: first, in addition to the "learning phase" with a standard parametric rate for estimation or regret, there is a "burn-in period" with a fixed cost determined by the non-linear function; second, achieving the smallest burn-in cost requires new exploration algorithms. For a special family of non-linear functions named ridge functions in the literature, we derive upper and lower bounds on the optimal burn-in cost, and in addition, on the entire learning trajectory during the burn-in period via differential equations. In particular, a two-stage algorithm that first finds a good initial action and then treats the problem as locally linear is statistically optimal. In contrast, several classical algorithms, such as UCB and algorithms relying on regression oracles, are provably suboptimal.
Submitted 9 January, 2024; v1 submitted 12 February, 2023;
originally announced February 2023.
-
Sample Efficient Deep Reinforcement Learning via Local Planning
Authors:
Dong Yin,
Sridhar Thiagarajan,
Nevena Lazic,
Nived Rajaraman,
Botao Hao,
Csaba Szepesvari
Abstract:
The focus of this work is sample-efficient deep reinforcement learning (RL) with a simulator. One useful property of simulators is that it is typically easy to reset the environment to a previously observed state. We propose an algorithmic framework, named uncertainty-first local planning (UFLP), that takes advantage of this property. Concretely, in each data collection iteration, with some probability, our meta-algorithm resets the environment to an observed state which has high uncertainty, instead of sampling according to the initial-state distribution. The agent-environment interaction then proceeds as in the standard online RL setting. We demonstrate that this simple procedure can dramatically improve the sample cost of several baseline RL algorithms on difficult exploration tasks. Notably, with our framework, we can achieve super-human performance on the notoriously hard Atari game, Montezuma's Revenge, with a simple (distributional) double DQN. Our work can be seen as an efficient approximate implementation of an existing algorithm with theoretical guarantees, which offers an interpretation of the positive empirical results.
Submitted 3 July, 2023; v1 submitted 29 January, 2023;
originally announced January 2023.
-
Spectral Regularization Allows Data-frugal Learning over Combinatorial Spaces
Authors:
Amirali Aghazadeh,
Nived Rajaraman,
Tony Tu,
Kannan Ramchandran
Abstract:
Data-driven machine learning models are being increasingly employed in several important inference problems in biology, chemistry, and physics which require learning over combinatorial spaces. Recent empirical evidence (see, e.g., [1], [2], [3]) suggests that regularizing the spectral representation of such models improves their generalization power when labeled data is scarce. However, despite these empirical studies, the theoretical underpinning of when and how spectral regularization enables improved generalization is poorly understood. In this paper, we focus on learning pseudo-Boolean functions and demonstrate that regularizing the empirical mean squared error by the L_1 norm of the spectral transform of the learned function reshapes the loss landscape and allows for data-frugal learning, under a restricted secant condition on the learner's empirical error measured against the ground truth function. Under a weaker quadratic growth condition, we show that stationary points which also approximately interpolate the training data points achieve statistically optimal generalization performance. Complementing our theory, we empirically demonstrate that running gradient descent on the regularized loss results in a better generalization performance compared to baseline algorithms in several data-scarce real-world problems.
Submitted 5 October, 2022;
originally announced October 2022.
-
Minimax Optimal Online Imitation Learning via Replay Estimation
Authors:
Gokul Swamy,
Nived Rajaraman,
Matthew Peng,
Sanjiban Choudhury,
J. Andrew Bagnell,
Zhiwei Steven Wu,
Jiantao Jiao,
Kannan Ramchandran
Abstract:
Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e., learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O} \left( \min \left( H^{3/2} / N, H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
Submitted 14 January, 2023; v1 submitted 30 May, 2022;
originally announced May 2022.
-
Semi-supervised Active Regression
Authors:
Fnu Devvrit,
Nived Rajaraman,
Pranjal Awasthi
Abstract:
Labelled data often comes at a high cost as it may require recruiting human labelers or running costly experiments. At the same time, in many practical scenarios, one already has access to a partially labelled, potentially biased dataset that can help with the learning task at hand. Motivated by such settings, we formally initiate a study of semi-supervised active learning through the frame of linear regression. In this setting, the learner has access to a dataset $X \in \mathbb{R}^{(n_1+n_2) \times d}$ which is composed of $n_1$ unlabelled examples that an algorithm can actively query, and $n_2$ examples labelled a priori. Concretely, denoting the true labels by $Y \in \mathbb{R}^{n_1 + n_2}$, the learner's objective is to find $\widehat{β} \in \mathbb{R}^d$ such that \begin{equation}
\| X \widehat{β} - Y \|_2^2 \le (1 + ε) \min_{β \in \mathbb{R}^d} \| X β - Y \|_2^2 \end{equation} while making as few additional label queries as possible. In order to bound the label queries, we introduce an instance-dependent parameter called the reduced rank, denoted by $R_X$, and propose an efficient algorithm with query complexity $O(R_X/ε)$. This result directly implies improved upper bounds for two important special cases: (i) active ridge regression, and (ii) active kernel ridge regression, where the reduced rank equates to the statistical dimension $sd_λ$ and the effective dimension $d_λ$ of the problem, respectively, and $λ \ge 0$ denotes the regularization parameter. For active ridge regression we also prove a matching lower bound of $Ω(sd_λ/ε)$ on the query complexity of any algorithm. This subsumes prior work that only considered the unregularized case, i.e., $λ = 0$.
Submitted 11 June, 2021;
originally announced June 2021.
-
Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally
Authors:
Nived Rajaraman,
Yanjun Han,
Lin F. Yang,
Kannan Ramchandran,
Jiantao Jiao
Abstract:
We study the statistical limits of Imitation Learning (IL) in episodic Markov Decision Processes (MDPs) with a state space $\mathcal{S}$. We focus on the known-transition setting where the learner is provided a dataset of $N$ length-$H$ trajectories from a deterministic expert policy and knows the MDP transition. We establish an upper bound $O(|\mathcal{S}|H^{3/2}/N)$ for the suboptimality using the Mimic-MD algorithm in Rajaraman et al (2020) which we prove to be computationally efficient. In contrast, we show the minimax suboptimality grows as $Ω( H^{3/2}/N)$ when $|\mathcal{S}|\geq 3$ while the unknown-transition setting suffers from a larger sharp rate $Θ(|\mathcal{S}|H^2/N)$ (Rajaraman et al (2020)). The lower bound is established by proving a two-way reduction between IL and the value estimation problem of the unknown expert policy under any given reward function, as well as building connections with linear functional estimation with subsampled observations. We further show that under the additional assumption that the expert is optimal for the true reward function, there exists an efficient algorithm, which we term as Mimic-Mixture, that provably achieves suboptimality $O(1/N)$ for arbitrary 3-state MDPs with rewards only at the terminal layer. In contrast, no algorithm can achieve suboptimality $O(\sqrt{H}/N)$ with high probability if the expert is not constrained to be optimal. Our work formally establishes the benefit of the expert optimal assumption in the known transition setting, while Rajaraman et al (2020) showed it does not help when transitions are unknown.
Submitted 25 February, 2021;
originally announced February 2021.
-
How good is Good-Turing for Markov samples?
Authors:
Prafulla Chandra,
Andrew Thangaraj,
Nived Rajaraman
Abstract:
The Good-Turing (GT) estimator for the missing mass (i.e., total probability of missing symbols) in $n$ samples is the number of symbols that appeared exactly once divided by $n$. For i.i.d. samples, the bias and squared-error risk of the GT estimator can be shown to fall as $1/n$ by bounding the expected error uniformly over all symbols. In this work, we study convergence of the GT estimator for missing stationary mass (i.e., total stationary probability of missing symbols) of Markov samples on an alphabet $\mathcal{X}$ with stationary distribution $[π_x:x \in \mathcal{X}]$ and transition probability matrix (t.p.m.) $P$. This is an important and interesting problem because GT is widely used in applications with temporal dependencies such as language models assigning probabilities to word sequences, which are modelled as Markov. We show that convergence of GT depends on convergence of $(P^{\sim x})^n$, where $P^{\sim x}$ is $P$ with the $x$-th column zeroed out. This, in turn, depends on the Perron eigenvalue $λ^{\sim x}$ of $P^{\sim x}$ and its relationship with $π_x$ uniformly over $x$. For randomly generated t.p.ms and t.p.ms derived from New York Times and Charles Dickens corpora, we numerically exhibit such uniform-over-$x$ relationships between $λ^{\sim x}$ and $π_x$. This supports the observed success of GT in language models and practical text data scenarios. For Markov chains with rank-2, diagonalizable t.p.ms having spectral gap $β$, we show minimax rate upper and lower bounds of $1/(nβ^5)$ and $1/(nβ)$, respectively, for the estimation of stationary missing mass. This theoretical result extends the $1/n$ minimax rate for i.i.d. or rank-1 t.p.ms to rank-2 Markov, and is a first such minimax rate result for missing mass of Markov samples.
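The definition in the first sentence of this abstract translates directly into code. The sketch below computes the Good-Turing missing-mass estimate as the number of symbols seen exactly once divided by $n$; the function name `good_turing_missing_mass` is an assumption for illustration, and the sketch does not attempt to reproduce the Markov-specific analysis.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate of the missing mass of a sample sequence.

    Returns (number of distinct symbols appearing exactly once) / n,
    where n is the number of samples.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# In "abracadabra" only 'c' and 'd' appear exactly once and n = 11,
# so the estimate is 2/11.
estimate = good_turing_missing_mass(list("abracadabra"))
```

The same function applies unchanged to a sample path of a Markov chain, which is the setting whose convergence behavior the abstract studies.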
Submitted 27 May, 2023; v1 submitted 3 February, 2021;
originally announced February 2021.
-
FastSecAgg: Scalable Secure Aggregation for Privacy-Preserving Federated Learning
Authors:
Swanand Kadhe,
Nived Rajaraman,
O. Ozan Koyluoglu,
Kannan Ramchandran
Abstract:
Recent attacks on federated learning demonstrate that keeping the training data on clients' devices does not provide sufficient privacy, as the model parameters shared by clients can leak information about their training data. A 'secure aggregation' protocol enables the server to aggregate clients' models in a privacy-preserving manner. However, existing secure aggregation protocols incur high computation/communication costs, especially when the number of model parameters is larger than the number of clients participating in an iteration -- a typical scenario in federated learning.
In this paper, we propose a secure aggregation protocol, FastSecAgg, that is efficient in terms of computation and communication, and robust to client dropouts. The main building block of FastSecAgg is a novel multi-secret sharing scheme, FastShare, based on the Fast Fourier Transform (FFT), which may be of independent interest. FastShare is information-theoretically secure, and achieves a trade-off between the number of secrets, privacy threshold, and dropout tolerance. Riding on the capabilities of FastShare, we prove that FastSecAgg is (i) secure against the server colluding with 'any' subset of some constant fraction (e.g. $\sim10\%$) of the clients in the honest-but-curious setting; and (ii) tolerates dropouts of a 'random' subset of some constant fraction (e.g. $\sim10\%$) of the clients. FastSecAgg achieves significantly smaller computation cost than existing schemes while achieving the same (orderwise) communication cost. In addition, it guarantees security against adaptive adversaries, which can perform client corruptions dynamically during the execution of the protocol.
Submitted 23 September, 2020;
originally announced September 2020.
-
Toward the Fundamental Limits of Imitation Learning
Authors:
Nived Rajaraman,
Lin F. Yang,
Jiantao Jiao,
Kannan Ramchandran
Abstract:
Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations. In this paper, we focus on understanding the minimax statistical limits of IL in episodic Markov Decision Processes (MDPs). We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP. Here, we show that the policy which mimics the expert whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log (N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary stochastic policy. Here $\mathcal{S}$ is the state space, and $H$ is the length of the episode. Furthermore, we establish a suboptimality lower bound of $\gtrsim |\mathcal{S}| H^2 / N$ which applies even if the expert is constrained to be deterministic, or if the learner is allowed to actively query the expert at visited states while interacting with the MDP for $N$ episodes. To our knowledge, this is the first algorithm with suboptimality having no dependence on the number of actions, under no additional assumptions. We then propose a novel algorithm based on minimum-distance functionals in the setting where the transition model is given and the expert is deterministic. The algorithm is suboptimal by $\lesssim \min \{ H \sqrt{|\mathcal{S}| / N} ,\ |\mathcal{S}| H^{3/2} / N \}$, showing that knowledge of transition improves the minimax rate by at least a $\sqrt{H}$ factor.
Submitted 13 September, 2020;
originally announced September 2020.
-
Not Just Age but Age and Quality of Information
Authors:
Nived Rajaraman,
Rahul Vaze,
Goonwanth Reddy
Abstract:
A versatile scheduling problem to model a three-way tradeoff between delay/age, distortion, and energy is considered. The considered problem, called the age and quality of information (AQI), is to select which packets to transmit at each time slot to minimize a linear combination of the distortion cost, the age/delay cost, and the energy transmission cost in an online fashion. AQI generalizes multiple important problems such as age of information (AoI), the remote estimation problem with sampling constraint, and the classical speed scaling problem, among others. The worst-case input model is considered, where the performance metric is the competitive ratio. A greedy algorithm is proposed that is shown to be 2-competitive, independent of all parameters of the problem. For a special case of the AQI problem, a greedy online maximum weight matching based algorithm is also shown to be 2-competitive.
△ Less
Submitted 20 December, 2018;
originally announced December 2018.
-
Submodular Maximization Under A Matroid Constraint: Asking more from an old friend, the Greedy Algorithm
Authors:
Nived Rajaraman,
Rahul Vaze
Abstract:
The classical problem of maximizing a submodular function under a matroid constraint is considered. Defining a new measure for the increments made by the greedy algorithm at each step, called the discriminant, improved approximation ratio guarantees are derived for the greedy algorithm. At each step, the discriminant measures the multiplicative gap in the incremental valuation between the item chosen by the greedy algorithm and the largest potential incremental valuation for eligible items not selected by it. The new guarantee subsumes all the previously known results for the greedy algorithm, including the curvature-based ones, and the derived guarantees are shown to be tight by constructing specific instances. A more refined approximation guarantee, which is also tight, is derived for a special case called the submodular welfare maximization/partition problem, for both the offline and the online case.
Submitted 30 October, 2018;
originally announced October 2018.
-
Minimax Risk for Missing Mass Estimation
Authors:
Nikhilesh Rajaraman,
Andrew Thangaraj,
Ananda Theertha Suresh
Abstract:
The problem of estimating the missing mass or total probability of unseen elements in a sequence of $n$ random samples is considered under the squared error loss function. The worst-case risk of the popular Good-Turing estimator is shown to be between $0.6080/n$ and $0.6179/n$. The minimax risk is shown to be lower bounded by $0.25/n$. This appears to be the first such published result on minimax risk for estimation of missing mass, which has several practical and theoretical applications.
Submitted 14 May, 2017;
originally announced May 2017.