-
The Data-Driven Censored Newsvendor Problem
Authors:
Chamsi Hssaine,
Sean R. Sinclair
Abstract:
We study a censored variant of the data-driven newsvendor problem, where the decision-maker must select an ordering quantity that minimizes expected overage and underage costs based only on offline censored sales data, rather than historical demand realizations. Our goal is to understand how the degree of historical demand censoring affects the performance of any learning algorithm for this problem. To isolate this impact, we adopt a distributionally robust optimization framework, evaluating policies according to their worst-case regret over an ambiguity set of distributions. This set is defined by the largest historical order quantity (the observable boundary of the dataset), and contains all distributions matching the true demand distribution up to this boundary, while allowing them to be arbitrary afterwards. We demonstrate a spectrum of achievability under demand censoring by deriving a natural necessary and sufficient condition under which vanishing regret is an achievable goal. In regimes in which it is not, we exactly characterize the information loss due to censoring: an insurmountable lower bound on the performance of any policy, even when the decision-maker has access to infinitely many demand samples. We then leverage these sharp characterizations to propose a natural robust algorithm that adapts to the historical level of demand censoring. We derive finite-sample guarantees for this algorithm across all possible censoring regimes and show its near-optimality with matching lower bounds (up to polylogarithmic factors). We moreover demonstrate its robust performance via extensive numerical experiments on both synthetic and real-world datasets.
Submitted 18 December, 2024; v1 submitted 2 December, 2024;
originally announced December 2024.
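For orientation, the classical uncensored newsvendor orders the critical-ratio quantile of demand, b / (b + h) for underage cost b and overage cost h. The sketch below is a textbook illustration, not the algorithm proposed in the paper: it estimates that quantile from samples and shows how sales censored at a historical order quantity hide everything above that boundary. All function names and numbers are hypothetical.

```python
import math


def empirical_quantile(samples, level):
    """Smallest sample value whose empirical CDF is at least `level`."""
    ordered = sorted(samples)
    k = max(1, math.ceil(level * len(ordered)))
    return ordered[k - 1]


def newsvendor_order(samples, underage, overage):
    """Sample-average newsvendor order: the critical-ratio quantile."""
    ratio = underage / (underage + overage)
    return empirical_quantile(samples, ratio)


def censor(demands, order_qty):
    """Sales observed when at most `order_qty` units were in stock."""
    return [min(d, order_qty) for d in demands]


# Hypothetical demand history and a historical order quantity of 6.
demands = [3, 7, 5, 9, 4, 8, 6, 10, 2, 1]
sales = censor(demands, 6)

# High critical ratio (0.75): the target quantile lies above the boundary,
# so the estimate from censored sales is capped at the boundary.
full_high = newsvendor_order(demands, underage=3, overage=1)
censored_high = newsvendor_order(sales, underage=3, overage=1)

# Low critical ratio (0.5): the target quantile lies below the boundary,
# so censored sales give the same answer as true demand.
full_low = newsvendor_order(demands, underage=1, overage=1)
censored_low = newsvendor_order(sales, underage=1, overage=1)
```

In this toy data the censored estimate agrees with the uncensored one only when the target quantile sits below the largest historical order. That is the intuition behind the observable boundary in the abstract; the paper's exact condition may differ from this simplified picture.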
-
Exploiting Exogenous Structure for Sample-Efficient Reinforcement Learning
Authors:
Jia Wan,
Sean R. Sinclair,
Devavrat Shah,
Martin J. Wainwright
Abstract:
We study Exo-MDPs, a structured class of Markov Decision Processes (MDPs) where the state space is partitioned into exogenous and endogenous components. Exogenous states evolve stochastically, independent of the agent's actions, while endogenous states evolve deterministically based on both state components and actions. Exo-MDPs are useful for applications including inventory control, portfolio management, and ride-sharing. Our first result is structural, establishing a representational equivalence between the classes of discrete MDPs, Exo-MDPs, and discrete linear mixture MDPs. Specifically, any discrete MDP can be represented as an Exo-MDP, and the transition and reward dynamics can be written as linear functions of the exogenous state distribution, showing that Exo-MDPs are instances of linear mixture MDPs. For unobserved exogenous states, we prove a regret upper bound of $O(H^{3/2}d\sqrt{K})$ over $K$ trajectories of horizon $H$, with $d$ as the size of the exogenous state space, and establish nearly-matching lower bounds. Our findings demonstrate how Exo-MDPs decouple sample complexity from action and endogenous state sizes, and we validate our theoretical insights with experiments on inventory control.
Submitted 5 February, 2025; v1 submitted 22 September, 2024;
originally announced September 2024.
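As a hedged illustration of the linearity claim, assume (this is not stated in the abstract) that the exogenous state $\xi$ is drawn i.i.d. from an unknown distribution $p$ over $d$ values, and that the endogenous state $s$ moves by a known deterministic map $f$. The symbols $f$ and $\phi$ are hypothetical notation introduced only for this sketch.

```latex
P(s' \mid s, a)
  = \sum_{\xi=1}^{d} p(\xi)\, \mathbf{1}\{ f(s, a, \xi) = s' \}
  = \langle p,\ \phi(s' \mid s, a) \rangle,
\qquad
\phi_{\xi}(s' \mid s, a) = \mathbf{1}\{ f(s, a, \xi) = s' \}.
```

Under these assumptions the transition kernel is linear in the $d$-dimensional vector $p$, which is the only unknown. This is consistent with a regret bound that scales with $d$ rather than with the number of actions or endogenous states.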
-
Multi-Objective LQR with Linear Scalarization
Authors:
Ali Jadbabaie,
Devavrat Shah,
Sean R. Sinclair
Abstract:
The framework of decision-making, modeled as a Markov Decision Process (MDP), typically assumes a single objective. However, practical scenarios often involve tradeoffs between multiple objectives. We address this in the Linear Quadratic Regulator (LQR), a canonical continuous, infinite horizon MDP. First, we establish that the Pareto front for LQR is characterized by linear scalarization: a convex combination of objectives recovers all tradeoff points, making multi-objective LQR reducible to single-objective problems. This highlights an important instance where linear scalarization suffices for a non-convex problem. Second, we show the Pareto front is smooth, in that an $ε$ perturbation of a scalarization parameter yields an $ε$ approximation to the objective. These results inspire a simple algorithm to approximate the Pareto front via grid search over scalarization parameters, where each optimization problem retains the computational efficiency of single-objective LQR. Lastly, we extend the analysis to certainty equivalence, where unknown dynamics are replaced with estimates.
Submitted 15 January, 2025; v1 submitted 8 August, 2024;
originally announced August 2024.
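The grid search over scalarization parameters can be sketched on a toy problem. The example below assumes a deterministic, discrete-time, scalar system x_{t+1} = a x_t + b u_t with two objectives, cumulative state cost (sum of x_t^2) and cumulative control cost (sum of u_t^2). These modeling choices and all names are assumptions made for illustration, not the setting or code of the paper. Each weight w defines a single-objective LQR with cost w x^2 + (1 - w) u^2, solved here by fixed-point iteration on the scalar Riccati equation.

```python
def solve_scalar_riccati(a, b, q, r, tol=1e-12, max_iter=100_000):
    """Iterate p <- q + a^2 p r / (r + b^2 p), starting from p = 0."""
    p = 0.0
    for _ in range(max_iter):
        p_next = q + a * a * p * r / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


def pareto_grid(a, b, weights, x0=1.0):
    """For each weight w in (0, 1), solve the scalarized LQR and
    return (w, state_cost, control_cost, p) under the policy u = -k x."""
    points = []
    for w in weights:
        q, r = w, 1.0 - w                    # scalarized cost weights
        p = solve_scalar_riccati(a, b, q, r)
        k = a * b * p / (r + b * b * p)      # optimal feedback gain
        c = a - b * k                        # closed-loop multiplier, |c| < 1
        state_cost = x0 * x0 / (1.0 - c * c)  # geometric sum of x_t^2
        control_cost = k * k * state_cost     # u_t = -k x_t
        points.append((w, state_cost, control_cost, p))
    return points


# Hypothetical unstable system and a uniform grid of weights.
weights = [i / 10 for i in range(1, 10)]
front = pareto_grid(a=1.2, b=1.0, weights=weights)
```

Each entry of `front` is one approximate trade-off point. Weights 0 and 1 are excluded because the sketch needs both q and r to be strictly positive.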
-
Online Fair Allocation of Perishable Resources
Authors:
Siddhartha Banerjee,
Chamsi Hssaine,
Sean R. Sinclair
Abstract:
We consider a practically motivated variant of the canonical online fair allocation problem: a decision-maker has a budget of perishable resources to allocate over a fixed number of rounds. Each round sees a random number of arrivals, and the decision-maker must commit to an allocation for these individuals before moving on to the next round. The goal is to construct a sequence of allocations that is envy-free and efficient. Our work makes two important contributions toward this problem: we first derive strong lower bounds on the optimal envy-efficiency trade-off that demonstrate that a decision-maker is fundamentally limited in what she can hope to achieve relative to the no-perishing setting; we then design an algorithm achieving these lower bounds which takes as input $(i)$ a prediction of the perishing order, and $(ii)$ a desired bound on envy. Given the remaining budget in each period, the algorithm uses forecasts of future demand and perishing to adaptively choose one of two carefully constructed guardrail quantities. We demonstrate our algorithm's strong numerical performance - and state-of-the-art, perishing-agnostic algorithms' inefficacy - on simulations calibrated to a real-world dataset.
Submitted 4 June, 2024;
originally announced June 2024.
-
Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits
Authors:
Siddhartha Banerjee,
Sean R. Sinclair,
Milind Tambe,
Lily Xu,
Christina Lee Yu
Abstract:
Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) - particularly for continuous action spaces. To address these challenges, we propose ArtificialReplay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. We show that ArtificialReplay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of ArtificialReplay for improving data efficiency, including for base algorithms that do not satisfy IIData.
Submitted 19 March, 2025; v1 submitted 30 September, 2022;
originally announced October 2022.
-
Hindsight Learning for MDPs with Exogenous Inputs
Authors:
Sean R. Sinclair,
Felipe Frujeri,
Ching-An Cheng,
Luke Marshall,
Hugo Barbalho,
Jingling Li,
Jennifer Neville,
Ishai Menache,
Adith Swaminathan
Abstract:
Many resource management problems require sequential decision-making under uncertainty, where the only uncertainty affecting the decision outcomes comes from exogenous variables outside the control of the decision-maker. We model these problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and design a class of data-efficient algorithms for them termed Hindsight Learning (HL). Our HL algorithms achieve data efficiency by leveraging a key insight: given samples of the exogenous variables, past decisions can be revisited in hindsight to infer counterfactual consequences that can accelerate policy improvements. We compare HL against classic baselines in the multi-secretary and airline revenue management problems. We also scale our algorithms to a business-critical cloud resource management problem, allocating Virtual Machines (VMs) to physical machines, and simulate their performance with real datasets from a large public cloud provider. We find that HL algorithms outperform domain-specific heuristics, as well as state-of-the-art reinforcement learning methods.
Submitted 23 October, 2023; v1 submitted 13 July, 2022;
originally announced July 2022.
-
Adaptive Discretization in Online Reinforcement Learning
Authors:
Sean R. Sinclair,
Siddhartha Banerjee,
Christina Lee Yu
Abstract:
Discretization-based approaches to solving online reinforcement learning problems have been studied extensively in practice on applications ranging from resource allocation to cache management. Two major questions in designing discretization-based algorithms are how to create the discretization and when to refine it. While there have been several experimental results investigating heuristic solutions to these questions, there has been little theoretical treatment. In this paper we provide a unified theoretical analysis of tree-based hierarchical partitioning methods for online reinforcement learning, providing model-free and model-based algorithms. We show how our algorithms are able to take advantage of the inherent structure of the problem by providing guarantees that scale with respect to the 'zooming dimension' instead of the ambient dimension, an instance-dependent quantity measuring the benignness of the optimal $Q_h^\star$ function.
Many applications in computing systems and operations research require algorithms that compete on three facets: low sample complexity, mild storage requirements, and low computational burden. Our algorithms are easily adapted to operating constraints, and our theory provides explicit bounds across each of the three facets. This motivates their use in practical applications, as our approach automatically adapts to underlying problem structure even when very little is known a priori about the system.
Submitted 10 October, 2022; v1 submitted 29 October, 2021;
originally announced October 2021.
-
Sequential Fair Allocation: Achieving the Optimal Envy-Efficiency Tradeoff Curve
Authors:
Sean R. Sinclair,
Gauri Jain,
Siddhartha Banerjee,
Christina Lee Yu
Abstract:
We consider the problem of dividing limited resources among individuals arriving over $T$ rounds. A random number of individuals arrives in each round, and individuals can be characterized by their type (i.e., their preferences over the different resources). A standard notion of 'fairness' in this setting is that an allocation simultaneously satisfies envy-freeness and efficiency. The former is an individual guarantee, requiring that each agent prefers their own allocation over the allocation of any other; in contrast, efficiency is a global property, requiring that the allocations clear the available resources. For divisible resources, when the number of individuals of each type is known upfront, the above desiderata are simultaneously achievable for a large class of utility functions. However, in an online setting where the number of individuals of each type is only revealed round by round, no policy can guarantee these desiderata simultaneously, and hence the best one can do is to allocate so as to approximately satisfy the two properties.
We show that in the online setting, the two desired properties (envy-freeness and efficiency) are in direct contention, in that any algorithm achieving additive counterfactual envy-freeness up to a factor of $L_T$ necessarily suffers an efficiency loss of at least $1 / L_T$. We complement this uncertainty principle with a simple algorithm, HopeGuardrail, which allocates resources based on an adaptive threshold policy and is able to achieve any fairness-efficiency point on this frontier. In simulation results, our algorithm provides allocations close to the optimal fair solution in hindsight, motivating its use in practical applications as the algorithm is able to adapt to any desired fairness-efficiency trade-off.
Submitted 29 September, 2022; v1 submitted 11 May, 2021;
originally announced May 2021.
-
Sequential Fair Allocation of Limited Resources under Stochastic Demands
Authors:
Sean R. Sinclair,
Gauri Jain,
Siddhartha Banerjee,
Christina Lee Yu
Abstract:
We consider the problem of dividing limited resources between a set of agents arriving sequentially with unknown (stochastic) utilities. Our goal is to find a fair allocation - one that is simultaneously Pareto-efficient and envy-free. When all utilities are known upfront, the above desiderata are simultaneously achievable (and efficiently computable) for a large class of utility functions. In a sequential setting, however, no policy can guarantee these desiderata simultaneously for all possible utility realizations. A natural online fair allocation objective is to minimize the deviation of each agent's final allocation from their fair allocation in hindsight. This translates into simultaneous guarantees for both Pareto-efficiency and envy-freeness. However, the resulting dynamic program has a state space that is exponential in the number of agents. We propose a simple policy, HopeOnline, that instead aims to 'match' the ex-post fair allocation vector using the currently available resources and a 'predicted' histogram of future utilities. We demonstrate the effectiveness of our policy compared to other heuristics on a dataset inspired by mobile food-bank allocations.
Submitted 9 July, 2022; v1 submitted 29 November, 2020;
originally announced November 2020.
-
Adaptive Discretization for Model-Based Reinforcement Learning
Authors:
Sean R. Sinclair,
Tianyu Wang,
Gauri Jain,
Siddhartha Banerjee,
Christina Lee Yu
Abstract:
We introduce the technique of adaptive discretization to design an efficient model-based episodic reinforcement learning algorithm in large (potentially continuous) state-action spaces. Our algorithm is based on optimistic one-step value iteration extended to maintain an adaptive discretization of the space. From a theoretical perspective we provide worst-case regret bounds for our algorithm which are competitive compared to the state-of-the-art model-based algorithms. Moreover, our bounds are obtained via a modular proof technique which can potentially extend to incorporate additional structure on the problem.
From an implementation standpoint, our algorithm has much lower storage and computational requirements due to maintaining a more efficient partition of the state and action spaces. We illustrate this via experiments on several canonical control problems, which show that our algorithm empirically performs significantly better than fixed discretization in terms of both faster convergence and lower memory usage. Interestingly, we observe empirically that while fixed-discretization model-based algorithms vastly outperform their model-free counterparts, the two achieve comparable performance with adaptive discretization.
Submitted 23 October, 2020; v1 submitted 1 July, 2020;
originally announced July 2020.
-
Adaptive Discretization for Episodic Reinforcement Learning in Metric Spaces
Authors:
Sean R. Sinclair,
Siddhartha Banerjee,
Christina Lee Yu
Abstract:
We present an efficient algorithm for model-free episodic reinforcement learning on large (potentially continuous) state-action spaces. Our algorithm is based on a novel $Q$-learning policy with adaptive data-driven discretization. The central idea is to maintain a finer partition of the state-action space in regions which are frequently visited in historical trajectories and have higher payoff estimates. We demonstrate how our adaptive partitions take advantage of the shape of the optimal $Q$-function and the joint space, without sacrificing the worst-case performance. In particular, we recover the regret guarantees of prior algorithms for continuous state-action spaces, which additionally require an optimal discretization as input and/or access to a simulation oracle. Moreover, experiments demonstrate how our algorithm automatically adapts to the underlying structure of the problem, resulting in much better performance compared both to heuristics and to $Q$-learning with uniform discretization.
Submitted 31 October, 2019; v1 submitted 17 October, 2019;
originally announced October 2019.
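The idea of keeping a finer partition where visits concentrate can be sketched in one dimension. The example below is a minimal illustration under stated assumptions, not the paper's algorithm: it partitions the interval [0, 1), ignores payoff estimates and $Q$-values entirely, and splits a cell once its visit count reaches 4^depth. That splitting threshold, like every name in the code, is a hypothetical choice made for this sketch.

```python
class Cell:
    """One interval [lo, hi) in a binary tree partition of [0, 1)."""

    def __init__(self, lo, hi, depth=0):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.visits = 0
        self.children = None  # None while the cell is a leaf

    def leaf_for(self, x):
        """Descend to the leaf cell containing x."""
        node = self
        while node.children is not None:
            mid = (node.lo + node.hi) / 2
            node = node.children[0] if x < mid else node.children[1]
        return node

    def visit(self, x):
        """Record a visit at x; split the leaf once it has been visited
        4**depth times (assumed rule: deeper cells need more visits)."""
        leaf = self.leaf_for(x)
        leaf.visits += 1
        if leaf.visits >= 4 ** leaf.depth:
            mid = (leaf.lo + leaf.hi) / 2
            leaf.children = (
                Cell(leaf.lo, mid, leaf.depth + 1),
                Cell(mid, leaf.hi, leaf.depth + 1),
            )
        return leaf


# Hypothetical usage: many visits near 0.3, a single visit near 0.9.
root = Cell(0.0, 1.0)
for _ in range(200):
    root.visit(0.3)
root.visit(0.9)

fine = root.leaf_for(0.3)    # leaf in the frequently visited region
coarse = root.leaf_for(0.9)  # leaf in the rarely visited region
```

After these visits the leaf around 0.3 is narrower than the leaf around 0.9, so resolution is spent where the data concentrates.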
-
Normal and pathological dynamics of platelets in humans
Authors:
Gabriel P. Langlois,
Morgan Craig,
Antony R. Humphries,
Michael C. Mackey,
Joseph M. Mahaffy,
Jacques Bélair,
Thibault Moulin,
Sean R. Sinclair,
Liangliang Wang
Abstract:
We develop a comprehensive mathematical model of platelet, megakaryocyte, and thrombopoietin dynamics in humans. We show that there is a single stationary solution that can undergo a Hopf bifurcation, and use this information to investigate both normal and pathological platelet production, specifically cyclic thrombocytopenia. Carefully estimating model parameters from laboratory and clinical data, we then argue that a subset of parameters is involved in the genesis of cyclic thrombocytopenia based on clinical information. We provide excellent model fits to the existing data for both platelet counts and thrombopoietin levels by changing six parameters that have physiological correlates. Our results indicate that the primary change in cyclic thrombocytopenia is a major interference with or destruction of the thrombopoietin receptor, with secondary changes in other processes, including immune-mediated destruction of platelets and megakaryocyte deficiency and failure in platelet production. This study makes a major contribution to the understanding of the origin of cyclic thrombocytopenia as well as significantly extending the modeling of thrombopoiesis.
Submitted 26 January, 2017; v1 submitted 29 July, 2016;
originally announced August 2016.