Showing 1–30 of 30 results for author: Zimmert, J

  1. arXiv:2506.02980  [pdf, ps, other]

    stat.ML cs.LG

    Non-stationary Bandit Convex Optimization: A Comprehensive Study

    Authors: Xiaoqi Liu, Dorian Baudry, Julian Zimmert, Patrick Rebeschini, Arya Akhavan

    Abstract: Bandit Convex Optimization is a fundamental class of sequential decision-making problems, where the learner selects actions from a continuous domain and observes a loss (but not its gradient) at only one point per round. We study this problem in non-stationary environments, and aim to minimize the regret under three standard measures of non-stationarity: the number of switches $S$ in the comparato…

    Submitted 3 June, 2025; originally announced June 2025.

    Comments: 32 pages, 1 figure

  2. arXiv:2502.05974  [pdf, ps, other]

    cs.LG stat.ML

    Decision Making in Hybrid Environments: A Model Aggregation Approach

    Authors: Haolin Liu, Chen-Yu Wei, Julian Zimmert

    Abstract: Recent work by Foster et al. (2021, 2022, 2023b) and Xu and Zeevi (2023) developed the framework of decision estimation coefficient (DEC) that characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regi…

    Submitted 30 April, 2025; v1 submitted 9 February, 2025; originally announced February 2025.

  3. arXiv:2502.02430  [pdf, other]

    stat.ML cs.IR cs.LG

    A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals

    Authors: Róbert Busa-Fekete, Julian Zimmert, András György, Linhai Qiu, Tzu-Wei Sung, Hao Shen, Hyomin Choi, Sharmila Subramaniam, Li Xiao

    Abstract: Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given a limited bandwidth available to the crawler. Under the assumption that the change and request events, respectively, to each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we…

    Submitted 20 March, 2025; v1 submitted 4 February, 2025; originally announced February 2025.

  4. arXiv:2411.06739  [pdf, other]

    cs.LG

    Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

    Authors: Haolin Liu, Zakaria Mhammedi, Chen-Yu Wei, Julian Zimmert

    Abstract: We consider regret minimization in low-rank MDPs with fixed transition and adversarial losses. Previous work has investigated this problem under either full-information loss feedback with unknown transitions (Zhao et al., 2024), or bandit loss feedback with known transition (Foster et al., 2022). First, we improve the $\mathrm{poly}(d, A, H)T^{5/6}$ regret bound of Zhao et al. (2024) to…

    Submitted 11 November, 2024; originally announced November 2024.

    Comments: NeurIPS 2024

  5. arXiv:2405.06480  [pdf, ps, other]

    cs.LG cs.GT

    Incentive-compatible Bandits: Importance Weighting No More

    Authors: Julian Zimmert, Teodor V. Marinov

    Abstract: We study the problem of incentive-compatible online learning with bandit feedback. In this class of problems, the experts are self-interested agents who might misrepresent their preferences with the goal of being selected most often. The goal is to devise algorithms which are simultaneously incentive-compatible, that is, the experts are incentivised to report their true preferences, and have no reg…

    Submitted 10 May, 2024; originally announced May 2024.

  6. arXiv:2401.01857  [pdf, ps, other]

    cs.LG stat.ML

    Optimal cross-learning for contextual bandits with unknown context distributions

    Authors: Jon Schneider, Julian Zimmert

    Abstract: We consider the problem of designing contextual bandit algorithms in the "cross-learning" setting of Balseiro et al., where the learner observes the loss for the action they play in all possible contexts, not just the context of the current round. We specifically consider the setting where losses are chosen adversarially and contexts are sampled i.i.d. from an unknown distribution. In this setti…

    Submitted 3 January, 2024; originally announced January 2024.

    Comments: Appeared at NeurIPS 2023

  7. arXiv:2310.11550  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

    Authors: Haolin Liu, Chen-Yu Wei, Julian Zimmert

    Abstract: We study online reinforcement learning in linear Markov decision processes with adversarial losses and bandit feedback, without prior knowledge on transitions or access to simulators. We introduce two algorithms that achieve improved regret performance compared to existing approaches. The first algorithm, although computationally inefficient, ensures a regret of…

    Submitted 17 October, 2023; originally announced October 2023.

  8. arXiv:2309.00814  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Bypassing the Simulator: Near-Optimal Adversarial Linear Contextual Bandits

    Authors: Haolin Liu, Chen-Yu Wei, Julian Zimmert

    Abstract: We consider the adversarial linear contextual bandit problem, where the loss vectors are selected fully adversarially and the per-round action set (i.e. the context) is drawn from a fixed distribution. Existing methods for this problem either require access to a simulator to generate free i.i.d. contexts, achieve a sub-optimal regret no better than $\widetilde{O}(T^{\frac{5}{6}})$, or are computat…

    Submitted 1 September, 2023; originally announced September 2023.

  9. arXiv:2308.10675  [pdf, ps, other]

    cs.LG stat.ML

    A Best-of-both-worlds Algorithm for Bandits with Delayed Feedback with Robustness to Excessive Delays

    Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

    Abstract: We propose a new best-of-both-worlds algorithm for bandits with variably delayed feedback. In contrast to prior work, which required prior knowledge of the maximal delay $d_{\mathrm{max}}$ and had a linear dependence of the regret on it, our algorithm can tolerate arbitrary excessive delays up to order $T$ (where $T$ is the time horizon). The algorithm is based on three technical innovations, whic…

    Submitted 27 May, 2024; v1 submitted 21 August, 2023; originally announced August 2023.

  10. arXiv:2302.09739  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    A Blackbox Approach to Best of Both Worlds in Bandits and Beyond

    Authors: Christoph Dann, Chen-Yu Wei, Julian Zimmert

    Abstract: Best-of-both-worlds algorithms for online learning which achieve near-optimal regret in both the adversarial and the stochastic regimes have received growing attention recently. Existing techniques often require careful adaptation to every new problem setup, including specialised potentials and careful tuning of algorithm parameters. Yet, in domains such as linear bandits, it is still unknown if t…

    Submitted 19 February, 2023; originally announced February 2023.

  11. arXiv:2302.09408  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Best of Both Worlds Policy Optimization

    Authors: Christoph Dann, Chen-Yu Wei, Julian Zimmert

    Abstract: Policy optimization methods are popular reinforcement learning algorithms in practice. Recent works have built a theoretical foundation for them by proving $\sqrt{T}$ regret bounds even when the losses are adversarial. Such bounds are tight in the worst case but often overly pessimistic. In this work, we show that in tabular Markov decision processes (MDPs), by properly designing the regularizer, th…

    Submitted 18 February, 2023; originally announced February 2023.

  12. arXiv:2301.12942  [pdf, ps, other]

    cs.LG cs.AI stat.ML

    Refined Regret for Adversarial MDPs with Linear Function Approximation

    Authors: Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

    Abstract: We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes and the state space can be arbitrarily large. We assume that the Q-function of any policy is linear in some known features, that is, a linear function approximation exists. The best existing regret upper bound for this setting (Luo et al., 2021) is of order…

    Submitted 1 June, 2023; v1 submitted 30 January, 2023; originally announced January 2023.

    Comments: Accepted to ICML 2023

  13. arXiv:2210.09255  [pdf, ps, other]

    cs.LG stat.ML

    A Unified Algorithm for Stochastic Path Problems

    Authors: Christoph Dann, Chen-Yu Wei, Julian Zimmert

    Abstract: We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP)…

    Submitted 17 October, 2022; originally announced October 2022.

  14. arXiv:2208.10904  [pdf, ps, other]

    cs.LG

    A Provably Efficient Model-Free Posterior Sampling Method for Episodic Reinforcement Learning

    Authors: Christoph Dann, Mehryar Mohri, Tong Zhang, Julian Zimmert

    Abstract: Thompson Sampling is one of the most effective methods for contextual bandits and has been generalized to posterior sampling for certain MDP settings. However, existing posterior sampling methods for reinforcement learning are limited by being model-based or lack worst-case theoretical guarantees beyond linear MDPs. This paper proposes a new model-free formulation of posterior sampling that applie…

    Submitted 23 August, 2022; originally announced August 2022.

    Journal ref: Dann C, Mohri M, Zhang T, Zimmert J. A provably efficient model-free posterior sampling method for episodic reinforcement learning. Advances in Neural Information Processing Systems. 2021 Dec 6;34:12040-51

  15. arXiv:2206.14906  [pdf, ps, other]

    cs.LG stat.ML

    A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback

    Authors: Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

    Abstract: We present a modified tuning of the algorithm of Zimmert and Seldin [2020] for adversarial multiarmed bandits with delayed feedback, which in addition to the minimax optimal adversarial regret guarantee shown by Zimmert and Seldin simultaneously achieves a near-optimal regret guarantee in the stochastic setting with fixed delays. Specifically, the adversarial regret guarantee is…

    Submitted 29 June, 2022; originally announced June 2022.

  16. arXiv:2206.10022  [pdf, other]

    cs.LG

    Stochastic Online Learning with Feedback Graphs: Finite-Time and Asymptotic Optimality

    Authors: Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

    Abstract: We revisit the problem of stochastic online learning with feedback graphs, with the goal of devising algorithms that are optimal, up to constants, both asymptotically and in finite time. We show that, surprisingly, the notion of optimal finite-time regret is not a uniquely defined property in this context and that, in general, it is decoupled from the asymptotic rate. We discuss alternative choice…

    Submitted 20 June, 2022; originally announced June 2022.

  17. arXiv:2202.02765  [pdf, ps, other]

    cs.LG stat.ML

    Pushing the Efficiency-Regret Pareto Frontier for Online Learning of Portfolios and Quantum States

    Authors: Julian Zimmert, Naman Agarwal, Satyen Kale

    Abstract: We revisit the classical online portfolio selection problem. It is widely assumed that a trade-off between computational complexity and regret is unavoidable, with Cover's Universal Portfolios algorithm, SOFT-BAYES and ADA-BARRONS currently constituting its state-of-the-art Pareto frontier. In this paper, we present the first efficient algorithm, BISONS, that obtains polylogarithmic regret with me…

    Submitted 6 February, 2022; originally announced February 2022.

  18. arXiv:2110.13282  [pdf, ps, other]

    cs.LG

    The Pareto Frontier of model selection for general Contextual Bandits

    Authors: Teodor V. Marinov, Julian Zimmert

    Abstract: Recent progress in model selection raises the question of the fundamental limits of these techniques. Under specific scrutiny has been model selection for general contextual bandits with nested policy classes, resulting in a COLT2020 open problem. It asks whether it is possible to obtain simultaneously the optimal single algorithm guarantees over all policies in a nested sequence of policy classes…

    Submitted 25 October, 2021; originally announced October 2021.

  19. arXiv:2110.03580  [pdf, ps, other]

    cs.LG stat.ML

    A Model Selection Approach for Corruption Robust Reinforcement Learning

    Authors: Chen-Yu Wei, Christoph Dann, Julian Zimmert

    Abstract: We develop a model selection approach to tackle reinforcement learning with adversarial corruption in both transition and reward. For finite-horizon tabular MDPs, without prior knowledge on the total amount of corruption, our algorithm achieves a regret bound of $\widetilde{\mathcal{O}}(\min\{\frac{1}{\Delta}, \sqrt{T}\}+C)$ where $T$ is the number of episodes, $C$ is the total amount of corruption, and…

    Submitted 29 December, 2024; v1 submitted 7 October, 2021; originally announced October 2021.

  20. arXiv:2110.03020  [pdf, ps, other]

    cs.LG stat.ML

    Efficient Methods for Online Multiclass Logistic Regression

    Authors: Naman Agarwal, Satyen Kale, Julian Zimmert

    Abstract: Multiclass logistic regression is a fundamental task in machine learning with applications in classification and boosting. Previous work (Foster et al., 2018) has highlighted the importance of improper predictors for achieving "fast rates" in the online multiclass logistic regression problem without suffering exponentially from secondary problem parameters, such as the norm of the predictors in th…

    Submitted 10 October, 2021; v1 submitted 6 October, 2021; originally announced October 2021.

  21. arXiv:2107.05745  [pdf, ps, other]

    cs.LG stat.ML

    Adapting to Misspecification in Contextual Bandits

    Authors: Dylan J. Foster, Claudio Gentile, Mehryar Mohri, Julian Zimmert

    Abstract: A major research direction in contextual bandits is to develop algorithms that are computationally efficient, yet support flexible, general-purpose function approximation. Algorithms based on modeling rewards have shown strong empirical performance, but typically require a well-specified model, and can fail when this assumption does not hold. Can we design algorithms that are efficient and flexibl…

    Submitted 12 July, 2021; originally announced July 2021.

    Comments: Appeared at NeurIPS 2020

  22. arXiv:2107.01264  [pdf, other]

    cs.LG

    Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning

    Authors: Christoph Dann, Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

    Abstract: We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, in order to achieve a favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy.…

    Submitted 26 October, 2021; v1 submitted 2 July, 2021; originally announced July 2021.

  23. arXiv:2003.01704  [pdf, other]

    cs.LG stat.ML

    Model Selection in Contextual Stochastic Bandit Problems

    Authors: Aldo Pacchiano, My Phan, Yasin Abbasi-Yadkori, Anup Rao, Julian Zimmert, Tor Lattimore, Csaba Szepesvari

    Abstract: We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base algorithm abstraction that can work with general classes of base algorithms and different types of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that…

    Submitted 4 December, 2022; v1 submitted 3 March, 2020; originally announced March 2020.

    Comments: 33 main pages, 15 appendix pages

  24. arXiv:2002.12014  [pdf, other]

    cs.LG stat.ML

    Online Learning for Active Cache Synchronization

    Authors: Andrey Kolobov, Sébastien Bubeck, Julian Zimmert

    Abstract: Existing multi-armed bandit (MAB) models make two implicit assumptions: an arm generates a payoff only when it is played, and the agent observes every payoff that is generated. This paper introduces synchronization bandits, a MAB variant where all arms generate costs at all times, but the agent observes an arm's instantaneous cost only when the arm is played. Synchronization MABs are inspired by o…

    Submitted 21 August, 2020; v1 submitted 27 February, 2020; originally announced February 2020.

  25. arXiv:1910.06054  [pdf, ps, other]

    cs.LG stat.ML

    An Optimal Algorithm for Adversarial Bandits with Arbitrary Delays

    Authors: Julian Zimmert, Yevgeny Seldin

    Abstract: We propose a new algorithm for adversarial multi-armed bandits with unrestricted delays. The algorithm is based on a novel hybrid regularizer applied in the Follow the Regularized Leader (FTRL) framework. It achieves $\mathcal{O}(\sqrt{kn}+\sqrt{D\log(k)})$ regret guarantee, where $k$ is the number of arms, $n$ is the number of rounds, and $D$ is the total delay. The result matches the lower bound…

    Submitted 16 June, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

  26. arXiv:1905.11817  [pdf, other]

    cs.LG stat.ML

    Connections Between Mirror Descent, Thompson Sampling and the Information Ratio

    Authors: Julian Zimmert, Tor Lattimore

    Abstract: The information-theoretic analysis by Russo and Van Roy (2014) in combination with minimax duality has proved a powerful tool for the analysis of online learning algorithms in full and partial information settings. In most applications there is a tantalising similarity to the classical analysis based on mirror descent. We make a formal connection, showing that the information-theoretic bounds in m…

    Submitted 28 May, 2019; originally announced May 2019.

  27. arXiv:1901.08779  [pdf, other]

    cs.LG stat.ML

    Beating Stochastic and Adversarial Semi-bandits Optimally and Simultaneously

    Authors: Julian Zimmert, Haipeng Luo, Chen-Yu Wei

    Abstract: We develop the first general semi-bandit algorithm that simultaneously achieves $\mathcal{O}(\log T)$ regret for stochastic environments and $\mathcal{O}(\sqrt{T})$ regret for adversarial environments without knowledge of the regime or the number of rounds $T$. The leading problem-dependent constants of our bounds are not only optimal in some worst-case sense studied previously, but also optimal f…

    Submitted 26 September, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

  28. arXiv:1807.07623  [pdf, other]

    cs.LG stat.ML

    Tsallis-INF: An Optimal Algorithm for Stochastic and Adversarial Bandits

    Authors: Julian Zimmert, Yevgeny Seldin

    Abstract: We derive an algorithm that achieves the optimal (within constants) pseudo-regret in both adversarial and stochastic multi-armed bandits without prior knowledge of the regime and time horizon. The algorithm is based on online mirror descent (OMD) with Tsallis entropy regularization with power $\alpha=1/2$ and reduced-variance loss estimators. More generally, we define an adversarial regime with a self-…

    Submitted 2 March, 2022; v1 submitted 19 July, 2018; originally announced July 2018.

  29. arXiv:1807.01488  [pdf, ps, other]

    cs.LG stat.ML

    Factored Bandits

    Authors: Julian Zimmert, Yevgeny Seldin

    Abstract: We introduce the factored bandits model, which is a framework for learning with limited (bandit) feedback, where actions can be decomposed into a Cartesian product of atomic actions. Factored bandits incorporate rank-1 bandits as a special case, but significantly relax the assumptions on the form of the reward function. We provide an anytime algorithm for stochastic factored bandits and up to cons…

    Submitted 29 October, 2018; v1 submitted 4 July, 2018; originally announced July 2018.

  30. Distributed Optimization of Multi-Class SVMs

    Authors: Maximilian Alber, Julian Zimmert, Urun Dogan, Marius Kloft

    Abstract: Training of one-vs.-rest SVMs can be parallelized over the number of classes in a straightforward way. Given enough computational resources, one-vs.-rest SVMs can thus be trained on data involving a large number of classes. The same cannot be stated, however, for the so-called all-in-one SVMs, which require solving a quadratic program of size quadratic in the number of classes. We develop dis…

    Submitted 8 December, 2016; v1 submitted 25 November, 2016; originally announced November 2016.