Showing 1–50 of 59 results for author: Kveton, B

Searching in archive stat.
  1. arXiv:2505.14826  [pdf, other]

    cs.LG cs.CL stat.ML

    FisherSFT: Data-Efficient Supervised Fine-Tuning of Language Models Using Information Gain

    Authors: Rohan Deb, Kiran Thekumparampil, Kousha Kalantari, Gaurush Hiranandani, Shoham Sabach, Branislav Kveton

    Abstract: Supervised fine-tuning (SFT) is a standard approach to adapting large language models (LLMs) to new domains. In this work, we improve the statistical efficiency of SFT by selecting an informative subset of training examples. Specifically, for a fixed budget of training examples, which determines the computational cost of fine-tuning, we determine the most informative ones. The key idea in our meth…

    Submitted 20 May, 2025; originally announced May 2025.

  2. arXiv:2503.01076  [pdf, other]

    cs.LG stat.ML

    Active Learning for Direct Preference Optimization

    Authors: Branislav Kveton, Xintong Li, Julian McAuley, Ryan Rossi, Jingbo Shang, Junda Wu, Tong Yu

    Abstract: Direct preference optimization (DPO) is a form of reinforcement learning from human feedback (RLHF) where the policy is learned directly from preferential feedback. Although many models of human preferences exist, the critical task of selecting the most informative feedback for training them is under-explored. We propose an active learning framework for DPO, which can be applied to collect human f…

    Submitted 2 March, 2025; originally announced March 2025.

  3. arXiv:2412.19396  [pdf, other]

    cs.LG cs.AI cs.IT math.OC stat.ML

    Comparing Few to Rank Many: Active Human Preference Learning using Randomized Frank-Wolfe

    Authors: Kiran Koshy Thekumparampil, Gaurush Hiranandani, Kousha Kalantari, Shoham Sabach, Branislav Kveton

    Abstract: We study learning of human preferences from limited comparison feedback. This task is ubiquitous in machine learning. Its applications, such as reinforcement learning from human feedback, have been transformational. We formulate this problem as learning a Plackett-Luce model over a universe of $N$ choices from $K$-way comparison feedback, where typically $K \ll N$. Our solution is the D-optimal d…

    Submitted 26 December, 2024; originally announced December 2024.

    Comments: Submitted to AISTATS 2025 on October 10, 2024

  4. arXiv:2410.03919  [pdf, other]

    cs.LG stat.ML

    Online Posterior Sampling with a Diffusion Prior

    Authors: Branislav Kveton, Boris Oreshkin, Youngsuk Park, Aniket Deshmukh, Rui Song

    Abstract: Posterior sampling in contextual bandits with a Gaussian prior can be implemented exactly or approximately using the Laplace approximation. The Gaussian prior is computationally efficient but it cannot describe complex distributions. In this work, we propose approximate posterior sampling algorithms for contextual bandits with a diffusion model prior. The key idea is to sample from a chain of appr…

    Submitted 4 October, 2024; originally announced October 2024.

    Comments: Proceedings of the 38th Conference on Neural Information Processing Systems

  5. arXiv:2406.10030  [pdf, other]

    cs.LG stat.ML

    Off-Policy Evaluation from Logged Human Feedback

    Authors: Aniruddha Bhargava, Lalit Jain, Branislav Kveton, Ge Liu, Subhojyoti Mukherjee

    Abstract: Learning from human feedback has been central to recent advances in artificial intelligence and machine learning. Since the collection of human feedback is costly, a natural question to ask is if the new feedback always needs to be collected. Or could we evaluate a new model with the human feedback on responses of another model? This motivates us to study off-policy evaluation from logged human feedb…

    Submitted 14 June, 2024; originally announced June 2024.

  6. arXiv:2310.18617  [pdf, other]

    cs.LG stat.ML

    Pessimistic Off-Policy Multi-Objective Optimization

    Authors: Shima Alizadeh, Aniruddha Bhargava, Karthick Gopalswamy, Lalit Jain, Branislav Kveton, Ge Liu

    Abstract: Multi-objective optimization is a type of decision-making problem where multiple conflicting objectives are optimized. We study offline optimization of multi-objective policies from data collected by an existing policy. We propose a pessimistic estimator for the multi-objective policy values that can be easily plugged into existing formulas for hypervolume computation and optimized. The estimator…

    Submitted 28 October, 2023; originally announced October 2023.

  7. arXiv:2306.09136  [pdf, other]

    cs.LG stat.ML

    Finite-Time Logarithmic Bayes Regret Upper Bounds

    Authors: Alexia Atsidakou, Branislav Kveton, Sumeet Katariya, Constantine Caramanis, Sujay Sanghavi

    Abstract: We derive the first finite-time logarithmic Bayes regret upper bounds for Bayesian bandits. In a multi-armed bandit, we obtain $O(c_Δ\log n)$ and $O(c_h \log^2 n)$ upper bounds for an upper confidence bound algorithm, where $c_h$ and $c_Δ$ are constants depending on the prior distribution and the gaps of bandit instances sampled from it, respectively. The latter bound asymptotically matches the lo…

    Submitted 21 January, 2024; v1 submitted 15 June, 2023; originally announced June 2023.

  8. arXiv:2306.07549  [pdf, other]

    cs.LG stat.ML

    Fixed-Budget Best-Arm Identification with Heterogeneous Reward Variances

    Authors: Anusha Lalitha, Kousha Kalantari, Yifei Ma, Anoop Deoras, Branislav Kveton

    Abstract: We study the problem of best-arm identification (BAI) in the fixed-budget setting with heterogeneous reward variances. We propose two variance-adaptive BAI algorithms for this setting: SHVar for known reward variances and SHAdaVar for unknown reward variances. Our algorithms rely on non-uniform budget allocations among the arms where the arms with higher reward variances are pulled more often than…

    Submitted 13 June, 2023; originally announced June 2023.

  9. arXiv:2301.05182  [pdf, other]

    cs.LG cs.AI stat.ML

    Thompson Sampling with Diffusion Generative Prior

    Authors: Yu-Guan Hsieh, Shiva Prasad Kasiviswanathan, Branislav Kveton, Patrick Blöbaum

    Abstract: In this work, we initiate the idea of using denoising diffusion models to learn priors for online decision making problems. Our special focus is on the meta-learning for bandit framework, with the goal of learning a strategy that performs well across bandit tasks of the same class. To this end, we train a diffusion model that learns the underlying task distribution and combine Thompson sampling with…

    Submitted 30 January, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

  10. arXiv:2211.08572  [pdf, other]

    cs.LG stat.ML

    Bayesian Fixed-Budget Best-Arm Identification

    Authors: Alexia Atsidakou, Sumeet Katariya, Sujay Sanghavi, Branislav Kveton

    Abstract: Fixed-budget best-arm identification (BAI) is a bandit problem where the agent maximizes the probability of identifying the optimal arm within a fixed budget of observations. In this work, we study this problem in the Bayesian setting. We propose a Bayesian elimination algorithm and derive an upper bound on its probability of misidentifying the optimal arm. The bound reflects the quality of the pr…

    Submitted 15 June, 2023; v1 submitted 15 November, 2022; originally announced November 2022.

  11. arXiv:2210.14483  [pdf, other]

    cs.LG stat.ML

    Robust Contextual Linear Bandits

    Authors: Rong Zhu, Branislav Kveton

    Abstract: Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned.…

    Submitted 26 October, 2022; originally announced October 2022.

  12. arXiv:2206.04091  [pdf, other]

    stat.ML cs.LG

    Uplifting Bandits

    Authors: Yu-Guan Hsieh, Shiva Prasad Kasiviswanathan, Branislav Kveton

    Abstract: We introduce a multi-armed bandit model where the reward is a sum of multiple random variables, and each action only alters the distributions of some of them. After each action, the agent observes the realizations of all the variables. This model is motivated by marketing campaigns and recommender systems, where the variables represent outcomes on individual customers, such as clicks. We propose U…

    Submitted 8 June, 2022; originally announced June 2022.

  13. arXiv:2205.15124  [pdf, other]

    cs.LG stat.ML

    Mixed-Effect Thompson Sampling

    Authors: Imad Aouali, Branislav Kveton, Sumeet Katariya

    Abstract: A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. To explore efficiently using this structure, we…

    Submitted 5 March, 2023; v1 submitted 30 May, 2022; originally announced May 2022.

  14. arXiv:2202.12888  [pdf, other]

    cs.LG cs.AI stat.ML

    Meta-Learning for Simple Regret Minimization

    Authors: Mohammadjavad Azizi, Branislav Kveton, Mohammad Ghavamzadeh, Sumeet Katariya

    Abstract: We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d. from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm h…

    Submitted 4 July, 2023; v1 submitted 25 February, 2022; originally announced February 2022.

  15. arXiv:2202.01454  [pdf, other]

    cs.LG stat.ML

    Deep Hierarchy in Bandits

    Authors: Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

    Abstract: Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented…

    Submitted 3 February, 2022; originally announced February 2022.

  16. arXiv:2109.07743  [pdf, other]

    cs.LG stat.ML

    Optimal Probing with Statistical Guarantees for Network Monitoring at Scale

    Authors: Muhammad Jehangir Amjad, Christophe Diot, Dimitris Konomis, Branislav Kveton, Augustin Soule, Xiaolong Yang

    Abstract: Cloud networks are difficult to monitor because they grow rapidly and the budgets for monitoring them are limited. We propose a framework for estimating network metrics, such as latency and packet loss, with guarantees on estimation errors for a fixed monitoring budget. Our proposed algorithms produce a distribution of probes across network paths, which we then monitor; and are based on A- and E-o…

    Submitted 16 September, 2021; originally announced September 2021.

  17. arXiv:2106.12200  [pdf, other]

    cs.LG stat.ML

    Random Effect Bandits

    Authors: Rong Zhu, Branislav Kveton

    Abstract: This paper studies regret minimization in a multi-armed bandit. It is well known that side information, such as the prior distribution of arm means in Thompson sampling, can improve the statistical efficiency of the bandit algorithm. While the prior is a blessing when correctly specified, it is a curse when misspecified. To address this issue, we introduce the assumption of a random-effect model t…

    Submitted 4 March, 2022; v1 submitted 23 June, 2021; originally announced June 2021.

    Comments: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics

  18. arXiv:2106.05608  [pdf, other]

    cs.LG cs.AI stat.ML

    Thompson Sampling with a Mixture Prior

    Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Mohammad Ghavamzadeh, Craig Boutilier

    Abstract: We study Thompson sampling (TS) in online decision making, where the uncertain environment is sampled from a mixture distribution. This is relevant in multi-task learning, where a learning agent faces different classes of problems. We incorporate this structure in a natural way by initializing TS with a mixture prior, and call the resulting algorithm MixTS. To analyze MixTS, we develop a novel and…

    Submitted 5 March, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

    Comments: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics

  19. arXiv:2103.04387  [pdf, other]

    cs.LG stat.ML

    CORe: Capitalizing On Rewards in Bandit Exploration

    Authors: Nan Wang, Branislav Kveton, Maryam Karimzadehgan

    Abstract: We propose a bandit algorithm that explores purely by randomizing its past observations. In particular, the sufficient optimism in the mean reward estimates is achieved by exploiting the variance in the past observed rewards. We name the algorithm Capitalizing On Rewards (CORe). The algorithm is general and can be easily applied to different bandit settings. The main benefit of CORe is that its ex…

    Submitted 7 March, 2021; originally announced March 2021.

  20. arXiv:2102.06129  [pdf, other]

    cs.LG stat.ML

    Meta-Thompson Sampling

    Authors: Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-wei Hsu, Martin Mladenov, Craig Boutilier, Csaba Szepesvari

    Abstract: Efficient exploration in bandits is a fundamental online learning problem. We propose a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior. The algorithm meta-learns the prior and thus we call it MetaTS. We propose several efficient implementations of MetaTS and analyze it in Gaussian bandits. Our analysis shows the benefit…

    Submitted 23 June, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: Proceedings of the 38th International Conference on Machine Learning

  21. arXiv:2007.04915  [pdf, other]

    cs.LG stat.ML

    Influence Diagram Bandits: Variational Thompson Sampling for Structured Bandit Problems

    Authors: Tong Yu, Branislav Kveton, Zheng Wen, Ruiyi Zhang, Ole J. Mengshoel

    Abstract: We propose a novel framework for structured bandits, which we call an influence diagram bandit. Our framework captures complex statistical dependencies between actions, latent variables, and observations; and thus unifies and extends many existing models, such as combinatorial semi-bandits, cascading bandits, and low-rank bandits. We develop novel online learning algorithms that learn to act effic…

    Submitted 9 July, 2020; originally announced July 2020.

  22. arXiv:2006.08714  [pdf, other]

    cs.LG cs.AI stat.ML

    Latent Bandits Revisited

    Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed, Craig Boutilier

    Abstract: A latent bandit problem is one in which the learning agent knows the arm reward distributions conditioned on an unknown discrete latent state. The primary goal of the agent is to identify the latent state, after which it can act optimally. This setting is a natural midpoint between online and offline learning---complex models can be learned offline with the agent identifying latent state online---…

    Submitted 15 June, 2020; originally announced June 2020.

    Comments: 16 pages, 2 figures

  23. arXiv:2006.08236  [pdf, other]

    cs.LG cs.AI stat.ML

    Non-Stationary Off-Policy Optimization

    Authors: Joey Hong, Branislav Kveton, Manzil Zaheer, Yinlam Chow, Amr Ahmed

    Abstract: Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution…

    Submitted 4 April, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: AISTATS 2021; 16 pages, 2 figures

  24. arXiv:2006.05094  [pdf, other]

    cs.LG stat.ML

    Meta-Learning Bandit Policies by Gradient Ascent

    Authors: Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, Craig Boutilier

    Abstract: Most bandit policies are designed to either minimize regret in any problem instance, making very few assumptions about the underlying environment, or in a Bayesian sense, assuming a prior distribution over environment parameters. The former are often too conservative in practical settings, while the latter require assumptions that are hard to verify in practice. We study bandit problems that fall…

    Submitted 5 January, 2021; v1 submitted 9 June, 2020; originally announced June 2020.

  25. arXiv:2006.02672  [pdf, other]

    cs.LG stat.ML

    Sample Efficient Graph-Based Optimization with Noisy Observations

    Authors: Tan Nguyen, Ali Shameli, Yasin Abbasi-Yadkori, Anup Rao, Branislav Kveton

    Abstract: We study sample complexity of optimizing "hill-climbing friendly" functions defined on a graph under noisy observations. We define a notion of convexity, and we show that a variant of best-arm identification can find a near-optimal solution after a small number of queries that is independent of the size of the graph. For functions that have local minima and are nearly convex, we show a sample comp…

    Submitted 4 June, 2020; originally announced June 2020.

    Comments: The first version of this paper appeared in AISTATS 2019. Thanks to community feedback, some typos and a minor issue have been identified. Specifically, on page 4, column 2, line 18, the statement $Δ_{1,s} \ge (1+m)^{S-1-s} Δ_1$ is not valid, and in the proof of Theorem 2, "By Lemma 1" should be "By Definition 2". These problems are fixed in this updated version published here on arXiv.

    Journal ref: AISTATS 2019

  26. arXiv:2002.06772  [pdf, other]

    cs.LG stat.ML

    Differentiable Bandit Exploration

    Authors: Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, Manzil Zaheer

    Abstract: Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution $\mathcal{P}$. In this work, we learn such policies for an unknown distribution $\mathcal{P}$ using samples from $\mathcal{P}$. Our approach is a form of meta-learning and exploits properties of $\mathcal{P}$ without making strong assumptions about its form. To do this, we param…

    Submitted 9 June, 2020; v1 submitted 17 February, 2020; originally announced February 2020.

  27. arXiv:1910.04928  [pdf, other]

    cs.LG stat.ML

    Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

    Authors: Sharan Vaswani, Abbas Mehrabian, Audrey Durand, Branislav Kveton

    Abstract: We propose $\tt RandUCB$, a bandit strategy that builds on theoretically derived confidence intervals similar to upper confidence bound (UCB) algorithms, but akin to Thompson sampling (TS), it uses randomization to trade off exploration and exploitation. In the $K$-armed bandit setting, we show that there are infinitely many variants of $\tt RandUCB$, all of which achieve the minimax-optimal…

    Submitted 22 March, 2020; v1 submitted 10 October, 2019; originally announced October 2019.

    Comments: AISTATS 2020

  28. arXiv:1906.08947  [pdf, other]

    cs.LG stat.ML

    Randomized Exploration in Generalized Linear Bandits

    Authors: Branislav Kveton, Manzil Zaheer, Csaba Szepesvari, Lihong Li, Mohammad Ghavamzadeh, Craig Boutilier

    Abstract: We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the num…

    Submitted 10 July, 2023; v1 submitted 21 June, 2019; originally announced June 2019.

    Comments: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics

  29. arXiv:1904.09404  [pdf, ps, other]

    cs.LG stat.ML

    Waterfall Bandits: Learning to Sell Ads Online

    Authors: Branislav Kveton, Saied Mahdian, S. Muthukrishnan, Zheng Wen, Yikun Xian

    Abstract: A popular approach to selling online advertising is by a waterfall, where a publisher makes sequential price offers to ad networks for an inventory, and chooses the winner in that order. The publisher picks the order and prices to maximize her revenue. A traditional solution is to learn the demand model and then subsequently solve the optimization problem for the given demand model. This will incu…

    Submitted 20 April, 2019; originally announced April 2019.

  30. arXiv:1904.02664  [pdf, other]

    cs.LG stat.ML

    Empirical Bayes Regret Minimization

    Authors: Chih-Wei Hsu, Branislav Kveton, Ofer Meshi, Martin Mladenov, Csaba Szepesvari

    Abstract: Most bandit algorithm designs are purely theoretical. Therefore, they have strong regret guarantees, but also are often too conservative in practice. In this work, we pioneer the idea of algorithm design by minimizing the empirical Bayes regret, the average regret over problem instances sampled from a known distribution. We focus on a tractable instance of this problem, the confidence interval and…

    Submitted 10 June, 2020; v1 submitted 4 April, 2019; originally announced April 2019.

  31. arXiv:1903.09132  [pdf, other]

    cs.LG stat.ML

    Perturbed-History Exploration in Stochastic Linear Bandits

    Authors: Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier

    Abstract: We propose a new online algorithm for cumulative regret minimization in a stochastic linear bandit. The algorithm pulls the arm with the highest estimated reward in a linear model trained on its perturbed history. Therefore, we call it perturbed-history exploration in a linear bandit (LinPHE). The perturbed history is a mixture of observed rewards and randomly generated i.i.d. pseudo-rewards. We d…

    Submitted 10 July, 2023; v1 submitted 21 March, 2019; originally announced March 2019.

    Comments: Proceedings of the 35th Conference on Uncertainty in Artificial Intelligence

  32. arXiv:1902.10089  [pdf, other]

    cs.LG stat.ML

    Perturbed-History Exploration in Stochastic Multi-Armed Bandits

    Authors: Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier

    Abstract: We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds $O(t)$ i.i.d. pseudo-rewards to its history in round $t$ and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are carefully designed to offset potentially underestimated mea…

    Submitted 5 November, 2019; v1 submitted 26 February, 2019; originally announced February 2019.

  33. arXiv:1811.05154  [pdf, other]

    cs.LG stat.ML

    Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

    Authors: Branislav Kveton, Csaba Szepesvari, Sharan Vaswani, Zheng Wen, Mohammad Ghavamzadeh, Tor Lattimore

    Abstract: We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We an…

    Submitted 20 June, 2019; v1 submitted 13 November, 2018; originally announced November 2018.

    Comments: Proceedings of the 36th International Conference on Machine Learning

  34. arXiv:1811.00911  [pdf, other]

    cs.IR cs.LG stat.ML

    Online Diverse Learning to Rank from Partial-Click Feedback

    Authors: Prakhar Gupta, Gaurush Hiranandani, Harvineet Singh, Branislav Kveton, Zheng Wen, Iftikhar Ahamath Burhanuddin

    Abstract: Learning to rank is an important problem in machine learning and recommender systems. In a recommender system, a user is typically recommended a list of items. Since the user is unlikely to examine the entire recommended list, partial feedback arises naturally. At the same time, diverse recommendations are important because it is challenging to model all tastes of the user in practice. In this pap…

    Submitted 21 November, 2018; v1 submitted 31 October, 2018; originally announced November 2018.

    Comments: The first three authors contributed equally to this work. 24 pages, 4 figures, 1 table

  35. arXiv:1806.05819  [pdf, other]

    cs.LG stat.ML

    BubbleRank: Safe Online Learning to Re-Rank via Implicit Click Feedback

    Authors: Chang Li, Branislav Kveton, Tor Lattimore, Ilya Markov, Maarten de Rijke, Csaba Szepesvari, Masrour Zoghi

    Abstract: In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges. This approach has generally become standard in industrial applications of ranking, such as search…

    Submitted 29 June, 2019; v1 submitted 15 June, 2018; originally announced June 2018.

  36. arXiv:1806.02248  [pdf, other]

    stat.ML cs.LG

    TopRank: A practical algorithm for online stochastic ranking

    Authors: Tor Lattimore, Branislav Kveton, Shuai Li, Csaba Szepesvari

    Abstract: Online learning to rank is a sequential decision-making problem where in each round the learning agent chooses a list of items and receives feedback in the form of clicks from the user. Many sample-efficient algorithms have been proposed for this problem that assume a specific click model connecting rankings and user behavior. We propose a generalized click model that encompasses many existing mod…

    Submitted 18 March, 2019; v1 submitted 6 June, 2018; originally announced June 2018.

  37. arXiv:1806.00892  [pdf, other]

    stat.ML cs.LG

    Conservative Exploration using Interleaving

    Authors: Sumeet Katariya, Branislav Kveton, Zheng Wen, Vamsi K. Potluru

    Abstract: In many practical problems, a learning agent may want to learn the best action in hindsight without ever taking a bad action, which is significantly worse than the default production action. In general, this is impossible because the agent has to explore unknown actions, some of which can be bad, to learn better actions. However, when the actions are combinatorial, this may be possible if the unkn…

    Submitted 3 June, 2018; originally announced June 2018.

  38. arXiv:1805.09793  [pdf, other]

    cs.LG stat.ML

    New Insights into Bootstrapping for Bandits

    Authors: Sharan Vaswani, Branislav Kveton, Zheng Wen, Anup Rao, Mark Schmidt, Yasin Abbasi-Yadkori

    Abstract: We investigate the use of bootstrapping in the bandit setting. We first show that the commonly used non-parametric bootstrapping (NPB) procedure can be provably inefficient and establish a near-linear lower bound on the regret incurred by it under the bandit model with Bernoulli rewards. We show that NPB with an appropriate amount of forced exploration can result in sub-linear albeit sub-optimal r…

    Submitted 24 May, 2018; originally announced May 2018.

  39. arXiv:1804.10488  [pdf, other]

    cs.LG stat.ML

    Offline Evaluation of Ranking Policies with Click Models

    Authors: Shuai Li, Yasin Abbasi-Yadkori, Branislav Kveton, S. Muthukrishnan, Vishwa Vinay, Zheng Wen

    Abstract: Many web systems rank and present a list of items to users, from recommender systems to search and advertising. An important problem in practice is to evaluate new ranking policies offline and optimize them before they are deployed. We address this problem by proposing evaluation algorithms for estimating the expected number of clicks on ranked lists from historical logged data. The existing algor…

    Submitted 13 June, 2018; v1 submitted 27 April, 2018; originally announced April 2018.

  40. arXiv:1802.03692  [pdf, other]

    stat.ML cs.LG

    Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit

    Authors: Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie

    Abstract: Multi-armed bandit (MAB) is a class of online learning problems where a learning agent aims to maximize its expected cumulative reward while repeatedly selecting to pull arms with unknown reward distributions. We consider a scenario where the reward distributions may change in a piecewise-stationary fashion at unknown time steps. We show that by incorporating a simple change-detection component wi…

    Submitted 24 January, 2019; v1 submitted 10 February, 2018; originally announced February 2018.

  41. arXiv:1712.04644  [pdf, other]

    cs.LG stat.ML

    Stochastic Low-Rank Bandits

    Authors: Branislav Kveton, Csaba Szepesvari, Anup Rao, Zheng Wen, Yasin Abbasi-Yadkori, S. Muthukrishnan

    Abstract: Many problems in computer vision and recommender systems involve low-rank matrices. In this work, we study the problem of finding the maximum entry of a stochastic low-rank matrix from sequential observations. At each step, a learning agent chooses pairs of row and column arms, and receives the noisy product of their latent values as a reward. The main challenge is that the latent values are unobs…

    Submitted 13 December, 2017; originally announced December 2017.

  42. arXiv:1709.07172  [pdf, ps, other]

    cs.LG stat.ML

    SpectralLeader: Online Spectral Learning for Single Topic Models

    Authors: Tong Yu, Branislav Kveton, Zheng Wen, Hung Bui, Ole J. Mengshoel

    Abstract: We study the problem of learning a latent variable model from a stream of data. Latent variable models are popular in practice because they can explain observed data in terms of unobserved concepts. These models have been traditionally studied in the offline setting. In the online setting, on the other hand, the online EM is arguably the most popular algorithm for learning latent variable models.…

    Submitted 25 April, 2018; v1 submitted 21 September, 2017; originally announced September 2017.

    Comments: 17 pages, 2 figures

  43. arXiv:1703.06513  [pdf, other]

    cs.LG stat.ML

    Bernoulli Rank-$1$ Bandits for Click Feedback

    Authors: Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Claire Vernade, Zheng Wen

    Abstract: The probability that a user will click a search result depends both on its relevance and its position on the results page. The position based model explains this behavior by ascribing to every item an attraction probability, and to every position an examination probability. To be clicked, a result must be both attractive and examined. The probabilities of an item-position pair being clicked thus f…

    Submitted 19 March, 2017; originally announced March 2017.

  44. arXiv:1703.02527  [pdf, other]

    cs.LG stat.ML

    Online Learning to Rank in Stochastic Click Models

    Authors: Masrour Zoghi, Tomas Tunys, Mohammad Ghavamzadeh, Branislav Kveton, Csaba Szepesvari, Zheng Wen

    Abstract: Online learning to rank is a core problem in information retrieval and machine learning. Many provably efficient algorithms have been recently proposed for this problem in specific click models. The click model is a model of how the user interacts with a list of documents. Though these results are significant, their impact on practice is limited, because all proposed algorithms are designed for sp…

    Submitted 20 June, 2017; v1 submitted 7 March, 2017; originally announced March 2017.

    Comments: Proceedings of the 34th International Conference on Machine Learning

  45. arXiv:1608.03023  [pdf, other]

    cs.LG stat.ML

    Stochastic Rank-1 Bandits

    Authors: Sumeet Katariya, Branislav Kveton, Csaba Szepesvari, Claire Vernade, Zheng Wen

    Abstract: We propose stochastic rank-$1$ bandits, a class of online learning problems where at each step a learning agent chooses a pair of row and column arms, and receives the product of their values as a reward. The main challenge of the problem is that the individual values of the row and column are unobserved. We assume that these values are stochastic and drawn independently. We propose a computationa…

    Submitted 8 March, 2017; v1 submitted 9 August, 2016; originally announced August 2016.

    Comments: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics

  46. arXiv:1605.06593  [pdf, other]

    cs.LG cs.AI cs.SI math.OC stat.ML

    Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback

    Authors: Zheng Wen, Branislav Kveton, Michal Valko, Sharan Vaswani

    Abstract: We study the online influence maximization problem in social networks under the independent cascade model. Specifically, we aim to learn the set of "best influencers" in a social network online while repeatedly interacting with it. We address the challenges of (i) combinatorial action space, since the number of feasible influencer sets grows exponentially with the maximum number of influencers, an…

    Submitted 19 June, 2018; v1 submitted 21 May, 2016; originally announced May 2016.

    Comments: Compared with the previous version, this version has fixed a mistake. This version is also consistent with the NIPS camera-ready version

    Journal ref: Z. Wen, B. Kveton, M. Valko, and S. Vaswani, "Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback", Advances in Neural Information Processing Systems 30 Proceedings, 2017

  47. arXiv:1603.05359  [pdf, ps, other]

    cs.LG stat.ML

    Cascading Bandits for Large-Scale Recommendation Problems

    Authors: Shi Zong, Hao Ni, Kenny Sung, Nan Rosemary Ke, Zheng Wen, Branislav Kveton

    Abstract: Most recommender systems recommend a list of items. The user examines the list, from the first item to the last, and often chooses the first attractive item and does not examine the rest. This type of user behavior can be modeled by the cascade model. In this work, we study cascading bandits, an online learning variant of the cascade model where the goal is to recommend $K$ most attractive items f…

    Submitted 30 June, 2016; v1 submitted 17 March, 2016; originally announced March 2016.

    Comments: Accepted to UAI 2016

  48. arXiv:1602.03146  [pdf, other]

    cs.LG stat.ML

    DCM Bandits: Learning to Rank with Multiple Clicks

    Authors: Sumeet Katariya, Branislav Kveton, Csaba Szepesvári, Zheng Wen

    Abstract: A search engine recommends to the user a list of web pages. The user examines this list, from the first page to the last, and clicks on all attractive pages until the user is satisfied. This behavior of the user can be described by the dependent click model (DCM). We propose DCM bandits, an online learning variant of the DCM where the goal is to maximize the probability of recommending satisfactor…

    Submitted 31 May, 2016; v1 submitted 9 February, 2016; originally announced February 2016.

    Comments: Proceedings of the 33rd International Conference on Machine Learning

  49. arXiv:1602.03105  [pdf, other]

    cs.DS cs.LG stat.ML

    Graphical Model Sketch

    Authors: Branislav Kveton, Hung Bui, Mohammad Ghavamzadeh, Georgios Theocharous, S. Muthukrishnan, Siqi Sun

    Abstract: Structured high-cardinality data arises in many domains, and poses a major challenge for both modeling and inference. Graphical models are a popular approach to modeling structured data but they are unsuitable for high-cardinality variables. The count-min (CM) sketch is a popular approach to estimating probabilities in high-cardinality data but it does not scale well beyond a few variables. In thi…

    Submitted 18 July, 2016; v1 submitted 9 February, 2016; originally announced February 2016.

    Comments: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases

  50. arXiv:1507.04208  [pdf, other]

    cs.LG stat.ML

    Combinatorial Cascading Bandits

    Authors: Branislav Kveton, Zheng Wen, Azin Ashkan, Csaba Szepesvari

    Abstract: We propose combinatorial cascading bandits, a class of partial monitoring problems where at each step a learning agent chooses a tuple of ground items subject to constraints and receives a reward if and only if the weights of all chosen items are one. The weights of the items are binary, stochastic, and drawn independently of each other. The agent observes the index of the first chosen item whose…

    Submitted 17 November, 2015; v1 submitted 15 July, 2015; originally announced July 2015.

    Comments: Advances in Neural Information Processing Systems 28