Lessons Learned Addressing Dataset Bias in Model-Based Candidate Generation at Twitter
Authors: Alim Virani, Jay Baxter, Dan Shiebler, Philip Gautier, Shivam Verma, Yan Xia, Apoorv Sharma, Sumit Binnani, Linlin Chen, Chenguang Yu
Abstract:
Traditionally, heuristic methods are used to generate candidates for large-scale recommender systems. Model-based candidate generation promises multiple potential advantages, primarily that we can explicitly optimize the same objective as the downstream ranking model. However, large-scale model-based candidate generation approaches suffer from dataset bias problems caused by the infeasibility of obtaining representative data on very irrelevant candidates. Popular techniques to correct dataset bias, such as inverse propensity scoring, do not work well in the context of candidate generation. We first explore the dynamics of the dataset bias problem and then demonstrate how to use random sampling techniques to mitigate it. Finally, in a novel application of fine-tuning, we show performance gains when applying our candidate generation system to Twitter's home timeline.
Submitted 13 May, 2021; originally announced May 2021.
Amazon SageMaker Autopilot: a white box AutoML solution at scale
Authors: Piali Das, Valerio Perrone, Nikita Ivkin, Tanya Bansal, Zohar Karnin, Huibin Shen, Iaroslav Shcherbatyi, Yotam Elor, Wilton Wu, Aida Zolic, Thibaut Lienart, Alex Tang, Amr Ahmed, Jean Baptiste Faddoul, Rodolphe Jenatton, Fela Winkelmolen, Philip Gautier, Leo Dirac, Andre Perunicic, Miroslav Miladinovic, Giovanni Zappella, Cédric Archambeau, Matthias Seeger, Bhaskar Dutt, Laurence Rouesnel
Abstract:
AutoML systems provide a black-box solution to machine learning problems by selecting the right way of processing features, choosing an algorithm and tuning the hyperparameters of the entire pipeline. Although these systems perform well on many datasets, there is still a non-negligible number of datasets for which the one-shot solution produced by each particular system would provide sub-par performance. In this paper, we present Amazon SageMaker Autopilot: a fully managed system providing an automated ML solution that can be modified when needed. Given a tabular dataset and the target column name, Autopilot identifies the problem type, analyzes the data and produces a diverse set of complete ML pipelines including feature preprocessing and ML algorithms, which are tuned to generate a leaderboard of candidate models. In the scenario where the performance is not satisfactory, a data scientist is able to view and edit the proposed ML pipelines in order to infuse their expertise and business knowledge without having to revert to a fully manual solution. This paper describes the different components of Autopilot, emphasizing the infrastructure choices that allow scalability, high-quality models, editable ML pipelines, consumption of artifacts of offline meta-learning, and a convenient integration with the entire SageMaker suite allowing these trained models to be used in a production setting.
Submitted 16 December, 2020; v1 submitted 15 December, 2020; originally announced December 2020.