-
Entropy-Based Strategies for Multi-Bracket Pools
Authors:
Ryan S. Brill,
Abraham J. Wyner,
Ian J. Barnett
Abstract:
Much work in the parimutuel betting literature has discussed estimating event outcome probabilities or developing optimal wagering strategies, particularly for horse race betting. Some betting pools, however, involve betting not just on a single event, but on a tuple of events. For example, pick six betting in horse racing, March Madness bracket challenges, and predicting a randomly drawn bitstrin…
▽ More
Much work in the parimutuel betting literature has discussed estimating event outcome probabilities or developing optimal wagering strategies, particularly for horse race betting. Some betting pools, however, involve betting not just on a single event, but on a tuple of events. For example, pick six betting in horse racing, March Madness bracket challenges, and predicting a randomly drawn bitstring each involve making a series of individual forecasts. Although traditional optimal wagering strategies work well when the size of the tuple is very small (e.g., betting on the winner of a horse race), they are intractable for more general betting pools in higher dimensions (e.g., March Madness bracket challenges). Hence we pose the multi-brackets problem: supposing we wish to predict a tuple of events and that we know the true probabilities of each potential outcome of each event, what is the best way to tractably generate a set of $n$ predicted tuples? The most general version of this problem is extremely difficult, so we begin with a simpler setting. In particular, we generate $n$ independent predicted tuples according to a distribution having optimal entropy. This entropy-based approach is tractable, scalable, and performs well.
△ Less
Submitted 20 March, 2024; v1 submitted 28 August, 2023;
originally announced August 2023.
-
Introducing Grid WAR: Rethinking WAR for Starting Pitchers
Authors:
Ryan S. Brill,
Abraham J. Wyner
Abstract:
The baseball statistic "Wins Above Replacement" (WAR) has emerged as one of the most popular evaluation metrics. But it is not readily observed and tabulated; WAR is an estimate of a parameter in a vaguely defined model with all its attendant assumptions. Industry-standard models of WAR for starting pitchers from FanGraphs and Baseball Reference all assume that season-long averages are sufficient…
▽ More
The baseball statistic "Wins Above Replacement" (WAR) has emerged as one of the most popular evaluation metrics. But it is not readily observed and tabulated; WAR is an estimate of a parameter in a vaguely defined model with all its attendant assumptions. Industry-standard models of WAR for starting pitchers from FanGraphs and Baseball Reference all assume that season-long averages are sufficient statistics for a pitcher's performance. This provides an invalid mathematical foundation for many reasons, especially because WAR should not be linear with respect to any counting statistic. To repair this defect, as well as many others, we devise a new measure, Grid WAR, which accurately estimates a starting pitcher's WAR on a per-game basis. The convexity of Grid WAR diminishes the impact of "blow-up" games and upweights exceptional games, raising the valuation of pitchers like Sandy Koufax, Whitey Ford, and Catfish Hunter who exhibit fundamental game-by-game variance. Grid WAR is designed to accurately measure past performance, but also has predictive value insofar as a pitcher's Grid WAR is better than WAR at predicting future performance. Finally, at https://gridwar.xyz we host a Shiny app which displays the Grid WAR results of each MLB game since 1952, including career, season, and game level results, which updates automatically every morning.
△ Less
Submitted 9 February, 2024; v1 submitted 12 September, 2022;
originally announced September 2022.
-
Making Sense of Random Forest Probabilities: a Kernel Perspective
Authors:
Matthew A. Olson,
Abraham J. Wyner
Abstract:
A random forest is a popular tool for estimating probabilities in machine learning classification tasks. However, the means by which this is accomplished is unprincipled: one simply counts the fraction of trees in a forest that vote for a certain class. In this paper, we forge a connection between random forests and kernel regression. This places random forest probability estimation on more sound…
▽ More
A random forest is a popular tool for estimating probabilities in machine learning classification tasks. However, the means by which this is accomplished is unprincipled: one simply counts the fraction of trees in a forest that vote for a certain class. In this paper, we forge a connection between random forests and kernel regression. This places random forest probability estimation on more sound statistical footing. As part of our investigation, we develop a model for the proximity kernel and relate it to the geometry and sparsity of the estimation problem. We also provide intuition and recommendations for tuning a random forest to improve its probability estimates.
△ Less
Submitted 14 December, 2018;
originally announced December 2018.
-
Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers
Authors:
Abraham J. Wyner,
Matthew Olson,
Justin Bleich,
David Mease
Abstract:
There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explana…
▽ More
There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting's interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which creates what we denote as a "spikey-smooth" classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples and some theoretical justification to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees and without direct regularization or early stopping.
△ Less
Submitted 29 April, 2017; v1 submitted 28 April, 2015;
originally announced April 2015.