Showing 1–16 of 16 results for author: Mentch, L

  1. arXiv:2103.16700

    stat.ML cs.LG

    Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest

    Authors: Siyu Zhou, Lucas Mentch

    Abstract: Due to their long-standing reputation as excellent off-the-shelf predictors, random forests continue to remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently, little was known about their inner-workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged…

    Submitted 30 March, 2021; originally announced March 2021.

  2. arXiv:2103.03462

    stat.ME stat.AP stat.CO

    Forward Stability and Model Path Selection

    Authors: Nicholas Kissel, Lucas Mentch

    Abstract: Most scientific publications follow the familiar recipe of (i) obtain data, (ii) fit a model, and (iii) comment on the scientific relevance of the effects of particular covariates in that model. This approach, however, ignores the fact that there may exist a multitude of similarly-accurate models in which the implied effects of individual covariates may be vastly different. This problem of finding…

    Submitted 4 March, 2021; originally announced March 2021.

  3. arXiv:2102.12328

    stat.OT cs.LG

    Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

    Authors: Lucas Mentch, Giles Hooker

    Abstract: In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developmen…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: In response to the Journal of Observational Studies reprinting Leo Breiman's paper "Statistical Modeling: The Two Cultures" on its 20th anniversary

  4. arXiv:2004.14500

    cs.CL cs.LG stat.ML

    Posterior Calibrated Training on Sentence Classification Tasks

    Authors: Taehee Jung, Dongyeop Kang, Hua Cheng, Lucas Mentch, Thomas Schaaf

    Abstract: Most classification models work by first predicting a posterior probability distribution over all classes and then selecting that class with the largest estimated probability. In many settings however, the quality of posterior probability itself (e.g., 65% chance having diabetes), gives more reliable information than the final predicted class alone. When these methods are shown to be poorly calibr…

    Submitted 1 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020

  5. arXiv:2003.03629

    stat.ML cs.LG

    Getting Better from Worse: Augmented Bagging and a Cautionary Tale of Variable Importance

    Authors: Lucas Mentch, Siyu Zhou

    Abstract: As the size, complexity, and availability of data continues to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specifications. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among varia…

    Submitted 9 November, 2020; v1 submitted 7 March, 2020; originally announced March 2020.

  6. arXiv:1912.03018

    stat.AP

    On Racial Disparities in Recent Fatal Police Shootings

    Authors: Lucas Mentch

    Abstract: Fatal police shootings in the United States continue to be a polarizing social and political issue. Clear disagreement between racial proportions of victims and nationwide racial demographics together with graphic video footage has created fertile ground for controversy. However, simple population level summary statistics fail to take into account fundamental local characteristics such as county-l…

    Submitted 6 December, 2019; originally announced December 2019.

    Comments: Accepted at Statistics and Public Policy

  7. arXiv:1912.01089

    stat.ML cs.LG stat.CO stat.ME

    $V$-statistics and Variance Estimation

    Authors: Zhengze Zhou, Lucas Mentch, Giles Hooker

    Abstract: This paper develops a general framework for analyzing asymptotics of $V$-statistics. Previous literature on limiting distribution mainly focuses on the cases when $n \to \infty$ with fixed kernel size $k$. Under some regularity conditions, we demonstrate asymptotic normality when $k$ grows with $n$ by utilizing existing results for $U$-statistics. The key in our approach lies in a mathematical red…

    Submitted 6 May, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: This version supersedes the previous technical report titled "Asymptotic Normality and Variance Estimation For Supervised Ensembles". Extensive simulations are added and we also provide a more detailed discussion on the bias phenomenon in variance estimation

  8. arXiv:1911.00190

    stat.ML cs.LG

    Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

    Authors: Lucas Mentch, Siyu Zhou

    Abstract: Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim t…

    Submitted 14 September, 2020; v1 submitted 31 October, 2019; originally announced November 2019.

    Comments: To Appear in the Journal of Machine Learning Research (JMLR)

  9. arXiv:1908.09967

    stat.ML cs.LG stat.ME

    Locally Optimized Random Forests

    Authors: Tim Coleman, Kimberly Kaufeld, Mary Frances Dorn, Lucas Mentch

    Abstract: Standard supervised learning procedures are validated against a test set that is assumed to have come from the same distribution as the training data. However, in many problems, the test data may have come from a different distribution. We consider the case of having many labeled observations from one distribution, $P_1$, and making predictions at unlabeled points that come from $P_2$. We combine…

    Submitted 26 August, 2019; originally announced August 2019.

    Comments: 23 pages, 7 figures

  10. arXiv:1905.10651

    stat.ML cs.LG math.ST

    Asymptotic Distributions and Rates of Convergence for Random Forests via Generalized U-statistics

    Authors: Wei Peng, Tim Coleman, Lucas Mentch

    Abstract: Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, however, until recently, few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other superv…

    Submitted 16 November, 2021; v1 submitted 25 May, 2019; originally announced May 2019.

    Comments: 76 pages, 7 figures

  11. arXiv:1905.03151

    stat.ME cs.LG stat.ML

    Unrestricted Permutation forces Extrapolation: Variable Importance Requires at least One More Model, or There Is No Free Variable Importance

    Authors: Giles Hooker, Lucas Mentch, Siyu Zhou

    Abstract: This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationall…

    Submitted 7 October, 2021; v1 submitted 1 May, 2019; originally announced May 2019.

    MSC Class: 62G08 ACM Class: I.5.1

  12. arXiv:1904.07830

    stat.ME stat.ML

    Scalable and Efficient Hypothesis Testing with Random Forests

    Authors: Tim Coleman, Wei Peng, Lucas Mentch

    Abstract: Throughout the last decade, random forests have established themselves as among the most accurate and popular supervised learning methods. While their black-box nature has made their mathematical analysis difficult, recent work has established important statistical properties like consistency and asymptotic normality by considering subsampling in lieu of bootstrapping. Though such results open the…

    Submitted 6 December, 2019; v1 submitted 16 April, 2019; originally announced April 2019.

    Comments: 52 pages, 10 figures [fixed some critical typos in Algorithm 1]

  13. arXiv:1710.09793

    q-bio.PE stat.AP

    Statistical Inference on Tree Swallow Migrations with Random Forests

    Authors: Tim Coleman, Lucas Mentch, Daniel Fink, Frank La Sorte, Giles Hooker, Wesley Hochachka, David Winkler

    Abstract: Bird species' migratory patterns have typically been studied through individual observations and historical records. In recent years however, the eBird citizen science project, which solicits observations from thousands of bird watchers around the world, has opened the door for a data-driven approach to understanding the large-scale geographical movements. Here, we focus on the North American Tree…

    Submitted 8 November, 2019; v1 submitted 26 October, 2017; originally announced October 2017.

    Comments: 23 pages, 7 figures. Work between Cornell Lab of Ornithology and University of Pittsburgh Department of Statistics

  14. arXiv:1506.00553

    stat.ML

    Bootstrap Bias Corrections for Ensemble Methods

    Authors: Giles Hooker, Lucas Mentch

    Abstract: This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important obstacle in recent efforts to develop statistical inference for machine learning methods. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context o…

    Submitted 1 June, 2015; originally announced June 2015.

  15. arXiv:1406.1845

    stat.ML stat.AP

    Formal Hypothesis Tests for Additive Structure in Random Forests

    Authors: Lucas Mentch, Giles Hooker

    Abstract: While statistical learning methods have proved powerful tools for predictive modeling, the black-box nature of the models they produce can severely limit their interpretability and the ability to conduct formal inference. However, the natural structure of ensemble learners like bagged trees and random forests has been shown to admit desirable asymptotic properties when base learners are built with…

    Submitted 26 August, 2016; v1 submitted 6 June, 2014; originally announced June 2014.

  16. arXiv:1404.6473

    stat.ML stat.AP stat.CO stat.ME

    Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

    Authors: Lucas Mentch, Giles Hooker

    Abstract: This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by ave…

    Submitted 10 September, 2015; v1 submitted 25 April, 2014; originally announced April 2014.

    Comments: To appear in The Journal of Machine Learning Research