Showing 1–8 of 8 results for author: Kumbier, K

  1. arXiv:2403.08971

    stat.CO

    Designing a Data Science simulation with MERITS: A Primer

    Authors: Corrine F Elliott, James PC Duncan, Tiffany M Tang, Merle Behr, Karl Kumbier, Bin Yu

    Abstract: Simulations play a crucial role in the modern scientific process. Yet despite (or due to) this ubiquity, the Data Science community shares neither a comprehensive definition for a "high-quality" study nor a consolidated guide to designing one. Inspired by the Predictability-Computability-Stability (PCS) framework for 'veridical' Data Science, we propose six MERITS that a simulation study should sa…

    Submitted 15 May, 2025; v1 submitted 13 March, 2024; originally announced March 2024.

    Comments: 31 pages (main text); 1 figure; 2 tables; James PC Duncan, Tiffany M Tang: Authors contributed equally to this manuscript; Merle Behr, Karl Kumbier: Authors contributed equally to this manuscript

  2. Curating a COVID-19 data repository and forecasting county-level death counts in the United States

    Authors: Nick Altieri, Rebecca L. Barter, James Duncan, Raaz Dwivedi, Karl Kumbier, Xiao Li, Robert Netzorg, Briton Park, Chandan Singh, Yan Shuo Tan, Tiffany Tang, Yu Wang, Chao Zhang, Bin Yu

    Abstract: As the COVID-19 outbreak evolves, accurate forecasting continues to play an extremely important role in informing policy decisions. In this paper, we present our continuous curation of a large data repository containing COVID-19 information from a range of sources. We use this data to develop predictions and corresponding prediction intervals for the short-term trajectory of COVID-19 cumulative de…

    Submitted 9 August, 2020; v1 submitted 16 May, 2020; originally announced May 2020.

    Comments: Authors ordered alphabetically. All authors contributed significantly to this work. All collected data, modeling code, forecasts, and visualizations are updated daily and available at https://github.com/Yu-Group/covid19-severity-prediction

    Journal ref: Published in Harvard Data Science Review, 2020

  3. arXiv:1906.10845

    stat.ML cs.LG

    A Debiased MDI Feature Importance Measure for Random Forests

    Authors: Xiao Li, Yu Wang, Sumanta Basu, Karl Kumbier, Bin Yu

    Abstract: Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high imp…

    Submitted 26 October, 2019; v1 submitted 26 June, 2019; originally announced June 2019.

    Comments: NeurIPS'19. The first two authors contributed equally to this paper

  4. Veridical Data Science

    Authors: Bin Yu, Karl Kumbier

    Abstract: Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprised of both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow us…

    Submitted 12 November, 2019; v1 submitted 23 January, 2019; originally announced January 2019.

  5. arXiv:1901.04592

    stat.ML cs.AI cs.LG stat.AP

    Interpretable machine learning: definitions, methods, and applications

    Authors: W. James Murdoch, Chandan Singh, Karl Kumbier, Reza Abbasi-Asl, Bin Yu

    Abstract: Machine-learning models have demonstrated great success in learning complex patterns that enable them to make predictions about unobserved data. In addition to using models for prediction, the ability to interpret what a model has learned is receiving an increasing amount of attention. However, this increased focus has led to considerable confusion about the notion of interpretability. In particul…

    Submitted 14 January, 2019; originally announced January 2019.

    Comments: 11 pages

    Journal ref: Published in PNAS 2019

  6. arXiv:1810.07287

    stat.ML cs.LG

    Signed iterative random forests to identify enhancer-associated transcription factor binding

    Authors: Karl Kumbier, Sumanta Basu, Erwin Frise, Susan E. Celniker, James B. Brown, Bin Yu

    Abstract: Standard ChIP-seq peak calling pipelines seek to differentiate biochemically reproducible signals of individual genomic elements from background noise. However, reproducibility alone does not imply functional regulation (e.g., enhancer activation, alternative splicing). Here we present a general-purpose, interpretable machine learning method: signed iterative random forests (siRF), which we use to…

    Submitted 12 July, 2023; v1 submitted 16 October, 2018; originally announced October 2018.

  7. arXiv:1712.03779

    stat.ML cs.AI

    Artificial Intelligence and Statistics

    Authors: Bin Yu, Karl Kumbier

    Abstract: Artificial intelligence (AI) is intrinsically data-driven. It calls for the application of statistical concepts through human-machine collaboration during generation of data, development of algorithms, and evaluation of results. This paper discusses how such human-machine collaboration can be approached through the statistical concepts of population, question of interest, representativeness of tra…

    Submitted 7 December, 2017; originally announced December 2017.

  8. arXiv:1706.08457

    stat.ML q-bio.GN

    Iterative Random Forests to detect predictive and stable high-order interactions

    Authors: Sumanta Basu, Karl Kumbier, James B. Brown, Bin Yu

    Abstract: Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. B…

    Submitted 23 December, 2017; v1 submitted 26 June, 2017; originally announced June 2017.