Showing 1–16 of 16 results for author: Leek, J T

  1. arXiv:2410.09665  [pdf, other]

    stat.ME stat.CO

    ipd: An R Package for Conducting Inference on Predicted Data

    Authors: Stephen Salerno, Jiacheng Miao, Awan Afiaz, Kentaro Hoffman, Anna Neufeld, Qiongshi Lu, Tyler H. McCormick, Jeffrey T. Leek

    Abstract: Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recent proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapp…

    Submitted 12 October, 2024; originally announced October 2024.

    Comments: 5 pages, 1 figure

  2. arXiv:2404.02438  [pdf, other]

    cs.CL cs.LG stat.ML

    From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives

    Authors: Shuxian Fan, Adam Visokay, Kentaro Hoffman, Stephen Salerno, Li Liu, Jeffrey T. Leek, Tyler H. McCormick

    Abstract: In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps (i) predicting likely COD using the VA interview and (ii…

    Submitted 2 April, 2024; originally announced April 2024.

    Comments: 12 pages, 7 figures

  3. arXiv:2401.08702  [pdf, other]

    stat.ME cs.LG

    Do We Really Even Need Data?

    Authors: Kentaro Hoffman, Stephen Salerno, Awan Afiaz, Jeffrey T. Leek, Tyler H. McCormick

    Abstract: As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association b…

    Submitted 2 February, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

  4. arXiv:2306.03255  [pdf, other]

    cs.SE q-bio.OT

    Evaluation of software impact designed for biomedical research: Are we measuring what's meaningful?

    Authors: Awan Afiaz, Andrey Ivanov, John Chamberlin, David Hanauer, Candace Savonen, Mary J Goldman, Martin Morgan, Michael Reich, Alexander Getka, Aaron Holmes, Sarthak Pati, Dan Knight, Paul C. Boutros, Spyridon Bakas, J. Gregory Caporaso, Guilherme Del Fiol, Harry Hochheiser, Brian Haas, Patrick D. Schloss, James A. Eddy, Jake Albrecht, Andrey Fedorov, Levi Waldron, Ava M. Hoffman, Richard L. Bradshaw, et al. (2 additional authors not shown)

    Abstract: Software is vital for the advancement of biology and medicine. Analysis of usage and impact metrics can help developers determine user and community engagement, justify additional funding, encourage additional use, identify unanticipated use cases, and help define improvement areas. However, there are challenges associated with these analyses including distorted or misleading metrics, as well as e…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: 25 total pages (17 pages for manuscript and 8 pages for the supplement). There are 2 figures

  5. arXiv:2305.06213  [pdf, other]

    cs.CY physics.ed-ph

    Motivation, inclusivity, and realism should drive data science education

    Authors: Candace Savonen, Carrie Wright, Ava M. Hoffman, Elizabeth M. Humphries, Katherine E. L. Cox, Frederick J. Tan, Jeffrey T. Leek

    Abstract: Data science education provides tremendous opportunities but remains inaccessible to many communities. Increasing the accessibility of data science to these communities not only benefits the individuals entering data science, but also increases the field's innovation and potential impact as a whole. Education is the most scalable solution to meet these needs, but many data science educators lack f…

    Submitted 9 May, 2023; originally announced May 2023.

    Comments: This has been submitted to F1000 and is under review (as of 5/9/23)

  6. arXiv:2203.07083  [pdf, other]

    cs.CY

    Open-source Tools for Training Resources -- OTTR

    Authors: Candace Savonen, Carrie Wright, Ava M. Hoffman, John Muschelli, Katherine Cox, Frederick J. Tan, Jeffrey T. Leek

    Abstract: Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources often deprecate because their maintenance is not prioritized by funding, giving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resource…

    Submitted 10 March, 2022; originally announced March 2022.

    Comments: 19 pages, 5 figures, submitted to Journal of Statistics and Data Science Education

  7. arXiv:2201.08443  [pdf]

    q-bio.OT cs.CY

    Diversifying the Genomic Data Science Research Community

    Authors: The Genomic Data Science Community Network, Rosa Alcazar, Maria Alvarez, Rachel Arnold, Mentewab Ayalew, Lyle G. Best, Michael C. Campbell, Kamal Chowdhury, Katherine E. L. Cox, Christina Daulton, Youping Deng, Carla Easter, Karla Fuller, Shazia Tabassum Hakim, Ava M. Hoffman, Natalie Kucher, Andrew Lee, Joslynn Lee, Jeffrey T. Leek, Robert Meller, Loyda B. Méndez, Miguel P. Méndez-González, Stephen Mosher, Michele Nishiguchi, Siddharth Pratap, et al. (13 additional authors not shown)

    Abstract: Over the last 20 years, there has been an explosion of genomic data collected for disease association, functional analyses, and other large-scale discoveries. At the same time, there have been revolutions in cloud computing that enable computational and data science research, while making data accessible to anyone with a web browser and an internet connection. However, students at institutions wit…

    Submitted 9 June, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: 42 pages, 3 figures

  8. arXiv:2104.12555  [pdf, other]

    cs.CY stat.AP

    Linking open-source code commits and MOOC grades to evaluate massive online open peer review

    Authors: Siruo Wang, Leah R. Jager, Kai Kammers, Aboozar Hadavand, Jeffrey T. Leek

    Abstract: Massive Open Online Courses (MOOCs) have been used by students as a low-cost and low-touch educational credential in a variety of fields. Understanding the grading mechanisms behind these course assignments is important for evaluating MOOC credentials. A common approach to grading free-response assignments is massive scale peer-review, especially used for assignments that are not easy to grade pro…

    Submitted 15 April, 2021; originally announced April 2021.

  9. arXiv:2007.13477  [pdf]

    cs.MM

    Ari: The Automated R Instructor

    Authors: Sean Kross, Jeffrey T. Leek, John Muschelli

    Abstract: We present the ari package for automatically generating technology-focused educational videos. The goal of the package is to create reproducible videos, with the ability to change and update video content seamlessly. We present several examples of generating videos including using R Markdown slide decks, PowerPoint slides, or simple images as source material. We also discuss how ari can help instr…

    Submitted 4 August, 2020; v1 submitted 27 May, 2020; originally announced July 2020.

    Comments: - reformatted section headings - added several citations - linted and reformatted code chunks

  10. The importance of transparency and reproducibility in artificial intelligence research

    Authors: Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, MAQC Society Board, Levi Waldron, Bo Wang, Chris McIntosh, Anshul Kundaje, Casey S. Greene, Michael M. Hoffman, Jeffrey T. Leek, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbush, Hugo J. W. L. Aerts

    Abstract: In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al. and provide solutions with implications for the broader field.

    Submitted 7 March, 2020; v1 submitted 28 February, 2020; originally announced March 2020.

    Journal ref: Nature 586 (2020) E14-E16

  11. Tools for analyzing R code the tidy way

    Authors: Lucy D'Agostino McGowan, Sean Kross, Jeffrey T. Leek

    Abstract: With the current emphasis on reproducibility and replicability, there is an increasing need to examine how data analyses are conducted. In order to analyze the between researcher variability in data analysis choices as well as the aspects within the data analysis pipeline that contribute to the variability in results, we have created two R packages: matahari and tidycode. These packages build on m…

    Submitted 20 May, 2019; originally announced May 2019.

    Journal ref: The R Journal, 12(1), 226 (2020)

  12. arXiv:1509.08968  [pdf, other]

    stat.AP

    A glass half full interpretation of the replicability of psychological science

    Authors: Jeffrey T. Leek, Prasad Patil, Roger D. Peng

    Abstract: A recent study of the replicability of key psychological findings is a major contribution toward understanding the human side of the scientific process. Despite the careful and nuanced analysis reported in the paper, mass and social media adhered to the simple narrative that only 36% of the studies replicated their original results. Here we show that 77% of the replication effect sizes reported we…

    Submitted 29 September, 2015; originally announced September 2015.

    Comments: 6 pages, 3 figures

  13. Reproducible Research Can Still Be Wrong: Adopting a Prevention Approach

    Authors: Jeffrey T. Leek, Roger D. Peng

    Abstract: Reproducibility, the ability to recompute results, and replicability, the chances other experimenters will achieve a consistent result, are two foundational characteristics of successful scientific research. Consistent findings from independent investigators are the primary means by which scientific evidence accumulates for or against an hypothesis. And yet, of late there has been a crisis of conf…

    Submitted 10 February, 2015; originally announced February 2015.

    Comments: 3 pages, 1 figure

    Journal ref: PNAS 112 (6) 1645-1645, 2015

  14. arXiv:1301.3947  [pdf, other]

    stat.ME stat.AP stat.CO

    Removing batch effects for prediction problems with frozen surrogate variable analysis

    Authors: Hilary S. Parker, Héctor Corrada Bravo, Jeffrey T. Leek

    Abstract: Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely-publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed to be used in population studies. But genomic technologies are beginning to be used in clinical applications where sa…

    Submitted 16 January, 2013; originally announced January 2013.

  15. arXiv:1301.3933  [pdf, other]

    stat.ME q-bio.GN q-bio.QM

    Gene set bagging for estimating replicability of gene set analyses

    Authors: Andrew E. Jaffe, John D. Storey, Hongkai Ji, Jeffrey T. Leek

    Abstract: Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-th…

    Submitted 17 January, 2013; v1 submitted 16 January, 2013; originally announced January 2013.

    Comments: 3 Figures

  16. arXiv:1301.3718  [pdf]

    stat.AP

    Empirical estimates suggest most published medical research is true

    Authors: Leah R. Jager, Jeffrey T. Leek

    Abstract: The accuracy of published medical research is critical both for scientists, physicians and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in th…

    Submitted 16 January, 2013; originally announced January 2013.

    Comments: 11 pages, 4 figures, Correspondence to J. Leek