-
ipd: An R Package for Conducting Inference on Predicted Data
Authors:
Stephen Salerno,
Jiacheng Miao,
Awan Afiaz,
Kentaro Hoffman,
Anna Neufeld,
Qiongshi Lu,
Tyler H. McCormick,
Jeffrey T. Leek
Abstract:
Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recently proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage. Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage vignette, is available at github.com/ipd-tools/ipd. Contact: [email protected] and [email protected]
Submitted 12 October, 2024;
originally announced October 2024.
-
From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives
Authors:
Shuxian Fan,
Adam Visokay,
Kentaro Hoffman,
Stephen Salerno,
Li Liu,
Jeffrey T. Leek,
Tyler H. McCormick
Abstract:
In settings where most deaths occur outside the healthcare system, verbal autopsies (VAs) are a common tool to monitor trends in causes of death (COD). VAs are interviews with a surviving caregiver or relative that are used to predict the decedent's COD. Turning VAs into actionable insights for researchers and policymakers requires two steps: (i) predicting likely COD using the VA interview and (ii) performing inference with predicted CODs (e.g. modeling the breakdown of causes by demographic factors using a sample of deaths). In this paper, we develop a method for valid inference using outcomes (in our case COD) predicted from free-form text using state-of-the-art NLP techniques. This method, which we call multiPPI++, extends recent work in "prediction-powered inference" to multinomial classification. We leverage a suite of NLP techniques for COD prediction and, through empirical analysis of VA data, demonstrate the effectiveness of our approach in handling transportability issues. multiPPI++ recovers ground truth estimates, regardless of which NLP model produced predictions and regardless of whether they were produced by a more accurate predictor like GPT-4-32k or a less accurate predictor like KNN. Our findings demonstrate the practical importance of inference correction for public health decision-making and suggest that if inference tasks are the end goal, having a small amount of contextually relevant, high quality labeled data is essential regardless of the NLP algorithm.
Submitted 2 April, 2024;
originally announced April 2024.
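The abstract above builds on "prediction-powered inference". As a rough illustration of that underlying idea only (not the multiPPI++ method itself, whose details are not given here), the following minimal sketch estimates a population mean by combining predictions on a large unlabeled sample with a bias correction measured on a small labeled sample. The function name `ppi_mean` and its arguments are hypothetical and chosen for illustration.

```python
from statistics import fmean


def ppi_mean(labeled_truth, labeled_pred, unlabeled_pred):
    """Prediction-powered point estimate of a population mean.

    Hypothetical helper for illustration; this is the basic mean
    estimator, not the multinomial multiPPI++ procedure.
    """
    if len(labeled_truth) != len(labeled_pred):
        raise ValueError("labeled truth and predictions must align")
    if not labeled_truth or not unlabeled_pred:
        raise ValueError("both samples must be non-empty")

    # Average prediction on the large unlabeled sample (possibly biased).
    naive = fmean(unlabeled_pred)

    # Rectifier: average prediction error observed on the labeled sample.
    rectifier = fmean(y - f for y, f in zip(labeled_truth, labeled_pred))

    # Corrected estimate: naive prediction mean plus the measured bias.
    return naive + rectifier


if __name__ == "__main__":
    # A predictor that always outputs 0.5 is corrected by the labeled data.
    print(ppi_mean([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5], [0.5] * 10))
```

The correction term is what lets a small set of high-quality labels offset systematic error in the predictor, which is consistent with the abstract's point that labeled data remains essential regardless of the prediction model.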
-
Do We Really Even Need Data?
Authors:
Kentaro Hoffman,
Stephen Salerno,
Awan Afiaz,
Jeffrey T. Leek,
Tyler H. McCormick
Abstract:
As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g. rising costs, declining survey response rates), researchers increasingly use predictions from pre-trained algorithms as outcome variables. Though appealing for financial and logistical reasons, using standard tools for inference can misrepresent the association between independent variables and the outcome of interest when the true, unobserved outcome is replaced by a predicted value. In this paper, we characterize the statistical challenges inherent to this so-called "inference with predicted data" problem and elucidate three potential sources of error: (i) the relationship between predicted outcomes and their true, unobserved counterparts, (ii) robustness of the machine learning model to resampling or uncertainty about the training data, and (iii) appropriately propagating not just bias but also uncertainty from predictions into the ultimate inference procedure.
Submitted 2 February, 2024; v1 submitted 14 January, 2024;
originally announced January 2024.
-
Evaluation of software impact designed for biomedical research: Are we measuring what's meaningful?
Authors:
Awan Afiaz,
Andrey Ivanov,
John Chamberlin,
David Hanauer,
Candace Savonen,
Mary J Goldman,
Martin Morgan,
Michael Reich,
Alexander Getka,
Aaron Holmes,
Sarthak Pati,
Dan Knight,
Paul C. Boutros,
Spyridon Bakas,
J. Gregory Caporaso,
Guilherme Del Fiol,
Harry Hochheiser,
Brian Haas,
Patrick D. Schloss,
James A. Eddy,
Jake Albrecht,
Andrey Fedorov,
Levi Waldron,
Ava M. Hoffman,
Richard L. Bradshaw
, et al. (2 additional authors not shown)
Abstract:
Software is vital for the advancement of biology and medicine. Analysis of usage and impact metrics can help developers determine user and community engagement, justify additional funding, encourage additional use, identify unanticipated use cases, and help define improvement areas. However, there are challenges associated with these analyses including distorted or misleading metrics, as well as ethical and security concerns. More attention to the nuances involved in capturing impact across the spectrum of biological software is needed. Furthermore, some tools may be especially beneficial to a small audience, yet may not have compelling typical usage metrics. We propose more general guidelines, as well as strategies for more specific types of software. We highlight outstanding issues regarding how communities measure or evaluate software impact. To get a deeper understanding of current practices for software evaluations, we performed a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We also investigated software among this community and others to assess how often infrastructure that supports such evaluations is implemented and how this impacts rates of papers describing usage of the software. We find that developers recognize the utility of analyzing software usage, but struggle to find the time or funding for such analyses. We also find that infrastructure such as social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers seem to be associated with increased usage rates. Our findings can help scientific software developers make the most out of evaluations of their software.
Submitted 5 June, 2023;
originally announced June 2023.
-
Motivation, inclusivity, and realism should drive data science education
Authors:
Candace Savonen,
Carrie Wright,
Ava M. Hoffman,
Elizabeth M. Humphries,
Katherine E. L. Cox,
Frederick J. Tan,
Jeffrey T. Leek
Abstract:
Data science education provides tremendous opportunities but remains inaccessible to many communities. Increasing the accessibility of data science to these communities not only benefits the individuals entering data science, but also increases the field's innovation and potential impact as a whole. Education is the most scalable solution to meet these needs, but many data science educators lack formal training in education. Our group has led education efforts for a variety of audiences: from professional scientists to high school students to lay audiences. These experiences have helped form our teaching philosophy, which we have summarized into three main ideals: 1) motivation, 2) inclusivity, and 3) realism. To put these ideals better into practice, we also aim to iteratively update our teaching approaches and curriculum as we find ways to better reach these ideals. In this manuscript we discuss these ideals as well as practical ideas for how to implement these philosophies in the classroom.
Submitted 9 May, 2023;
originally announced May 2023.
-
Open-source Tools for Training Resources -- OTTR
Authors:
Candace Savonen,
Carrie Wright,
Ava M. Hoffman,
John Muschelli,
Katherine Cox,
Frederick J. Tan,
Jeffrey T. Leek
Abstract:
Data science and informatics tools are developing at a blistering rate, but their users often lack the educational background or resources to efficiently apply the methods to their research. Training resources often deprecate because their maintenance is not prioritized by funding, giving teams little time to devote to such endeavors. Our group has developed Open-source Tools for Training Resources (OTTR) to offer greater efficiency and flexibility for creating and maintaining online course content. OTTR empowers creators to customize their work and allows for a simple workflow to publish using multiple platforms. OTTR allows content creators to publish material to multiple massive online learner communities using familiar rendering mechanics. OTTR allows the incorporation of pedagogical practices like formative and summative assessments in the form of multiple choice questions and fill in the blank problems that are automatically graded. No local installation of any software is required to begin creating content with OTTR. Thus far, 15 courses have been created with the OTTR repository template. By using the OTTR system, the maintenance workload for updating these courses across platforms has been drastically reduced.
Submitted 10 March, 2022;
originally announced March 2022.
-
Diversifying the Genomic Data Science Research Community
Authors:
The Genomic Data Science Community Network,
Rosa Alcazar,
Maria Alvarez,
Rachel Arnold,
Mentewab Ayalew,
Lyle G. Best,
Michael C. Campbell,
Kamal Chowdhury,
Katherine E. L. Cox,
Christina Daulton,
Youping Deng,
Carla Easter,
Karla Fuller,
Shazia Tabassum Hakim,
Ava M. Hoffman,
Natalie Kucher,
Andrew Lee,
Joslynn Lee,
Jeffrey T. Leek,
Robert Meller,
Loyda B. Méndez,
Miguel P. Méndez-González,
Stephen Mosher,
Michele Nishiguchi,
Siddharth Pratap
, et al. (13 additional authors not shown)
Abstract:
Over the last 20 years, there has been an explosion of genomic data collected for disease association, functional analyses, and other large-scale discoveries. At the same time, there have been revolutions in cloud computing that enable computational and data science research, while making data accessible to anyone with a web browser and an internet connection. However, students at institutions with limited resources have received relatively little exposure to curricula or professional development opportunities that lead to careers in genomic data science. To broaden participation in genomics research, the scientific community needs to support students, faculty, and administrators at Underserved Institutions (UIs) including Community Colleges, Historically Black Colleges and Universities, Hispanic-Serving Institutions, and Tribal Colleges and Universities in taking advantage of these tools in local educational and research programs. We have formed the Genomic Data Science Community Network (http://www.gdscn.org/) to identify opportunities and support broadening access to cloud-enabled genomic data science. Here, we provide a summary of the priorities for faculty members at UIs, as well as administrators, funders, and R1 researchers to consider as we create a more diverse genomic data science community.
Submitted 9 June, 2022; v1 submitted 20 January, 2022;
originally announced January 2022.
-
Linking open-source code commits and MOOC grades to evaluate massive online open peer review
Authors:
Siruo Wang,
Leah R. Jager,
Kai Kammers,
Aboozar Hadavand,
Jeffrey T. Leek
Abstract:
Massive Open Online Courses (MOOCs) have been used by students as a low-cost and low-touch educational credential in a variety of fields. Understanding the grading mechanisms behind these course assignments is important for evaluating MOOC credentials. A common approach to grading free-response assignments is massive scale peer-review, especially used for assignments that are not easy to grade programmatically. It is difficult to assess these approaches since the responses typically require human evaluation. Here we link data from public code repositories on GitHub and course grades for a large massive-online open course to study the dynamics of massive scale peer review. This has important implications for understanding the dynamics of difficult to grade assignments. Since the research was not hypothesis-driven, we described the results in an exploratory framework. We find three distinct clusters of repeated peer-review submissions and use these clusters to study how grades change in response to changes in code submissions. Our exploration also leads to an important observation that massive scale peer-review scores are highly variable, increase, on average, with repeated submissions, and changes in scores are not closely tied to the code changes that form the basis for the re-submissions.
Submitted 15 April, 2021;
originally announced April 2021.
-
Ari: The Automated R Instructor
Authors:
Sean Kross,
Jeffrey T. Leek,
John Muschelli
Abstract:
We present the ari package for automatically generating technology-focused educational videos. The goal of the package is to create reproducible videos, with the ability to change and update video content seamlessly. We present several examples of generating videos including using R Markdown slide decks, PowerPoint slides, or simple images as source material. We also discuss how ari can help instructors reach new audiences through programmatically translating materials into other languages.
Submitted 4 August, 2020; v1 submitted 27 May, 2020;
originally announced July 2020.
-
The importance of transparency and reproducibility in artificial intelligence research
Authors:
Benjamin Haibe-Kains,
George Alexandru Adam,
Ahmed Hosny,
Farnoosh Khodakarami,
MAQC Society Board,
Levi Waldron,
Bo Wang,
Chris McIntosh,
Anshul Kundaje,
Casey S. Greene,
Michael M. Hoffman,
Jeffrey T. Leek,
Wolfgang Huber,
Alvis Brazma,
Joelle Pineau,
Robert Tibshirani,
Trevor Hastie,
John P. A. Ioannidis,
John Quackenbush,
Hugo J. W. L. Aerts
Abstract:
In their study, McKinney et al. showed the high potential of artificial intelligence for breast cancer screening. However, the lack of detailed methods and computer code undermines its scientific value. We identify obstacles hindering transparent and reproducible AI research as faced by McKinney et al. and provide solutions with implications for the broader field.
Submitted 7 March, 2020; v1 submitted 28 February, 2020;
originally announced March 2020.
-
Tools for analyzing R code the tidy way
Authors:
Lucy D'Agostino McGowan,
Sean Kross,
Jeffrey T. Leek
Abstract:
With the current emphasis on reproducibility and replicability, there is an increasing need to examine how data analyses are conducted. In order to analyze the between researcher variability in data analysis choices as well as the aspects within the data analysis pipeline that contribute to the variability in results, we have created two R packages: matahari and tidycode. These packages build on methods created for natural language processing; rather than allowing for the processing of natural language, we focus on R code as the substrate of interest. The matahari package facilitates the logging of everything that is typed in the R console or in an R script in a tidy data frame. The tidycode package contains tools to allow for analyzing R calls in a tidy manner. We demonstrate the utility of these packages as well as walk through two examples.
Submitted 20 May, 2019;
originally announced May 2019.
-
A glass half full interpretation of the replicability of psychological science
Authors:
Jeffrey T. Leek,
Prasad Patil,
Roger D. Peng
Abstract:
A recent study of the replicability of key psychological findings is a major contribution toward understanding the human side of the scientific process. Despite the careful and nuanced analysis reported in the paper, mass and social media adhered to the simple narrative that only 36% of the studies replicated their original results. Here we show that 77% of the replication effect sizes reported were within a prediction interval based on the original effect size. In this light, the results of Reproducibility Project: Psychology can be viewed as a positive result for the scientific process.
Submitted 29 September, 2015;
originally announced September 2015.
-
Reproducible Research Can Still Be Wrong: Adopting a Prevention Approach
Authors:
Jeffrey T. Leek,
Roger D. Peng
Abstract:
Reproducibility, the ability to recompute results, and replicability, the chances other experimenters will achieve a consistent result, are two foundational characteristics of successful scientific research. Consistent findings from independent investigators are the primary means by which scientific evidence accumulates for or against an hypothesis. And yet, of late there has been a crisis of confidence among researchers worried about the rate at which studies are either reproducible or replicable. In order to maintain the integrity of science research and maintain the public's trust in science, the scientific community must ensure reproducibility and replicability by engaging in a more preventative approach that greatly expands data analysis education and routinely employs software tools.
Submitted 10 February, 2015;
originally announced February 2015.
-
Removing batch effects for prediction problems with frozen surrogate variable analysis
Authors:
Hilary S. Parker,
Héctor Corrada Bravo,
Jeffrey T. Leek
Abstract:
Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely-publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed to be used in population studies. But genomic technologies are beginning to be used in clinical applications where samples are analyzed one at a time for diagnostic, prognostic, and predictive applications. There are currently no batch correction methods that have been developed specifically for prediction. In this paper, we propose a new method called frozen surrogate variable analysis (fSVA) that borrows strength from a training set for individual sample batch correction. We show that fSVA improves prediction accuracy in simulations and in public genomic studies. fSVA is available as part of the sva Bioconductor package.
Submitted 16 January, 2013;
originally announced January 2013.
-
Gene set bagging for estimating replicability of gene set analyses
Authors:
Andrew E. Jaffe,
John D. Storey,
Hongkai Ji,
Jeffrey T. Leek
Abstract:
Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features for association with disease. We propose a new approach, called gene set bagging, for measuring the stability of ranking procedures using predefined gene sets. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate. This procedure can be thought of as bootstrapping gene-set analysis and can be used to determine which are the most reproducible gene sets. Results: Here we apply this approach to two common genomics applications: gene expression and DNA methylation. Even with state-of-the-art statistical ranking procedures, significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. Conclusions: We demonstrate that gene lists are not necessarily stable, and therefore additional steps like gene set bagging can improve biological inference of gene set analysis.
Submitted 17 January, 2013; v1 submitted 16 January, 2013;
originally announced January 2013.
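The resampling procedure described in the abstract above (resample the data, rerun the gene-set analysis, check which categories recur) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `gene_set_bagging`, the `significant_sets` callback (a stand-in for whatever gene-set analysis is used), and the choice to report a simple replication proportion are all hypothetical.

```python
import random


def gene_set_bagging(samples, significant_sets, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each gene set is significant.

    `samples` is a list of observations; `significant_sets` is a
    user-supplied function (an assumption of this sketch) that takes a
    resampled list and returns the names of gene sets called significant.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    rng = random.Random(seed)  # local generator for reproducibility
    counts = {}
    for _ in range(n_boot):
        # Bootstrap: draw len(samples) observations with replacement.
        resample = [rng.choice(samples) for _ in samples]
        # Rerun the gene-set analysis and tally each significant set once.
        for name in set(significant_sets(resample)):
            counts[name] = counts.get(name, 0) + 1

    # Replication proportion per gene set that was ever significant.
    return {name: hits / n_boot for name, hits in counts.items()}
```

A gene set with a proportion near 1 is called significant in almost every resample, while a low proportion flags a category whose significance depends heavily on the particular sample.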
-
Empirical estimates suggest most published medical research is true
Authors:
Leah R. Jager,
Jeffrey T. Leek
Abstract:
The accuracy of published medical research is critical for the scientists, physicians, and patients who rely on these results. But the fundamental belief in the medical literature was called into serious question by a paper suggesting most published medical research is false. Here we adapt estimation methods from the genomics community to the problem of estimating the rate of false positives in the medical literature using reported P-values as the data. We then collect P-values from the abstracts of all 77,430 papers published in The Lancet, The Journal of the American Medical Association, The New England Journal of Medicine, The British Medical Journal, and The American Journal of Epidemiology between 2000 and 2010. We estimate that the overall rate of false positives among reported results is 14% (s.d. 1%), contrary to previous claims. We also find there is not a significant increase in the estimated rate of reported false positive results over time (0.5% more FP per year, P = 0.18) or with respect to journal submissions (0.1% more FP per 100 submissions, P = 0.48). Statistical analysis must allow for false positives in order to make claims on the basis of noisy data. But our analysis suggests that the medical literature remains a reliable record of scientific progress.
Submitted 16 January, 2013;
originally announced January 2013.