-
rtables -- A Framework For Creating Complex Structured Reporting Tables Via Multi-Level Faceted Computations
Authors:
Gabriel Becker,
Adrian Waddell
Abstract:
Tables form a central component in both exploratory data analysis and formal reporting procedures across many industries. These tables are often complex in their conceptual structure and in the computations that generate their individual cell values. We introduce both a conceptual framework and a reference implementation for declaring, generating, rendering and modeling such tables. We place tables within the existing grammar of graphics paradigm for general statistical visualizations. Our open source `rtables` software implementation utilizes these connections to facilitate an intuitive way to declare complex table structure and construct those tables from data. In the course of this work, we relax several constraints present in the traditional grammar of graphics framing. Finally, `rtables` models instantiated tables as tree structures, which allows powerful, semantically meaningful and self-describing queries and manipulations of tables after creation. We showcase our framework in practice by creating complex, realistic example tables.
Submitted 28 June, 2023;
originally announced June 2023.
-
Quantifying Confounding Bias in Neuroimaging Datasets with Causal Inference
Authors:
Christian Wachinger,
Benjamin Gutierrez Becker,
Anna Rieckmann,
Sebastian Pölsterl
Abstract:
Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today alone are too small for training complex machine learning models. A potential solution is to increase sample size by pooling scans from several datasets. In this work, we combine 12,207 MRI scans from 15 studies and show that simple pooling is often ill-advised due to introducing various types of biases in the training data. First, we systematically define these biases. Second, we detect bias by experimentally showing that scans can be correctly assigned to their respective dataset with 73.3% accuracy. Finally, we propose to tell causal from confounding factors by quantifying the extent of confounding and causality in a single dataset using causal inference. We achieve this by finding the simplest graphical model in terms of Kolmogorov complexity. As Kolmogorov complexity is not directly computable, we employ the minimum description length to approximate it. We empirically show that our approach is able to estimate plausible causal relationships from real neuroimaging data.
Submitted 9 July, 2019;
originally announced July 2019.
-
trackr: A Framework for Enhancing Discoverability and Reproducibility of Data Visualizations and Other Artifacts in R
Authors:
Gabriel Becker,
Sara E. Moore,
Michael Lawrence
Abstract:
Research is an incremental, iterative process, with new results relying and building upon previous ones. Scientists need to find, retrieve, understand, and verify results in order to confidently extend them, even when the results are their own. We present the trackr framework for organizing, automatically annotating, discovering, and retrieving results. We identify sources of automatically extractable metadata for computational results, and we define an extensible system for organizing, annotating, and searching for results based on these and other metadata. We present an open-source implementation of these concepts for plots, computational artifacts, and woven dynamic reports generated in the R statistical computing language.
Submitted 13 June, 2017;
originally announced June 2017.
-
Enhancing reproducibility and collaboration via management of R package cohorts
Authors:
Gabriel Becker,
Cory Barr,
Robert Gentleman,
Michael Lawrence
Abstract:
Science depends on collaboration, result reproduction, and the development of supporting software tools. Each of these requires careful management of software versions. We present a unified model for installing, managing, and publishing software contexts in R. It introduces the package manifest as a central data structure for representing version specific, decentralized package cohorts. The manifest points to package sources on arbitrary hosts and in various forms, including tarballs and directories under version control. We provide a high-level interface for creating and switching between side-by-side package libraries derived from manifests. Finally, we extend package installation to support the retrieval of exact package versions as indicated by manifests, and to maintain provenance for installed packages. The provenance information enables the user to publish libraries or sessions as manifests, hence completing the loop between publication and deployment. We have implemented this model across two software packages, switchr and GRANbase, and have released the source code under the Artistic 2.0 license.
Submitted 14 January, 2015; v1 submitted 9 January, 2015;
originally announced January 2015.