-
How causal perspectives can inform neuroscience data analysis
Authors:
Eric W. Bridgeford,
Brian S. Caffo,
Maya B. Mathur,
Russell A. Poldrack
Abstract:
Over the past two decades, considerable strides have been made in advancing neuroscientific techniques, yet challenges remain in attributing causality to observed associations. This review addresses a fundamental issue in observational neuroscience studies and advocates for incorporating causal inference frameworks into standard practice. We systematically introduce necessary definitions and concepts, emphasizing how causal assumptions underlie statistical analyses even when not explicitly stated. Through a running example on sleep quality and white matter integrity, we illustrate how persistent challenges, including confounding and selection biases, can be conceptualized and addressed using causal frameworks. We demonstrate practical approaches for making assumption violations transparent through hands-on examples: supplementary case studies using multi-site harmonization and head motion exclusion procedures provide step-by-step diagnostic techniques for checking covariate overlap and identifying selection bias through exclusion pattern analysis. We explore how these causal perspectives can inform both experimental design and analytical choices, particularly for observational studies where traditional randomization is infeasible. Together, we believe this framework offers concrete tools for strengthening causal interpretations and inspiring more robust approaches to problems in neuroscience.
Submitted 4 September, 2025; v1 submitted 12 March, 2025;
originally announced March 2025.
-
Learning sources of variability from high-dimensional observational studies
Authors:
Eric W. Bridgeford,
Jaewon Chung,
Brian Gilbert,
Sambit Panda,
Adam Li,
Cencheng Shen,
Alexandra Badea,
Brian Caffo,
Joshua T. Vogelstein
Abstract:
Causal inference studies whether the presence of a variable influences an observed outcome. As measured by quantities such as the "average treatment effect," this paradigm is employed across numerous biological fields, from vaccine and drug development to policy interventions. Unfortunately, the majority of these methods are often limited to univariate outcomes. Our work generalizes causal estimands to outcomes with any number of dimensions or any measurable space, and formulates traditional causal estimands for nominal variables as causal discrepancy tests. We propose a simple technique for adjusting universally consistent conditional independence tests and prove that these tests are universally consistent causal discrepancy tests. Numerical experiments illustrate that our method, Causal CDcorr, leads to improvements in both finite sample validity and power when compared to existing strategies. Our methods are all open source and available at github.com/ebridge2/cdcorr.
Submitted 28 November, 2023; v1 submitted 25 July, 2023;
originally announced July 2023.
-
Multiscale Comparative Connectomics
Authors:
Vivek Gopalakrishnan,
Jaewon Chung,
Eric Bridgeford,
Benjamin D. Pedigo,
Jesús Arroyo,
Lucy Upchurch,
G. Allan Johnson,
Nian Wang,
Youngser Park,
Carey E. Priebe,
Joshua T. Vogelstein
Abstract:
The connectome, a map of the structural and/or functional connections in the brain, provides a complex representation of the neurobiological phenotypes on which it supervenes. This information-rich data modality has the potential to transform our understanding of the relationship between patterns in brain connectivity and neurological processes, disorders, and diseases. However, existing computational techniques used to analyze connectomes are oftentimes insufficient for interrogating multi-subject connectomics datasets: many current methods are either solely designed to analyze single connectomes or leverage heuristic graph statistics that are unable to capture the complete topology of multiscale connections between brain regions. To enable more rigorous connectomics analysis, we introduce a set of robust and interpretable effect size measures motivated by recent theoretical advances in random graph models. These measures facilitate simultaneous analysis of multiple connectomes across different scales of network topology, enabling the robust and reproducible discovery of hierarchical brain structures that vary in relation to phenotypic profiles. In addition to explaining the theoretical foundations and guarantees of our algorithms, we demonstrate their superiority over current state-of-the-art connectomics methods through extensive simulation studies and real-data experiments. Using a set of high-resolution connectomes obtained from genetically distinct mouse strains (including the BTBR mouse -- a standard model of autism -- and three behavioral wild-types), we illustrate how our methods successfully uncover latent information in multi-subject connectomics data and yield valuable insights into the connective correlates of neurological phenotypes that other methods do not capture. The data and code necessary to reproduce our analyses are available at https://github.com/neurodata/MCC.
Submitted 2 December, 2024; v1 submitted 30 November, 2020;
originally announced November 2020.
-
Statistical Analysis of Data Repeatability Measures
Authors:
Zeyi Wang,
Eric Bridgeford,
Shangsi Wang,
Joshua T. Vogelstein,
Brian Caffo
Abstract:
The advent of modern data collection and processing techniques has seen the size, scale, and complexity of data grow exponentially. A seminal step in leveraging these rich datasets for downstream inference is understanding the characteristics of the data which are repeatable -- the aspects of the data that are able to be identified under a duplicated analysis. Conflictingly, the utility of traditional repeatability measures, such as the intraclass correlation coefficient, under these settings is limited. In recent work, novel data repeatability measures have been introduced in the context where a set of subjects are measured twice or more, including: fingerprinting, rank sums, and generalizations of the intraclass correlation coefficient. However, the relationships between, and the best practices among, these measures remain largely unknown. In this manuscript, we formalize a novel repeatability measure, discriminability. We show that it is deterministically linked with the correlation coefficient under univariate random effect models, and has the desired property of optimal accuracy for inferential tasks using multivariate measurements. Additionally, we overview and systematically compare repeatability statistics using both theoretical results and simulations. We show that the rank sum statistic is deterministically linked to a consistent estimator of discriminability. The power of permutation tests derived from these measures is compared numerically under Gaussian and non-Gaussian settings, with and without simulated batch effects. Motivated by both theoretical and empirical results, we provide methodological recommendations for each benchmark setting to serve as a resource for future analyses. We believe these recommendations will play an important role towards improving repeatability in fields such as functional magnetic resonance imaging, genomics, pharmacology, and more.
Submitted 28 July, 2024; v1 submitted 24 May, 2020;
originally announced May 2020.
-
hyppo: A Multivariate Hypothesis Testing Python Package
Authors:
Sambit Panda,
Satish Palaniappan,
Junhao Xiong,
Eric W. Bridgeford,
Ronak Mehta,
Cencheng Shen,
Joshua T. Vogelstein
Abstract:
We introduce hyppo, a unified library for performing multivariate hypothesis testing, including independence, two-sample, and k-sample testing. While many multivariate independence tests have R packages available, the interfaces are inconsistent and most are not available in Python. hyppo includes many state-of-the-art multivariate testing procedures. The package is easy to use and is flexible enough to enable future extensions. The documentation and all releases are available at https://hyppo.neurodata.io.
Submitted 12 September, 2024; v1 submitted 3 July, 2019;
originally announced July 2019.
-
GraSPy: Graph Statistics in Python
Authors:
Jaewon Chung,
Benjamin D. Pedigo,
Eric W. Bridgeford,
Bijan K. Varjavand,
Hayden S. Helm,
Joshua T. Vogelstein
Abstract:
We introduce GraSPy, a Python library devoted to statistical inference, machine learning, and visualization of random graphs and graph populations. This package provides flexible and easy-to-use algorithms for analyzing and understanding graphs with a scikit-learn compliant API. GraSPy can be downloaded from the Python Package Index (PyPI), and is released under the Apache 2.0 open-source license. The documentation and all releases are available at https://neurodata.io/graspy.
Submitted 14 August, 2019; v1 submitted 29 March, 2019;
originally announced April 2019.
-
On a 'Two Truths' Phenomenon in Spectral Graph Clustering
Authors:
Carey E. Priebe,
Youngser Park,
Joshua T. Vogelstein,
John M. Conroy,
Vince Lyzinski,
Minh Tang,
Avanti Athreya,
Joshua Cape,
Eric Bridgeford
Abstract:
Clustering is concerned with coherently grouping observations without any explicit concept of true groupings. Spectral graph clustering - clustering the vertices of a graph based on their spectral embedding - is commonly approached via K-means (or, more generally, Gaussian mixture model) clustering composed with either Laplacian or Adjacency spectral embedding (LSE or ASE). Recent theoretical results provide new understanding of the problem and solutions, and lead us to a 'Two Truths' LSE vs. ASE spectral graph clustering phenomenon convincingly illustrated here via a diffusion MRI connectome data set: the different embedding methods yield different clustering results, with LSE capturing left hemisphere/right hemisphere affinity structure and ASE capturing gray matter/white matter core-periphery structure.
Submitted 11 February, 2019; v1 submitted 23 August, 2018;
originally announced August 2018.
-
Vertex nomination: The canonical sampling and the extended spectral nomination schemes
Authors:
Jordan Yoder,
Li Chen,
Henry Pao,
Eric Bridgeford,
Keith Levin,
Donniell Fishkind,
Carey Priebe,
Vince Lyzinski
Abstract:
Suppose that one particular block in a stochastic block model is of interest, but block labels are only observed for a few of the vertices in the network. Utilizing a graph realized from the model and the observed block labels, the vertex nomination task is to order the vertices with unobserved block labels into a ranked nomination list with the goal of having an abundance of interesting vertices near the top of the list. There are vertex nomination schemes in the literature, including the optimally precise canonical nomination scheme $\mathcal{L}^C$ and the consistent spectral partitioning nomination scheme $\mathcal{L}^P$. While the canonical nomination scheme $\mathcal{L}^C$ is provably optimally precise, it is computationally intractable, being impractical to implement even on modestly sized graphs.
With this in mind, an approximation of the canonical scheme, denoted the canonical sampling nomination scheme $\mathcal{L}^{CS}$, is introduced; $\mathcal{L}^{CS}$ relies on a scalable, Markov chain Monte Carlo-based approximation of $\mathcal{L}^{C}$, and converges to $\mathcal{L}^{C}$ as the amount of sampling goes to infinity. The spectral partitioning nomination scheme is also extended to the extended spectral partitioning nomination scheme, $\mathcal{L}^{EP}$, which introduces a novel semisupervised clustering framework to improve upon the precision of $\mathcal{L}^P$. Real-data and simulation experiments are employed to illustrate the precision of these vertex nomination schemes, as well as their empirical computational complexity.
Keywords: vertex nomination, Markov chain Monte Carlo, spectral partitioning, Mclust
MSC[2010]: 60J22, 65C40, 62H30, 62H25
Submitted 22 January, 2020; v1 submitted 14 February, 2018;
originally announced February 2018.
-
Supervised Dimensionality Reduction for Big Data
Authors:
Joshua T. Vogelstein,
Eric Bridgeford,
Minh Tang,
Da Zheng,
Christopher Douville,
Randal Burns,
Mauro Maggioni
Abstract:
To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach, XOX, to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, "Linear Optimal Low-rank" projection (LOL), incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that LOL and its generalizations in the XOX framework lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of >150 million features, and several genomics datasets with >500,000 features, LOL outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while only requiring a few minutes on a standard desktop computer.
Submitted 23 January, 2021; v1 submitted 5 September, 2017;
originally announced September 2017.
-
Discovering and Deciphering Relationships Across Disparate Data Modalities
Authors:
Joshua T. Vogelstein,
Eric Bridgeford,
Qing Wang,
Carey E. Priebe,
Mauro Maggioni,
Cencheng Shen
Abstract:
Understanding the relationships between different properties of data, such as whether a connectome or genome has information about disease status, is becoming increasingly important in modern biological datasets. While existing approaches can test whether two properties are related, they often require unfeasibly large sample sizes in real data scenarios, and do not provide any insight into how or why the procedure reached its decision. Our approach, "Multiscale Graph Correlation" (MGC), is a dependence test that juxtaposes previously disparate data science techniques, including k-nearest neighbors, kernel methods (such as support vector machines), and multiscale analysis (such as wavelets). Other methods typically require double or triple the number of samples to achieve the same statistical power as MGC in a benchmark suite including high-dimensional and nonlinear relationships - spanning polynomial (linear, quadratic, cubic), trigonometric (sinusoidal, circular, ellipsoidal, spiral), geometric (square, diamond, W-shape), and other functions, with dimensionality ranging from 1 to 1000. Moreover, MGC uniquely provides a simple and elegant characterization of the potentially complex latent geometry underlying the relationship, providing insight while maintaining computational efficiency. In several real data applications, including brain imaging and cancer genetics, MGC is the only method that can both detect the presence of a dependency and provide specific guidance for the next experiment and/or analysis to conduct.
Submitted 6 December, 2018; v1 submitted 16 September, 2016;
originally announced September 2016.