-
Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction
Authors:
Marc-Andre Schulz,
Bertrand Thirion,
Alexandre Gramfort,
Gaël Varoquaux,
Danilo Bzdok
Abstract:
High-quality data accumulation is now becoming ubiquitous in the health domain. There is increasing opportunity to exploit rich data from normal subjects to improve supervised estimators in specific diseases with notorious data scarcity. We demonstrate that low-dimensional embedding spaces can be derived from the UK Biobank population dataset and used to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics. Phenotype predictions facilitated by Variational Autoencoder manifolds typically scaled better with increasing unlabeled data than dimensionality reduction by PCA or Isomap. Performance gains from semi-supervision approaches will probably become an important ingredient for various medical data science applications.
Submitted 12 October, 2021;
originally announced October 2021.
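The two-stage idea in this abstract (learn an embedding from plentiful unlabeled data, then fit a supervised model on a few labeled samples in that embedding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the data are synthetic, PCA (one of the baselines named in the abstract) stands in for the embedding, ridge regression is an assumed choice of supervised estimator, and all names (`simulate`, `embed`, `fit_ridge`, `predict`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 5 latent factors drive 50 observed features.
n_unlabeled, n_labeled, n_features, n_latent = 2000, 40, 50, 5
mixing = rng.normal(size=(n_latent, n_features))


def simulate(n):
    """Draw n samples; return latent factors and noisy observed features."""
    z = rng.normal(size=(n, n_latent))
    x = z @ mixing + 0.5 * rng.normal(size=(n, n_features))
    return z, x


_, X_unlabeled = simulate(n_unlabeled)        # large pool without labels
Z_labeled, X_labeled = simulate(n_labeled)    # small labeled sample
y_labeled = Z_labeled[:, 0] + 0.1 * rng.normal(size=n_labeled)

# Stage 1: learn the embedding from unlabeled data only (PCA via SVD).
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
components = Vt[:n_latent]                    # shape (n_latent, n_features)


def embed(X):
    """Project samples onto the principal components learned above."""
    return (X - mean) @ components.T


# Stage 2: fit ridge regression on the embedded labeled samples.
def fit_ridge(E, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    E1 = np.hstack([E, np.ones((E.shape[0], 1))])
    penalty = alpha * np.eye(E1.shape[1])
    penalty[-1, -1] = 0.0                     # do not shrink the intercept
    return np.linalg.solve(E1.T @ E1 + penalty, E1.T @ y)


def predict(E, w):
    return E @ w[:-1] + w[-1]


weights = fit_ridge(embed(X_labeled), y_labeled)

# Apply the same embedding and model to new samples.
_, X_test = simulate(200)
y_pred = predict(embed(X_test), weights)
```

Swapping `embed` for a nonlinear encoder would leave the second stage unchanged, which is what makes the comparison between embedding methods straightforward.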
-
Clusters in Explanation Space: Inferring disease subtypes from model explanations
Authors:
Marc-Andre Schulz,
Matt Chapman-Rounds,
Manisha Verma,
Danilo Bzdok,
Konstantinos Georgatzis
Abstract:
Identification of disease subtypes and corresponding biomarkers can substantially improve clinical diagnosis and treatment selection. Discovering these subtypes in noisy, high-dimensional biomedical data is often impossible for humans and challenging for machines. We introduce a new approach to facilitate the discovery of disease subtypes: Instead of analyzing the original data, we train a diagnostic classifier (healthy vs. diseased) and extract instance-wise explanations for the classifier's decisions. The distribution of instances in the explanation space of our diagnostic classifier amplifies the different reasons for belonging to the same class, resulting in a representation that is uniquely useful for discovering latent subtypes. We compare our ability to recover subtypes via cluster analysis on model explanations to classical cluster analysis on the original data. In multiple datasets with known ground-truth subclasses, most compellingly on UK Biobank brain imaging data and transcriptome data from the Cancer Genome Atlas, we show that cluster analysis on model explanations substantially outperforms the classical approach. While we believe clustering in explanation space to be particularly valuable for inferring disease subtypes, the method is more general and applicable to any kind of subtype identification.
Submitted 14 May, 2020; v1 submitted 18 December, 2019;
originally announced December 2019.
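The three steps described in this abstract (train a healthy-vs-diseased classifier, extract instance-wise explanations, cluster the diseased instances in explanation space) can be sketched as follows. This is only an illustration under strong assumptions: the data are synthetic, the classifier is a plain logistic regression, and the "explanation" is the per-feature linear contribution `w_j * (x_j - mean_j)`, which is swapped in here as the simplest attribution and is not claimed to be the explanation method used in the paper. All function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): two disease subtypes, each shifting
# a different feature; the remaining features are pure noise.
n_per_group, n_features = 200, 20
healthy = rng.normal(size=(n_per_group, n_features))
subtype_a = rng.normal(size=(n_per_group, n_features))
subtype_a[:, 0] += 3.0
subtype_b = rng.normal(size=(n_per_group, n_features))
subtype_b[:, 1] += 3.0
X = np.vstack([healthy, subtype_a, subtype_b])
y = np.concatenate([np.zeros(n_per_group), np.ones(2 * n_per_group)])


# Step 1: diagnostic classifier (logistic regression, batch gradient descent).
def fit_logistic(X, y, lr=0.1, n_iter=500):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


w, b = fit_logistic(X, y)


# Step 2: instance-wise explanations. For a linear model, each feature's
# contribution to the logit relative to the average sample is w_j * (x_j - mean_j).
def explain(X, w, background_mean):
    return (X - background_mean) * w


diseased = X[y == 1]
explanations = explain(diseased, w, X.mean(axis=0))


# Step 3: cluster in explanation space (basic k-means with fixed iterations).
def kmeans(data, k, n_iter=50, seed=0):
    local_rng = np.random.default_rng(seed)
    centers = data[local_rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return labels


labels_explained = kmeans(explanations, k=2)   # clustering on explanations
labels_raw = kmeans(diseased, k=2)             # baseline: clustering on raw data
```

With a linear classifier the explanation space is just a reweighting of the input features, so features the classifier ignores are shrunk toward zero before clustering. Nonlinear classifiers with richer attribution methods would give a less trivial transformation.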
-
Exploration, inference and prediction in neuroscience and biomedicine
Authors:
Danilo Bzdok,
John Ioannidis
Abstract:
The last decades saw dramatic progress in brain research. These advances were often buttressed by probing single variables to make circumscribed discoveries, typically through null hypothesis significance testing. New ways for generating massive data fueled tension between the traditional methodology, used to infer statistically relevant effects in carefully-chosen variables, and pattern-learning algorithms, used to identify predictive signatures by searching through abundant information. In this article, we detail the antagonistic philosophies behind two quantitative approaches: certifying robust effects in understandable variables, and evaluating how accurately a built model can forecast future outcomes. We discourage choosing analysis tools via categories like 'statistics' or 'machine learning'. Rather, to establish reproducible knowledge about the brain, we advocate prioritizing tools in view of the core motivation of each quantitative analysis: aiming towards mechanistic insight, or optimizing predictive accuracy.
Submitted 21 February, 2019;
originally announced March 2019.
-
Learning Neural Representations of Human Cognition across Many fMRI Studies
Authors:
Arthur Mensch,
Julien Mairal,
Danilo Bzdok,
Bertrand Thirion,
Gaël Varoquaux
Abstract:
Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets. This opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive processes/psychological tasks to brain networks? We cast this challenge in a machine-learning approach to predict conditions from statistical brain maps across different studies. For this, we leverage multi-task learning and multi-scale dimension reduction to learn low-dimensional representations of brain images that carry cognitive information and can be robustly associated with psychological stimuli. Our multi-dataset classification model achieves the best prediction performance on several large reference datasets compared to models without cognitive-aware low-dimensional representations; it brings a substantial performance boost to the analysis of small datasets and can be introspected to identify universal template cognitive concepts.
Submitted 10 November, 2017; v1 submitted 31 October, 2017;
originally announced October 2017.
-
The Future of Data Analysis in the Neurosciences
Authors:
Danilo Bzdok,
B. T. Thomas Yeo
Abstract:
Neuroscience is undergoing faster changes than ever before. For over 100 years, our field qualitatively described and invasively manipulated single or few organisms to gain anatomical, physiological, and pharmacological insights. In the last 10 years, neuroscience has spawned quantitative big-sample datasets on microanatomy, synaptic connections, optogenetic brain-behavior assays, and high-level cognition. While growing data availability and information granularity have been amply discussed, we direct attention to a routinely neglected question: How will the unprecedented data richness shape data analysis practices? Statistical reasoning is becoming more central to distilling neurobiological knowledge from healthy and pathological brain recordings. We believe that large-scale data analysis will use more models that are non-parametric, generative, mixing frequentist and Bayesian aspects, and grounded in different statistical inferences.
Submitted 5 August, 2016;
originally announced August 2016.
-
Classical Statistics and Statistical Learning in Imaging Neuroscience
Authors:
Danilo Bzdok
Abstract:
Neuroimaging research has predominantly drawn conclusions based on classical statistics, including null-hypothesis testing, t-tests, and ANOVA. In recent years, statistical learning methods have enjoyed increasing popularity, including cross-validation, pattern classification, and sparsity-inducing regression. These two methodological families used for neuroimaging data analysis can be viewed as two extremes of a continuum. Yet, they originated from different historical contexts, build on different theories, rest on different assumptions, evaluate different outcome metrics, and permit different conclusions. This paper portrays commonalities and differences between classical statistics and statistical learning with their relation to neuroimaging research. The conceptual implications are illustrated in three common analysis scenarios. The paper thus attempts to resolve possible confusion between classical hypothesis testing and data-guided model estimation by discussing their ramifications for the neuroimaging access to neurobiology.
Submitted 4 May, 2016; v1 submitted 6 March, 2016;
originally announced March 2016.