-
The Explanation Necessity for Healthcare AI
Authors:
Michail Mamalakis,
Héloïse de Vareilles,
Graham Murray,
Pietro Lio,
John Suckling
Abstract:
Explainability is a critical factor in enhancing the trustworthiness and acceptance of artificial intelligence (AI) in healthcare, where decisions directly impact patient outcomes. Despite advancements in AI interpretability, clear guidelines on when and to what extent explanations are required in medical applications remain lacking. We propose a novel categorization system comprising four classes…
▽ More
Explainability is a critical factor in enhancing the trustworthiness and acceptance of artificial intelligence (AI) in healthcare, where decisions directly impact patient outcomes. Despite advancements in AI interpretability, clear guidelines on when and to what extent explanations are required in medical applications remain lacking. We propose a novel categorization system comprising four classes of explanation necessity (self-explainable, semi-explainable, non-explainable, and new-patterns discovery), guiding the required level of explanation; whether local (patient or sample level), global (cohort or dataset level), or both. To support this system, we introduce a mathematical formulation that incorporates three key factors: (i) robustness of the evaluation protocol, (ii) variability of expert observations, and (iii) representation dimensionality of the application. This framework provides a practical tool for researchers to determine the appropriate depth of explainability needed, addressing the critical question: When does an AI medical application need to be explained, and at what level of detail?
△ Less
Submitted 28 February, 2025; v1 submitted 31 May, 2024;
originally announced June 2024.
-
Contrastive-Adversarial and Diffusion: Exploring pre-training and fine-tuning strategies for sulcal identification
Authors:
Michail Mamalakis,
Héloïse de Vareilles,
Shun-Chin Jim Wu,
Ingrid Agartz,
Lynn Egeland Mørch-Johnsen,
Jane Garrison,
Jon Simons,
Pietro Lio,
John Suckling,
Graham Murray
Abstract:
In the last decade, computer vision has witnessed the establishment of various training and learning approaches. Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard, representing state-of-the-art methods extensively employed for fully training or pre-training networks across various vision tasks. The ex…
▽ More
In the last decade, computer vision has witnessed the establishment of various training and learning approaches. Techniques like adversarial learning, contrastive learning, diffusion denoising learning, and ordinary reconstruction learning have become standard, representing state-of-the-art methods extensively employed for fully training or pre-training networks across various vision tasks. The exploration of fine-tuning approaches has emerged as a current focal point, addressing the need for efficient model tuning with reduced GPU memory usage and time costs while enhancing overall performance, as exemplified by methodologies like low-rank adaptation (LoRA). Key questions arise: which pre-training technique yields optimal results - adversarial, contrastive, reconstruction, or diffusion denoising? How does the performance of these approaches vary as the complexity of fine-tuning is adjusted? This study aims to elucidate the advantages of pre-training techniques and fine-tuning strategies to enhance the learning process of neural networks in independent identical distribution (IID) cohorts. We underscore the significance of fine-tuning by examining various cases, including full tuning, decoder tuning, top-level tuning, and fine-tuning of linear parameters using LoRA. Systematic summaries of model performance and efficiency are presented, leveraging metrics such as accuracy, time cost, and memory efficiency. To empirically demonstrate our findings, we focus on a multi-task segmentation-classification challenge involving the paracingulate sulcus (PCS) using different 3D Convolutional Neural Network (CNN) architectures by using the TOP-OSLO cohort comprising 596 subjects.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
Solving the enigma: Enhancing faithfulness and comprehensibility in explanations of deep networks
Authors:
Michail Mamalakis,
Antonios Mamalakis,
Ingrid Agartz,
Lynn Egeland Mørch-Johnsen,
Graham Murray,
John Suckling,
Pietro Lio
Abstract:
The accelerated progress of artificial intelligence (AI) has popularized deep learning models across various domains, yet their inherent opacity poses challenges, particularly in critical fields like healthcare, medicine, and the geosciences. Explainable AI (XAI) has emerged to shed light on these 'black box' models, aiding in deciphering their decision-making processes. However, different XAI met…
▽ More
The accelerated progress of artificial intelligence (AI) has popularized deep learning models across various domains, yet their inherent opacity poses challenges, particularly in critical fields like healthcare, medicine, and the geosciences. Explainable AI (XAI) has emerged to shed light on these 'black box' models, aiding in deciphering their decision-making processes. However, different XAI methods often produce significantly different explanations, leading to high inter-method variability that increases uncertainty and undermines trust in deep networks' predictions. In this study, we address this challenge by introducing a novel framework designed to enhance the explainability of deep networks through a dual focus on maximizing both accuracy and comprehensibility in the explanations. Our framework integrates outputs from multiple established XAI methods and leverages a non-linear neural network model, termed the 'explanation optimizer,' to construct a unified, optimal explanation. The optimizer evaluates explanations using two key metrics: faithfulness (accuracy in reflecting the network's decisions) and complexity (comprehensibility). By balancing these, it provides accurate and accessible explanations, addressing a key XAI limitation. Experiments on multi-class and binary classification in 2D object and 3D neuroscience imaging confirm its efficacy. Our optimizer achieved faithfulness scores 155% and 63% higher than the best XAI methods in 3D and 2D tasks, respectively, while also reducing complexity for better understanding. These results demonstrate that optimal explanations based on specific quality criteria are achievable, offering a solution to the issue of inter-method variability in the current XAI literature and supporting more trustworthy deep network predictions
△ Less
Submitted 28 February, 2025; v1 submitted 16 May, 2024;
originally announced May 2024.
-
Statistical Agnostic Regression: a machine learning method to validate regression models
Authors:
Juan M Gorriz,
J. Ramirez,
F. Segovia,
F. J. Martinez-Murcia,
C. Jiménez-Mesa,
J. Suckling
Abstract:
Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive model…
▽ More
Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources. Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regressions, often form the foundation for more advanced machine learning (ML) techniques, which have been successfully applied, though without a formal definition of statistical significance. At most, permutation or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection. In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least $1-η$, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables. Simulations demonstrate the ability of the proposed agnostic (non-parametric) test to provide an analysis of variance similar to the classical multivariate $F$-test for the slope parameter, without relying on the underlying assumptions of classical methods. Moreover, the residuals computed from this method represent a trade-off between those obtained from ML approaches and the classical OLS.
△ Less
Submitted 9 November, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
Is K-fold cross validation the best model selection method for Machine Learning?
Authors:
Juan M Gorriz,
R. Martin Clemente,
F Segovia,
J Ramirez,
A Ortiz,
J. Suckling
Abstract:
As a technique that can compactly represent complex patterns, machine learning has significant potential for predictive inference. K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance, and it frequently outperforms conventional hypothesis testing. This improvement uses measures directly obtained from machine…
▽ More
As a technique that can compactly represent complex patterns, machine learning has significant potential for predictive inference. K-fold cross-validation (CV) is the most common approach to ascertaining the likelihood that a machine learning outcome is generated by chance, and it frequently outperforms conventional hypothesis testing. This improvement uses measures directly obtained from machine learning classifications, such as accuracy, that do not have a parametric description. To approach a frequentist analysis within machine learning pipelines, a permutation test or simple statistics from data partitions (i.e., folds) can be added to estimate confidence intervals. Unfortunately, neither parametric nor non-parametric tests solve the inherent problems of partitioning small sample-size datasets and learning from heterogeneous data sources. The fact that machine learning strongly depends on the learning parameters and the distribution of data across folds recapitulates familiar difficulties around excess false positives and replication. A novel statistical test based on K-fold CV and the Upper Bound of the actual risk (K-fold CUBV) is proposed, where uncertain predictions of machine learning with CV are bounded by the worst case through the evaluation of concentration inequalities. Probably Approximately Correct-Bayesian upper bounds for linear classifiers in combination with K-fold CV are derived and used to estimate the actual risk. The performance with simulated and neuroimaging datasets suggests that K-fold CUBV is a robust criterion for detecting effects and validating accuracy values obtained from machine learning and classical CV schemes, while avoiding excess false positives.
△ Less
Submitted 8 November, 2024; v1 submitted 29 January, 2024;
originally announced January 2024.
-
An explainable three dimension framework to uncover learning patterns: A unified look in variable sulci recognition
Authors:
Michail Mamalakis,
Heloise de Vareilles,
Atheer AI-Manea,
Samantha C. Mitchell,
Ingrid Arartz,
Lynn Egeland Morch-Johnsen,
Jane Garrison,
Jon Simons,
Pietro Lio,
John Suckling,
Graham Murray
Abstract:
The significant features identified in a representative subset of the dataset during the learning process of an artificial intelligence model are referred to as a 'global' explanation. 3D global explanations are crucial in neuroimaging, where a complex representational space demands more than basic 2D interpretations. However, current studies in the literature often lack the accuracy, comprehensib…
▽ More
The significant features identified in a representative subset of the dataset during the learning process of an artificial intelligence model are referred to as a 'global' explanation. 3D global explanations are crucial in neuroimaging, where a complex representational space demands more than basic 2D interpretations. However, current studies in the literature often lack the accuracy, comprehensibility, and 3D global explanations needed in neuroimaging and beyond. To address this gap, we developed an explainable artificial intelligence (XAI) 3D-Framework capable of providing accurate, low-complexity global explanations. We evaluated the framework using various 3D deep learning models trained on a well-annotated cohort of 596 structural MRIs. The binary classification task focused on detecting the presence or absence of the paracingulate sulcus, a highly variable brain structure associated with psychosis. Our framework integrates statistical features (Shape) and XAI methods (GradCam and SHAP) with dimensionality reduction, ensuring that explanations reflect both model learning and cohort-specific variability. By combining Shape, GradCam, and SHAP, our framework reduces inter-method variability, enhancing the faithfulness and reliability of global explanations. These robust explanations facilitated the identification of critical sub-regions, including the posterior temporal and internal parietal regions, as well as the cingulate region and thalamus, suggesting potential genetic or developmental influences.
Our XAI 3D-Framework leverages global explanations to uncover the broader developmental context of specific cortical features. This approach advances the fields of deep learning and neuroscience by offering insights into normative brain development and atypical trajectories linked to mental illness, paving the way for more reliable and interpretable AI applications in neuroimaging.
△ Less
Submitted 28 November, 2024; v1 submitted 2 September, 2023;
originally announced September 2023.
-
A hypothesis-driven method based on machine learning for neuroimaging data analysis
Authors:
JM Gorriz,
R. Martin-Clemente,
C. G. Puntonet,
A. Ortiz,
J. Ramirez,
J. Suckling
Abstract:
There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is…
▽ More
There remains an open question about the usefulness and the interpretation of Machine learning (MLE) approaches for discrimination of spatial patterns of brain images between samples or activation states. In the last few decades, these approaches have limited their operation to feature extraction and linear classification tasks for between-group inference. In this context, statistical inference is assessed by randomly permuting image labels or by the use of random effect models that consider between-subject variability. These multivariate MLE-based statistical pipelines, whilst potentially more effective for detecting activations than hypotheses-driven methods, have lost their mathematical elegance, ease of interpretation, and spatial localization of the ubiquitous General linear Model (GLM). Recently, the estimation of the conventional GLM has been demonstrated to be connected to an univariate classification task when the design matrix is expressed as a binary indicator matrix. In this paper we explore the complete connection between the univariate GLM and MLE \emph{regressions}. To this purpose we derive a refined statistical test with the GLM based on the parameters obtained by a linear Support Vector Regression (SVR) in the \emph{inverse} problem (SVR-iGLM). Subsequently, random field theory (RFT) is employed for assessing statistical significance following a conventional GLM benchmark. Experimental results demonstrate how parameter estimations derived from each model (mainly GLM and SVR) result in different experimental design estimates that are significantly related to the predefined functional task. Moreover, using real data from a multisite initiative the proposed MLE-based inference demonstrates statistical power and the control of false positives, outperforming the regular GLM.
△ Less
Submitted 17 February, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Deep Learning in current Neuroimaging: a multivariate approach with power and type I error control but arguable generalization ability
Authors:
Carmen Jiménez-Mesa,
Javier Ramírez,
John Suckling,
Jonathan Vöglein,
Johannes Levin,
Juan Manuel Górriz,
Alzheimer's Disease Neuroimaging Initiative ADNI,
Dominantly Inherited Alzheimer Network DIAN
Abstract:
Discriminative analysis in neuroimaging by means of deep/machine learning techniques is usually tested with validation techniques, whereas the associated statistical significance remains largely under-developed due to their computational complexity. In this work, a non-parametric framework is proposed that estimates the statistical significance of classifications using deep learning architectures.…
▽ More
Discriminative analysis in neuroimaging by means of deep/machine learning techniques is usually tested with validation techniques, whereas the associated statistical significance remains largely under-developed due to their computational complexity. In this work, a non-parametric framework is proposed that estimates the statistical significance of classifications using deep learning architectures. In particular, a combination of autoencoders (AE) and support vector machines (SVM) is applied to: (i) a one-condition, within-group designs often of normal controls (NC) and; (ii) a two-condition, between-group designs which contrast, for example, Alzheimer's disease (AD) patients with NC (the extension to multi-class analyses is also included). A random-effects inference based on a label permutation test is proposed in both studies using cross-validation (CV) and resubstitution with upper bound correction (RUB) as validation methods. This allows both false positives and classifier overfitting to be detected as well as estimating the statistical power of the test. Several experiments were carried out using the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, the Dominantly Inherited Alzheimer Network (DIAN) dataset, and a MCI prediction dataset. We found in the permutation test that CV and RUB methods offer a false positive rate close to the significance level and an acceptable statistical power (although lower using cross-validation). A large separation between training and test accuracies using CV was observed, especially in one-condition designs. This implies a low generalization ability as the model fitted in training is not informative with respect to the test set. We propose as solution by applying RUB, whereby similar results are obtained to those of the CV test set, but considering the whole set and with a lower computational cost per iteration.
△ Less
Submitted 30 March, 2021;
originally announced March 2021.
-
A connection between the pattern classification problem and the General Linear Model for statistical inference
Authors:
Juan Manuel Gorriz,
SIPBA group,
John Suckling
Abstract:
A connection between the General Linear Model (GLM) in combination with classical statistical inference and the machine learning (MLE)-based inference is described in this paper. Firstly, the estimation of the GLM parameters is expressed as a Linear Regression Model (LRM) of an indicator matrix, that is, in terms of the inverse problem of regressing the observations. In other words, both approache…
▽ More
A connection between the General Linear Model (GLM) in combination with classical statistical inference and the machine learning (MLE)-based inference is described in this paper. Firstly, the estimation of the GLM parameters is expressed as a Linear Regression Model (LRM) of an indicator matrix, that is, in terms of the inverse problem of regressing the observations. In other words, both approaches, i.e. GLM and LRM, apply to different domains, the observation and the label domains, and are linked by a normalization value at the least-squares solution. Subsequently, from this relationship we derive a statistical test based on a more refined predictive algorithm, i.e. the (non)linear Support Vector Machine (SVM) that maximizes the class margin of separation, within a permutation analysis. The MLE-based inference employs a residual score and includes the upper bound to compute a better estimation of the actual (real) error. Experimental results demonstrate how the parameter estimations derived from each model resulted in different classification performances in the equivalent inverse problem. Moreover, using real data the aforementioned predictive algorithms within permutation tests, including such model-free estimators, are able to provide a good trade-off between type I error and statistical power.
△ Less
Submitted 16 December, 2020;
originally announced December 2020.
-
Single-participant structural connectivity matrices lead to greater accuracy in classification of participants than function in autism in MRI
Authors:
Matthew Leming,
Simon Baron-Cohen,
John Suckling
Abstract:
In this work, we introduce a technique of deriving symmetric connectivity matrices from regional histograms of grey-matter volume estimated from T1-weighted MRIs. We then validated the technique by inputting the connectivity matrices into a convolutional neural network (CNN) to classify between participants with autism and age-, motion-, and intracranial-volume-matched controls from six different…
▽ More
In this work, we introduce a technique of deriving symmetric connectivity matrices from regional histograms of grey-matter volume estimated from T1-weighted MRIs. We then validated the technique by inputting the connectivity matrices into a convolutional neural network (CNN) to classify between participants with autism and age-, motion-, and intracranial-volume-matched controls from six different databases (29,288 total connectomes, mean age = 30.72, range 0.42-78.00, including 1555 subjects with autism). We compared this method to similar classifications of the same participants using fMRI connectivity matrices as well as univariate estimates of grey-matter volumes. We further applied graph-theoretical metrics on output class activation maps to identify areas of the matrices that the CNN preferentially used to make the classification, focusing particularly on hubs. Our results gave AUROCs of 0.7298 (69.71% accuracy) when classifying by only structural connectivity, 0.6964 (67.72% accuracy) when classifying by only functional connectivity, and 0.7037 (66.43% accuracy) when classifying by univariate grey matter volumes. Combining structural and functional connectivities gave an AUROC of 0.7354 (69.40% accuracy). Graph analysis of class activation maps revealed no distinguishable network patterns for functional inputs, but did reveal localized differences between groups in bilateral Heschl's gyrus and upper vermis for structural connectivity. This work provides a simple means of feature extraction for inputting large numbers of structural MRIs into machine learning models.
△ Less
Submitted 27 May, 2020; v1 submitted 16 May, 2020;
originally announced May 2020.
-
Stochastic encoding of graphs in deep learning allows for complex analysis of gender classification in resting-state and task functional brain networks from the UK Biobank
Authors:
Matthew Leming,
John Suckling
Abstract:
Classification of whole-brain functional connectivity MRI data with convolutional neural networks (CNNs) has shown promise, but the complexity of these models impedes understanding of which aspects of brain activity contribute to classification. While visualization techniques have been developed to interpret CNNs, bias inherent in the method of encoding abstract input data, as well as the natural…
▽ More
Classification of whole-brain functional connectivity MRI data with convolutional neural networks (CNNs) has shown promise, but the complexity of these models impedes understanding of which aspects of brain activity contribute to classification. While visualization techniques have been developed to interpret CNNs, bias inherent in the method of encoding abstract input data, as well as the natural variance of deep learning models, detract from the accuracy of these techniques. We introduce a stochastic encoding method in an ensemble of CNNs to classify functional connectomes by gender. We applied our method to resting-state and task data from the UK BioBank, using two visualization techniques to measure the salience of three brain networks involved in task- and resting-states, and their interaction. To regress confounding factors such as head motion, age, and intracranial volume, we introduced a multivariate balancing algorithm to ensure equal distributions of such covariates between classes in our data. We achieved a final AUROC of 0.8459. We found that resting-state data classifies more accurately than task data, with the inner salience network playing the most important role of the three networks overall in classification of resting-state data and connections to the central executive network in task data.
△ Less
Submitted 27 May, 2020; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Ensemble Deep Learning on Large, Mixed-Site fMRI Datasets in Autism and Other Tasks
Authors:
Matthew Leming,
Juan Manuel Górriz,
John Suckling
Abstract:
Deep learning models for MRI classification face two recurring problems: they are typically limited by low sample size, and are abstracted by their own complexity (the "black box problem"). In this paper, we train a convolutional neural network (CNN) with the largest multi-source, functional MRI (fMRI) connectomic dataset ever compiled, consisting of 43,858 datapoints. We apply this model to a cro…
▽ More
Deep learning models for MRI classification face two recurring problems: they are typically limited by low sample size, and are abstracted by their own complexity (the "black box problem"). In this paper, we train a convolutional neural network (CNN) with the largest multi-source, functional MRI (fMRI) connectomic dataset ever compiled, consisting of 43,858 datapoints. We apply this model to a cross-sectional comparison of autism (ASD) vs typically developing (TD) controls that has proved difficult to characterise with inferential statistics. To contextualise these findings, we additionally perform classifications of gender and task vs rest. Employing class-balancing to build a training set, we trained 3$\times$300 modified CNNs in an ensemble model to classify fMRI connectivity matrices with overall AUROCs of 0.6774, 0.7680, and 0.9222 for ASD vs TD, gender, and task vs rest, respectively. Additionally, we aim to address the black box problem in this context using two visualization methods. First, class activation maps show which functional connections of the brain our models focus on when performing classification. Second, by analyzing maximal activations of the hidden layers, we were also able to explore how the model organizes a large and mixed-centre dataset, finding that it dedicates specific areas of its hidden layers to processing different covariates of data (depending on the independent variable analyzed), and other areas to mix data from different sources. Our study finds that deep learning models that distinguish ASD from TD controls focus broadly on temporal and cerebellar connections, with a particularly high focus on the right caudate nucleus and paracentral sulcus.
△ Less
Submitted 27 May, 2020; v1 submitted 14 February, 2020;
originally announced February 2020.