-
A Parameter-efficient Multi-subject Model for Predicting fMRI Activity
Authors:
Connor Lane,
Gregory Kiar
Abstract:
This is the Algonauts 2023 submission report for team "BlobGPT". Our model consists of a multi-subject linear encoding head attached to a pretrained trunk model. The multi-subject head consists of three components: (1) a shared multi-layer feature projection, (2) shared plus subject-specific low-dimension linear transformations, and (3) a shared PCA fMRI embedding. In this report, we explain these components in more detail and present some experimental results. Our code is available at https://github.com/cmi-dair/algonauts23.
Submitted 4 August, 2023;
originally announced August 2023.
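The three head components named in the abstract above can be pictured with a minimal NumPy sketch. This is only one possible reading: the class name, the shapes, the random initialisation, and the way the pieces are combined are all assumptions made for illustration, not details taken from the report or its repository.

```python
import numpy as np


class MultiSubjectHead:
    """Hypothetical linear encoding head; every name and shape is illustrative."""

    def __init__(self, layer_dims, latent_dim, pca_dim, n_voxels, n_subjects, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(scale=0.01, size=shape)

        # (1) shared projection: one matrix per trunk layer, shared by all subjects
        self.layer_proj = [init(d, latent_dim) for d in layer_dims]
        # (2) shared plus subject-specific low-dimensional linear maps
        self.shared = init(latent_dim, pca_dim)
        self.subject = init(n_subjects, latent_dim, pca_dim)
        # (3) shared PCA embedding of fMRI activity; random here, whereas a real
        #     model would presumably fit these components on training responses
        self.pca_components = init(pca_dim, n_voxels)
        self.pca_mean = np.zeros(n_voxels)

    def __call__(self, layer_feats, subject_id):
        # Project each layer's features into the common latent space and sum them.
        z = sum(f @ p for f, p in zip(layer_feats, self.layer_proj))
        # Add the subject-specific offset to the shared transformation.
        w = self.shared + self.subject[subject_id]
        # Decode PCA coefficients back to voxel space.
        return z @ w @ self.pca_components + self.pca_mean
```

Under these assumptions, the per-subject parameter cost is a single `latent_dim × pca_dim` matrix, which is one way a head of this kind could stay parameter-efficient as subjects are added.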
-
A numerical variability approach to results stability tests and its application to neuroimaging
Authors:
Yohan Chatelain,
Loïc Tetrel,
Christopher J. Markiewicz,
Mathias Goncalves,
Gregory Kiar,
Oscar Esteban,
Pierre Bellec,
Tristan Glatard
Abstract:
Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calculations. By applying the resulting stability test to fMRIPrep, a widely-used neuroimaging tool, we show that the test is sensitive enough to detect subtle updates in image processing methods while remaining specific enough to accept numerical variations within a reference version of the application. This result contributes to enhancing the reliability and reproducibility of data analyses by providing a robust and flexible method for stability testing.
Submitted 10 July, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Numerical Stability of DeepGOPlus Inference
Authors:
Inés Gonzalez Pepe,
Yohan Chatelain,
Gregory Kiar,
Tristan Glatard
Abstract:
Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data that has a spatial relationship, not only images, and have been applied to different fields. However, recent works have highlighted numerical stability challenges in DNNs, which also relate to their known sensitivity to noise injection. These challenges can jeopardise their performance and reliability. This paper investigates DeepGOPlus, a CNN that predicts protein function. DeepGOPlus has achieved state-of-the-art performance and can successfully take advantage of and annotate the abundant protein sequences emerging in proteomics. We determine the numerical stability of the model's inference stage by quantifying the numerical uncertainty resulting from perturbations of the underlying floating-point data. In addition, we explore the opportunity to use reduced-precision floating-point formats for DeepGOPlus inference, to reduce memory consumption and latency. This is achieved by instrumenting DeepGOPlus' execution using Monte Carlo Arithmetic, a technique that experimentally quantifies floating-point operation errors, and VPREC, a tool that emulates results with customizable floating-point precision formats. Focus is placed on the inference stage as it is the primary deliverable of the DeepGOPlus model, widely applicable across different environments. All in all, our results show that although the DeepGOPlus CNN is very stable numerically, it can only be selectively implemented with lower-precision floating-point formats. We conclude that predictions obtained from the pre-trained DeepGOPlus model are very reliable numerically, and use existing floating-point formats efficiently.
Submitted 28 February, 2024; v1 submitted 12 December, 2022;
originally announced December 2022.
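The idea behind Monte Carlo Arithmetic mentioned in the abstract above can be illustrated with a small, self-contained sketch. It is not the instrumentation used in the paper: the function names and the default virtual precision `t` are hypothetical, and only the general recipe is shown. Each result is perturbed by uniform noise scaled to its own exponent, the computation is repeated, and the spread of the outcomes is summarised as an estimated number of significant decimal digits.

```python
import math
import random
import statistics


def mca_perturb(x, t=24, rng=random):
    """Return x plus uniform noise at a virtual precision of t bits (illustrative)."""
    _, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    xi = rng.uniform(-0.5, 0.5)       # random relative error
    return x + math.ldexp(xi, e - t)  # noise magnitude is at most 2**(e - t - 1)


def significant_digits(samples):
    """Estimate significant decimal digits as -log10(sigma / |mu|).

    Assumes the sample mean is non-zero.
    """
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    if sigma == 0.0:
        return float("inf")
    return -math.log10(abs(sigma / mu))


def noisy_cancellation(a, b, t=24, rng=random):
    """Compute (a + b) - a, perturbing the result of each operation."""
    s = mca_perturb(a + b, t, rng)
    return mca_perturb(s - a, t, rng)
```

Calling `noisy_cancellation(1.0, 1e-6)` a few hundred times and passing the results to `significant_digits` shows how a subtraction of nearly equal values loses most of the digits that the virtual precision would otherwise provide.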
-
Pipeline-Invariant Representation Learning for Neuroimaging
Authors:
Xinhui Li,
Alex Fedorov,
Mrinal Mathur,
Anees Abrol,
Gregory Kiar,
Sergey Plis,
Vince Calhoun
Abstract:
Deep learning has been widely applied in neuroimaging, including predicting brain-phenotype relationships from magnetic resonance imaging (MRI) volumes. MRI data usually requires extensive preprocessing prior to modeling, but variation introduced by different MRI preprocessing pipelines may lead to different scientific findings, even when using identical data. Motivated by the data-centric perspective, we first evaluate how preprocessing pipeline selection can impact the downstream performance of a supervised learning model. We next propose two pipeline-invariant representation learning methodologies, MPSL and PXL, to improve robustness in classification performance and to capture similar neural network representations. Using 2000 human subjects from the UK Biobank dataset, we demonstrate that the proposed models present unique and shared advantages, in particular that MPSL can be used to improve out-of-sample generalization to new pipelines, while PXL can be used to improve within-sample prediction performance. Both MPSL and PXL can learn more similar between-pipeline representations. These results suggest that our proposed models can be applied to mitigate pipeline-related biases, and to improve prediction robustness in brain-phenotype modeling.
Submitted 15 October, 2023; v1 submitted 26 August, 2022;
originally announced August 2022.
-
PyTracer: Automatically profiling numerical instabilities in Python
Authors:
Yohan Chatelain,
Nigel Yong,
Gregory Kiar,
Tristan Glatard
Abstract:
Numerical stability is a crucial requirement of reliable scientific computing. However, despite the pervasiveness of Python in data science, analyzing large Python programs remains challenging due to the lack of scalable numerical analysis tools available for this language. To fill this gap, we developed PyTracer, a profiler to quantify numerical instability in Python applications. PyTracer transparently instruments Python code to produce numerical traces and visualize them interactively in a Plotly dashboard. We designed PyTracer to be agnostic to numerical noise model, allowing for tool evaluation through Monte-Carlo Arithmetic, random rounding, random data perturbation, or structured noise for a particular application. We illustrate PyTracer's capabilities by testing the numerical stability of key functions in both SciPy and Scikit-learn, two dominant Python libraries for mathematical modeling. Through these evaluations, we demonstrate PyTracer as a scalable, automatic, and generic framework for numerical profiling in Python.
Submitted 8 February, 2022; v1 submitted 21 December, 2021;
originally announced December 2021.
-
Data Augmentation Through Monte Carlo Arithmetic Leads to More Generalizable Classification in Connectomics
Authors:
Gregory Kiar,
Yohan Chatelain,
Ali Salari,
Alan C. Evans,
Tristan Glatard
Abstract:
Machine learning models are commonly applied to human brain imaging datasets in an effort to associate function or structure with behaviour, health, or other individual phenotypes. Such models often rely on low-dimensional maps generated by complex processing pipelines. However, the numerical instabilities inherent to pipelines limit the fidelity of these maps and introduce computational bias. Monte Carlo Arithmetic, a technique for introducing controlled amounts of numerical noise, was used to perturb a structural connectome estimation pipeline, ultimately producing a range of plausible networks for each sample. The variability in the perturbed networks was captured in an augmented dataset, which was then used for an age classification task. We found that resampling brain networks across a series of such numerically perturbed outcomes led to improved performance in all tested classifiers, preprocessing strategies, and dimensionality reduction techniques. Importantly, we find that this benefit does not hinge on a large number of perturbations, suggesting that even minimally perturbing a dataset adds meaningful variance which can be captured in the subsequently designed models.
Submitted 20 September, 2021;
originally announced September 2021.
-
A Recommender System for Scientific Datasets and Analysis Pipelines
Authors:
Mandana Mazaheri,
Gregory Kiar,
Tristan Glatard
Abstract:
Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resources. We investigate the feasibility of a collaborative filtering system to recommend pipelines and datasets based on provenance records from previous executions. We evaluate our system using datasets and pipelines extracted from the Canadian Open Neuroscience Platform, a national initiative for open neuroscience. The recommendations provided by our system (AUC$=0.83$) are significantly better than chance and outperform recommendations made by domain experts using their previous knowledge as well as pipeline and dataset descriptions (AUC$=0.63$). In particular, domain experts often neglect low-level technical aspects of a pipeline-dataset interaction, such as the level of pre-processing, which are captured by a provenance-based system. We conclude that provenance-based pipeline and dataset recommenders are feasible and beneficial to the sharing and usage of open-science resources. Future work will focus on the collection of more comprehensive provenance traces, and on deploying the system in production.
Submitted 20 August, 2021;
originally announced August 2021.
-
Reducing numerical precision preserves classification accuracy in Mondrian Forests
Authors:
Marc Vicuna,
Martin Khannouz,
Gregory Kiar,
Yohan Chatelain,
Tristan Glatard
Abstract:
Mondrian Forests are a powerful data stream classification method, but their large memory footprint makes them ill-suited for low-resource platforms such as connected objects. We explored using reduced-precision floating-point representations to lower memory consumption and evaluated its effect on classification performance. We applied the Mondrian Forest implementation provided by OrpailleCC, a C++ collection of data stream algorithms, to two canonical datasets in human activity recognition: Recofit and Banos et al. Results show that the precision of floating-point values used by tree nodes can be reduced from 64 bits to 8 bits with no significant difference in F1 score. In some cases, reduced precision was shown to improve classification performance, presumably due to its regularization effect. We conclude that numerical precision is a relevant hyperparameter in the Mondrian Forest, and that commonly-used double precision values may not be necessary for optimal performance. Future work will evaluate the generalizability of these findings to other data stream classifiers.
Submitted 27 June, 2021;
originally announced June 2021.
-
Deploying large fixed file datasets with SquashFS and Singularity
Authors:
Pierre Rioux,
Gregory Kiar,
Alexandre Hutton,
Alan C. Evans,
Shawn T. Brown
Abstract:
Shared high-performance computing (HPC) platforms, such as those provided by XSEDE and Compute Canada, enable researchers to carry out large-scale computational experiments at a fraction of the cost of the cloud. Most systems require the use of distributed filesystems (e.g. Lustre) for providing a highly multi-user, large capacity storage environment. These suffer performance penalties as the number of files increases due to network contention and metadata performance. We demonstrate how a combination of two technologies, Singularity and SquashFS, can help developers, integrators, architects, and scientists deploy large datasets (O(10M) files) on these shared systems with minimal performance limitations. The proposed integration enables more efficient access and indexing than normal file-based dataset installations, while providing transparent file access to users and processes. Furthermore, the approach does not require administrative privileges on the target system. While the examples studied here have been taken from the field of neuroimaging, the technologies adopted are not specific to that field. Currently, this solution is limited to read-only datasets. We propose the adoption of this technology for the consumption and dissemination of community datasets across shared computing resources.
Submitted 14 February, 2020;
originally announced February 2020.
-
A Serverless Tool for Platform Agnostic Computational Experiment Management
Authors:
Gregory Kiar,
Shawn T Brown,
Tristan Glatard,
Alan C Evans
Abstract:
Neuroscience has been carried into the domain of big data and high performance computing (HPC) on the backs of initiatives in data collection and increasingly compute-intensive tools. While managing HPC experiments requires considerable technical acumen, platforms and standards have been developed to ease this burden on scientists. While web-portals make resources widely accessible, data organizations such as the Brain Imaging Data Structure and tool description languages such as Boutiques provide researchers with a foothold to tackle these problems using their own datasets, pipelines, and environments. While these standards lower the barrier to adoption of HPC and cloud systems for neuroscience applications, they still require the consolidation of disparate domain-specific knowledge. We present Clowdr, a lightweight tool to launch experiments on HPC systems and clouds, record rich execution records, and enable the accessible sharing of experimental summaries and results. Clowdr uniquely sits between web platforms and bare-metal applications for experiment management by preserving the flexibility of do-it-yourself solutions while providing a low barrier for developing, deploying and disseminating neuroscientific analysis.
Submitted 2 September, 2018;
originally announced September 2018.
-
Boutiques: a flexible framework for automated application integration in computing platforms
Authors:
Tristan Glatard,
Gregory Kiar,
Tristan Aumentado-Armstrong,
Natacha Beck,
Pierre Bellec,
Rémi Bernard,
Axel Bonnet,
Sorina Camarasu-Pop,
Frédéric Cervenansky,
Samir Das,
Rafael Ferreira da Silva,
Guillaume Flandin,
Pascal Girard,
Krzysztof J. Gorgolewski,
Charles R. G. Guttmann,
Valérie Hayot-Sasson,
Pierre-Olivier Quirion,
Pierre Rioux,
Marc-Etienne Rousseau,
Alan C. Evans
Abstract:
We present Boutiques, a system to automatically publish, integrate and execute applications across computational platforms. Boutiques applications are installed through software containers described in a rich and flexible JSON language. A set of core tools facilitate the construction, validation, import, execution, and publishing of applications. Boutiques is currently supported by several distinct virtual research platforms, and it has been used to describe dozens of applications in the neuroinformatics domain. We expect Boutiques to improve the quality of application integration in computational platforms, to reduce redundancy of effort, to contribute to computational reproducibility, and to foster Open Science.
Submitted 7 November, 2017;
originally announced November 2017.