-
Tensor-variate Gaussian process regression for efficient emulation of complex systems: comparing regressor and covariance structures in outer product and parallel partial emulators
Authors:
Daria Semochkina,
Samuel E. Jackson,
David C. Woods
Abstract:
Multi-output Gaussian process regression has become an important tool in uncertainty quantification, for building emulators of computationally expensive simulators, and other areas such as multi-task machine learning. We present a holistic development of tensor-variate Gaussian process (TvGP) regression, appropriate for arbitrary dimensional outputs where a Kronecker product structure is appropriate for the covariance. We show how two common approaches to problems with two-dimensional output, outer product emulators (OPE) and parallel partial emulators (PPE), are special cases of TvGP regression and hence can be extended to higher output dimensions. Focusing on the important special case of matrix output, we investigate the relative performance of these two approaches. The key distinction is the additional dependence structure assumed by the OPE, and we demonstrate when this is advantageous through two case studies, including application to a spatial-temporal influenza simulator.
Submitted 14 February, 2025;
originally announced February 2025.
-
Bayes Linear Analysis for Statistical Modelling with Uncertain Inputs
Authors:
Samuel E. Jackson,
David C. Woods
Abstract:
Statistical models typically capture uncertainties in our knowledge of the corresponding real-world processes; however, it is less common for this uncertainty specification to capture uncertainty surrounding the values of the inputs to the model, which are often assumed known. We develop general modelling methodology with uncertain inputs in the context of the Bayes linear paradigm, which involves adjustment of second-order belief specifications over all quantities of interest only, without the requirement for probabilistic specifications. In particular, we propose an extension of commonly employed second-order modelling assumptions to the case of uncertain inputs, with explicit implementation in the context of regression analysis, stochastic process modelling, and statistical emulation. We apply the methodology to a regression model for extracting aluminium by electrolysis, and to emulation of the motivating epidemiological simulator chain to model the impact of an airborne infectious disease.
Submitted 9 May, 2023;
originally announced May 2023.
-
Synthesis parameter effect detection using quantitative representations and high dimensional distribution distances
Authors:
Alex Hagen,
Shane Jackson
Abstract:
Detection of effects of the parameters of the synthetic process on the microstructure of materials is an important, yet elusive goal of materials science. We develop a method for detecting effects based on copula theory, high dimensional distribution distances, and permutational statistics to analyze a designed experiment synthesizing plutonium oxide from Pu(III) Oxalate. We detect effects of strike order and oxalic acid feed on the microstructure of the resulting plutonium oxide, which match the literature well. We also detect excess bivariate effects between the pairs of acid concentration, strike order and precipitation temperature.
Submitted 3 April, 2023;
originally announced April 2023.
-
Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance
Authors:
Alex Hagen,
Shane Jackson,
James Kahn,
Jan Strube,
Isabel Haide,
Karl Pazdernik,
Connor Hainje
Abstract:
Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the increase of computing power have increased interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a similar manner to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score, we provide an algorithm for computation of ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions, and we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare to other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and can be calculated in a fast and efficient manner using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.
Submitted 25 June, 2021;
originally announced June 2021.
-
Data Vision: Learning to See Through Algorithmic Abstraction
Authors:
Samir Passi,
Steven J. Jackson
Abstract:
Learning to see through data is central to contemporary forms of algorithmic knowledge production. While often represented as a mechanical application of rules, making algorithms work with data requires a great deal of situated work. This paper examines how the often-divergent demands of mechanization and discretion manifest in data analytic learning environments. Drawing on research in CSCW and the social sciences, and ethnographic fieldwork in two data learning environments, we show how an algorithm's application is seen sometimes as a mechanical sequence of rules and at other times as an array of situated decisions. Casting data analytics as a rule-based (rather than rule-bound) practice, we show that effective data vision requires would-be analysts to straddle the competing demands of formal abstraction and empirical contingency. We conclude by discussing how the notion of data vision can help better leverage the role of human work in data analytic learning, research, and practice.
Submitted 9 February, 2020;
originally announced February 2020.
-
Efficient Emulation of Computer Models Utilising Multiple Known Boundaries of Differing Dimensions
Authors:
Samuel E. Jackson,
Ian Vernon
Abstract:
Emulation has been successfully applied across a wide variety of scientific disciplines for efficiently analysing computationally intensive models. We develop known boundary emulation strategies which utilise the fact that, for many computer models, there exist hyperplanes in the input parameter space for which the model output can be evaluated far more efficiently, whether this be analytically or just significantly faster using a more efficient and simpler numerical solver. The information contained on these known hyperplanes, or boundaries, can be incorporated into the emulation process via analytical update, thus involving no additional computational cost. In this article, we show that such analytical updates are available for multiple boundaries of various dimensions. We subsequently demonstrate which configurations of boundaries such analytical updates are available for, in particular by presenting a set of conditions that such a set of boundaries must satisfy. We demonstrate the powerful computational advantages of the known boundary emulation techniques developed on both an illustrative low-dimensional simulated example and a scientifically relevant and high-dimensional systems biology model of hormonal crosstalk in the roots of an Arabidopsis plant.
Submitted 13 March, 2020; v1 submitted 19 October, 2019;
originally announced October 2019.
-
Bayes Linear Emulation of Simulator Networks
Authors:
Samuel E. Jackson,
David C. Woods
Abstract:
Computationally expensive simulators, implementing mathematical models in computer codes, are commonly approximated using statistical emulators. We develop and assess novel emulation methods for systems best modelled via a chain, series or network of simulators. Using a Bayes linear framework, we link statistical emulators of the component simulators to explicitly account for the simulator input uncertainty induced by links between models in arbitrarily large networks. We demonstrate the advantages of these methods compared to use of a single emulator of the composite simulator network for a variety of examples, including the motivating epidemiological simulator chain to model the impact of an airborne infectious disease.
Submitted 25 August, 2021; v1 submitted 17 October, 2019;
originally announced October 2019.
-
Identifying the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progression Assessment, and Overall Survival Prediction in the BRATS Challenge
Authors:
Spyridon Bakas,
Mauricio Reyes,
Andras Jakab,
Stefan Bauer,
Markus Rempfler,
Alessandro Crimi,
Russell Takeshi Shinohara,
Christoph Berger,
Sung Min Ha,
Martin Rozycki,
Marcel Prastawa,
Esther Alberts,
Jana Lipkova,
John Freymann,
Justin Kirby,
Michel Bilello,
Hassan Fathallah-Shaykh,
Roland Wiest,
Jan Kirschke,
Benedikt Wiestler,
Rivka Colen,
Aikaterini Kotrotsou,
Pamela Lamontagne,
Daniel Marcus,
Mikhail Milchenko,
et al. (402 additional authors not shown)
Abstract:
Gliomas are the most common primary brain malignancies, with different degrees of aggressiveness, variable prognosis and various heterogeneous histologic sub-regions, i.e., peritumoral edematous/invaded tissue, necrotic core, active and non-enhancing core. This intrinsic heterogeneity is also portrayed in their radio-phenotype, as their sub-regions are depicted by varying intensity profiles disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, reflecting varying biological properties. Their heterogeneous shape, extent, and location are some of the factors that make these tumors difficult to resect, and in some cases inoperable. The amount of resected tumor is a factor also considered in longitudinal scans, when evaluating the apparent tumor for potential diagnosis of progression. Furthermore, there is mounting evidence that accurate segmentation of the various tumor sub-regions can offer the basis for quantitative image analysis towards prediction of patient overall survival. This study assesses the state-of-the-art machine learning (ML) methods used for brain tumor image analysis in mpMRI scans, during the last seven instances of the International Brain Tumor Segmentation (BraTS) challenge, i.e., 2012-2018. Specifically, we focus on i) evaluating segmentations of the various glioma sub-regions in pre-operative mpMRI scans, ii) assessing potential tumor progression by virtue of longitudinal growth of tumor sub-regions, beyond use of the RECIST/RANO criteria, and iii) predicting the overall survival from pre-operative mpMRI scans of patients that underwent gross total resection. Finally, we investigate the challenge of identifying the best ML algorithms for each of these tasks, considering that apart from being diverse on each instance of the challenge, the multi-institutional mpMRI BraTS dataset has also been a continuously evolving/growing dataset.
Submitted 23 April, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Known Boundary Emulation of Complex Computer Models
Authors:
Ian Vernon,
Samuel E. Jackson,
Jonathan A. Cumming
Abstract:
Computer models are now widely used across a range of scientific disciplines to describe various complex physical systems; however, to perform full uncertainty quantification we often need to employ emulators. An emulator is a fast statistical construct that mimics the complex computer model, and greatly aids the vastly more computationally intensive uncertainty quantification calculations that a serious scientific analysis often requires. In some cases, the complex model can be solved far more efficiently for certain parameter settings, leading to boundaries or hyperplanes in the input parameter space where the model is essentially known. We show that for a large class of Gaussian process style emulators, multiple boundaries can be formally incorporated into the emulation process, by Bayesian updating of the emulators with respect to the boundaries, for trivial computational cost. The resulting updated emulator equations are given analytically. This leads to emulators that possess increased accuracy across large portions of the input parameter space. We also describe how a user can incorporate such boundaries within standard black box GP emulation packages that are currently available, without altering the core code. Appropriate designs of model runs in the presence of known boundaries are then analysed, with two kinds of general purpose designs proposed. We then apply the improved emulation and design methodology to an important systems biology model of hormonal crosstalk in Arabidopsis thaliana.
Submitted 3 May, 2019; v1 submitted 9 January, 2018;
originally announced January 2018.
-
Understanding Hormonal Crosstalk in Arabidopsis Root Development via Emulation and History Matching
Authors:
Samuel E. Jackson,
Ian Vernon,
Junli Liu,
Keith Lindsey
Abstract:
A major challenge in plant developmental biology is to understand how plant growth is coordinated by interacting hormones and genes. To meet this challenge, it is important to not only use experimental data, but also formulate a mathematical model. For the mathematical model to best describe the true biological system, it is necessary to understand the parameter space of the model, along with the links between the model, the parameter space and experimental observations. We develop sequential history matching methodology, using Bayesian emulation, to gain substantial insight into biological model parameter spaces. This is achieved by finding sets of acceptable parameters in accordance with successive sets of physical observations. These methods are then applied to a complex hormonal crosstalk model for Arabidopsis root growth. In this application, we demonstrate how an initial set of 22 observed trends reduces the volume of the set of acceptable inputs to a proportion of 6.1 x 10^(-7) of the original space. Additional sets of biologically relevant experimental data, each of size 5, reduce the size of this space by a further three and two orders of magnitude respectively. Hence, we provide insight into the constraints placed upon the model structure by, and the biological consequences of, measuring subsets of observations.
Submitted 19 October, 2019; v1 submitted 4 January, 2018;
originally announced January 2018.
-
Polynomial Chaos-based Bayesian Inference of K-Profile Parametrization in a General Circulation Model of the Tropical Pacific
Authors:
Ihab Sraj,
Sarah E. Zedler,
Omar M. Knio,
Charles S. Jackson,
Ibrahim Hoteit
Abstract:
The authors present a Polynomial Chaos (PC)-based Bayesian inference method for quantifying the uncertainties of the K-Profile Parametrization (KPP) within the MIT General Circulation Model (MITgcm) of the tropical Pacific. The inference of the uncertain parameters is based on a Markov Chain Monte Carlo (MCMC) scheme that utilizes a newly formulated test statistic taking into account the different components representing the structures of turbulent mixing on both daily and seasonal timescales in addition to the data quality, and filters for the effects of parameter perturbations over those due to changes in the wind. To avoid the prohibitive computational cost of integrating the MITgcm model at each MCMC iteration, we build a surrogate model for the test statistic using the PC method. To filter out the noise in the model predictions and avoid related convergence issues, we resort to a Basis-Pursuit-DeNoising (BPDN) compressed sensing approach to determine the PC coefficients of a representative surrogate model. The PC surrogate is then used to evaluate the test statistic in the MCMC step for sampling the posterior of the uncertain parameters. Results of the posteriors indicate good agreement with the default values for two parameters of the KPP model, namely the critical bulk and gradient Richardson numbers, while the posteriors of the remaining parameters were barely informative.
Submitted 26 October, 2015;
originally announced October 2015.
-
Network analysis reveals distinct clinical syndromes underlying acute mountain sickness
Authors:
David P Hall,
Ian JC MacCormick,
Alex T Phythian-Adams,
Nina M Rzechorzek,
David Hope-Jones,
Sorrel Cosens,
Stewart Jackson,
Matthew GD Bates,
David J Collier,
David A Hume,
Thomas Freeman,
AA Roger Thompson,
J Kenneth Baillie
Abstract:
Acute mountain sickness (AMS) is a common problem among visitors at high altitude, and may progress to life-threatening pulmonary and cerebral oedema in a minority of cases. International consensus defines AMS as a constellation of subjective, non-specific symptoms. Specifically, headache, sleep disturbance, fatigue and dizziness are given equal diagnostic weighting. Different pathophysiological mechanisms are now thought to underlie headache and sleep disturbance during acute exposure to high altitude. Hence, these symptoms may not belong together as a single syndrome. Using a novel visual analogue scale (VAS), we sought to undertake a systematic exploration of the symptomatology of AMS using an unbiased, data-driven approach originally designed for analysis of gene expression. Symptom scores were collected from 293 subjects during 1110 subject-days at altitudes between 3650 m and 5200 m on Apex expeditions to Bolivia and Kilimanjaro. Three distinct patterns of symptoms were consistently identified. Although fatigue is a ubiquitous finding, sleep disturbance and headache are each commonly reported without the other. The commonest pattern of symptoms was sleep disturbance and fatigue, with little or no headache. In subjects reporting severe headache, 40% did not report sleep disturbance. Sleep disturbance correlates poorly with other symptoms of AMS (Pearson r = 0.31 vs headache). These results challenge the accepted paradigm that AMS is a single disease process and describe at least two distinct syndromes following acute ascent to high altitude. This approach to analysing symptom patterns has potential utility in other clinical syndromes.
Submitted 26 March, 2013;
originally announced March 2013.