-
DBCal: Density Based Calibration of classifier predictions for uncertainty quantification
Authors:
Alex Hagen,
Karl Pazdernik,
Nicole LaHaye,
Marjolein Oostrom
Abstract:
Measurement of uncertainty of predictions from machine learning methods is important across scientific domains and applications. We present, to our knowledge, the first such technique that quantifies the uncertainty of predictions from a classifier and accounts for both the classifier's belief and performance. We prove that our method provides an accurate estimate of the probability that the outputs of two neural networks are correct by showing an expected calibration error of less than 0.2% on a binary classifier, and less than 3% on a semantic segmentation network with extreme class imbalance. We empirically show that the uncertainty returned by our method is an accurate measurement of the probability that the classifier's prediction is correct and therefore has broad utility in uncertainty propagation.
Submitted 31 March, 2022;
originally announced April 2022.
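The DBCal abstract above reports its results as an expected calibration error (ECE). As a rough illustration of that metric only, and not of the DBCal method itself, the following sketch computes the standard equal-width binned ECE. The function name, the choice of 10 bins, and the input format (a confidence per prediction plus a 0/1 correctness flag) are assumptions made for this example.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; not taken from the DBCal paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins, so a confidence of exactly 1.0 falls in the last bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            # Weight each bin's gap by the fraction of predictions it holds.
            ece += mask.mean() * gap
    return ece


if __name__ == "__main__":
    # Ten predictions at 90% confidence, nine of them correct.
    conf = [0.9] * 10
    hits = [1] * 9 + [0]
    print(expected_calibration_error(conf, hits))
```

A classifier whose stated confidence matches its observed accuracy in every bin gets an ECE near zero, which is the sense in which the abstract's figures of less than 0.2% and less than 3% indicate good calibration.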
-
Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance
Authors:
Alex Hagen,
Shane Jackson,
James Kahn,
Jan Strube,
Isabel Haide,
Karl Pazdernik,
Connor Hainje
Abstract:
Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the increase of computing power have increased interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a similar manner to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score, we provide an algorithm for computation of ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions, and we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare to other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and can be calculated in a fast and efficient manner using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.
Submitted 25 June, 2021;
originally announced June 2021.
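To give a feel for the kind of statistic the ddKS abstract above describes, here is a brute-force sketch of a Fasano-style d-dimensional two-sample distance: every observed point is used as an origin, space is split into 2^d orthants around it, and the statistic is the largest difference in the fraction of each sample falling in any orthant. This is an assumption-laden illustration, not the accelerated or approximate algorithms from the paper and not the API of the `ddks` package; the function names and the strict-inequality handling of ties are choices made for this example.

```python
import numpy as np


def orthant_fractions(origins, points):
    """Fraction of `points` in each of the 2^d orthants around each origin."""
    d = origins.shape[1]
    # greater[i, j, k] is True when points[j] exceeds origins[i] in dimension k.
    greater = points[None, :, :] > origins[:, None, :]
    # Encode each True/False pattern as an integer orthant index in [0, 2^d).
    codes = (greater * (1 << np.arange(d))).sum(axis=2)
    return np.stack([(codes == o).mean(axis=1) for o in range(2 ** d)], axis=1)


def naive_ddks(sample_a, sample_b):
    """Largest orthant-fraction gap over all origins (hypothetical helper)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    origins = np.vstack([a, b])  # use every observed point as an origin
    gaps = np.abs(orthant_fractions(origins, a) - orthant_fractions(origins, b))
    return float(gaps.max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    same = rng.normal(size=(200, 3))
    shifted = rng.normal(loc=1.0, size=(200, 3))
    print(naive_ddks(same, same), naive_ddks(same, shifted))
```

This version builds an array of size (origins x points x d), so its cost grows quadratically with sample size, which is the scaling the paper's parallel and approximate methods are meant to avoid.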
-
Estimating Basis Functions in Massive Fields under the Spatial Mixed Effects Model
Authors:
Karl T. Pazdernik,
Ranjan Maitra
Abstract:
Spatial prediction is commonly achieved under the assumption of a Gaussian random field (GRF) by obtaining maximum likelihood estimates of parameters, and then using the kriging equations to arrive at predicted values. For massive datasets, fixed rank kriging using the Expectation-Maximization (EM) algorithm for estimation has been proposed as an alternative to the usual but computationally prohibitive kriging method. The method reduces computation cost of estimation by redefining the spatial process as a linear combination of basis functions and spatial random effects. A disadvantage of this method is that it imposes constraints on the relationship between the observed locations and the knots. We develop an alternative method that utilizes the Spatial Mixed Effects (SME) model, but allows for additional flexibility by estimating the range of the spatial dependence between the observations and the knots via an Alternating Expectation Conditional Maximization (AECM) algorithm. Experiments show that our methodology improves estimation without sacrificing prediction accuracy while also minimizing the additional computational burden of extra parameter estimation. The methodology is applied to a temperature data set archived by the United States National Climate Data Center, with improved results over previous methodology.
Submitted 12 March, 2020;
originally announced March 2020.
-
Reduced Basis Kriging for Big Spatial Fields
Authors:
Karl T. Pazdernik,
Ranjan Maitra,
Douglas Nychka,
Stephen Sain
Abstract:
In spatial statistics, a common method for prediction over a Gaussian random field (GRF) is maximum likelihood estimation combined with kriging. For massive data sets, kriging is computationally intensive, both in terms of CPU time and memory, and so fixed rank kriging has been proposed as a solution. The method, however, still involves operations on large matrices, so we develop an alteration to this method by utilizing the approximations made in fixed rank kriging combined with restricted maximum likelihood estimation and sparse matrix methodology. Experiments show that our methodology can provide additional gains in computational efficiency over fixed rank kriging without loss of accuracy in prediction. The methodology is applied to climate data archived by the United States National Climate Data Center, with very good results.
Submitted 25 April, 2018; v1 submitted 22 February, 2018;
originally announced March 2018.