-
The CAST package for training and assessment of spatial prediction models in R
Authors:
Hanna Meyer,
Marvin Ludwig,
Carles Milà,
Jan Linnenbrink,
Fabian Schumacher
Abstract:
One key task in environmental science is to map environmental variables continuously in space or even in space and time. Machine learning algorithms are frequently used to learn from local field observations to make spatial predictions by estimating the value of the variable of interest in places where it has not been measured. However, the application of machine learning strategies for spatial mapping involves additional challenges compared to "non-spatial" prediction tasks. These challenges often originate from spatial autocorrelation and from training data that are not independent and identically distributed.
In the past few years, we have developed a number of methods to support the application of machine learning to spatial data, including suitable cross-validation strategies for performance assessment and model selection, spatial feature selection, and methods to assess the area of applicability of the trained models. The intention of the CAST package is to support the application of machine learning strategies for predictive mapping by implementing such methods and making them available for easy integration into modelling workflows.
Here we introduce the CAST package and its core functionalities. Using a case study of mapping plant species richness, we go through the different steps of the modelling workflow and show how CAST can be used to support more reliable spatial predictions.
Submitted 10 April, 2024;
originally announced April 2024.
-
Reference and Probability-Matching Priors for the Parameters of a Univariate Student $t$-Distribution
Authors:
A. J. van der Merwe,
M. J. von Maltitz,
J. H. Meyer
Abstract:
In this paper, reference and probability-matching priors are derived for the univariate Student $t$-distribution. These priors generally lead to procedures with properties frequentists can relate to while still retaining Bayes validity. The priors are tested by performing simulation studies. The focus is on the relative mean squared error from the posterior median ($MSE(ν)/ν$) and on the frequentist coverage of the 95% credibility intervals for a sample size of $n=30$. Average interval lengths of the credibility intervals as well as the modes of the interval lengths based on 2000 simulations are also considered. The performance of the priors is also tested on real data, namely daily logarithmic returns of IBM stocks.
Submitted 15 April, 2021;
originally announced April 2021.
-
Surprisal-Triggered Conditional Computation with Neural Networks
Authors:
Loren Lugosch,
Derek Nowrouzezahrai,
Brett H. Meyer
Abstract:
Autoregressive neural network models have been used successfully for sequence generation, feature extraction, and hypothesis scoring. This paper presents yet another use for these models: allocating more computation to more difficult inputs. In our model, an autoregressive model is used both to extract features and to predict observations in a stream of input observations. The surprisal of the input, measured as the negative log-likelihood of the current observation according to the autoregressive model, is used as a measure of input difficulty. This in turn determines whether a small, fast network, or a big, slow network, is used. Experiments on two speech recognition tasks show that our model can match the performance of a baseline in which the big network is always used with 15% fewer FLOPs.
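The routing rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example probabilities, and the threshold value of 1.0 nats are all assumptions chosen for the example, and the autoregressive model is replaced by a list of given probabilities.

```python
import math


def surprisal(prob: float) -> float:
    """Negative log-likelihood (in nats) of an observation with probability `prob`."""
    return -math.log(prob)


def choose_network(prob: float, threshold: float) -> str:
    """Route a surprising (difficult) input to the big network, otherwise to the small one."""
    return "big" if surprisal(prob) > threshold else "small"


# Hypothetical probabilities that an autoregressive model assigned to each observation.
probs = [0.9, 0.5, 0.05, 0.7]
THRESHOLD = 1.0  # assumed value for illustration only

routes = [choose_network(p, THRESHOLD) for p in probs]
print(routes)
```

Only the third observation, which the model considered unlikely, exceeds the threshold and is sent to the big network.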
Submitted 2 June, 2020;
originally announced June 2020.
-
Predicting into unknown space? Estimating the area of applicability of spatial prediction models
Authors:
Hanna Meyer,
Edzer Pebesma
Abstract:
Predictive modelling using machine learning has become very popular for spatial mapping of the environment. Models are often applied to make predictions far beyond sampling locations where new geographic locations might considerably differ from the training data in their environmental properties. However, areas in the predictor space without support of training data are problematic. Since the model has no knowledge about these environments, predictions have to be considered uncertain.
It is therefore necessary to estimate the area to which a prediction model can be reliably applied. Here, we suggest a methodology that delineates the "area of applicability" (AOA), which we define as the area for which the cross-validation error of the model applies. We first propose a "dissimilarity index" (DI) that is based on the minimum distance to the training data in the predictor space, with predictors being weighted by their respective importance in the model. The AOA is then derived by applying a threshold based on the DI of the training data, where the DI is calculated with respect to the cross-validation strategy used for model training. We test for the ideal threshold by using simulated data and compare the prediction error within the AOA with the cross-validation error of the model. We illustrate the approach using a simulated case study.
Our simulation study suggests a threshold on DI to define the AOA at the .95 quantile of the DI in the training data. Using this threshold, the prediction error within the AOA is comparable to the cross-validation RMSE of the model, while the cross-validation error does not apply outside the AOA. This applies to models being trained with randomly distributed training data, as well as when training data are clustered in space and where spatial cross-validation is applied.
We suggest reporting the AOA alongside predictions, complementary to validation measures.
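The DI and AOA steps described in this abstract can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the published algorithm: the toy training points, fold assignment, and importance weights are invented, the distance is a plain importance-weighted Euclidean distance, and any predictor standardization or distance normalization the full method may apply is omitted.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance in predictor space after scaling each predictor by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))


def min_distance(point, references, weights):
    """Distance from `point` to its nearest neighbour among `references`."""
    return min(weighted_distance(point, ref, weights) for ref in references)


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower, upper = math.floor(pos), math.ceil(pos)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Hypothetical training data: two predictors, two cross-validation folds.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
folds = [0, 0, 1, 1]
weights = (1.0, 0.5)  # assumed variable importances

# DI of each training point: distance to the nearest training point in a different fold.
train_di = []
for i, point in enumerate(train):
    other_folds = [p for j, p in enumerate(train) if folds[j] != folds[i]]
    train_di.append(min_distance(point, other_folds, weights))

# Threshold at the .95 quantile of the training DI, as suggested in the abstract.
threshold = quantile(train_di, 0.95)


def inside_aoa(point):
    """A new location is inside the AOA if its DI does not exceed the threshold."""
    return min_distance(point, train, weights) <= threshold


print(threshold, inside_aoa((0.1, 0.2)), inside_aoa((5.0, 5.0)))
```

A location close to the training data in predictor space falls inside the AOA, while one far from all training points falls outside it.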
Submitted 16 May, 2020;
originally announced May 2020.
-
Importance of spatial predictor variable selection in machine learning applications -- Moving from data reproduction to spatial prediction
Authors:
Hanna Meyer,
Christoph Reudenbach,
Stephan Wöllauer,
Thomas Nauss
Abstract:
Machine learning algorithms find frequent application in spatial prediction of biotic and abiotic environmental variables. However, the characteristics of spatial data, especially spatial autocorrelation, are widely ignored. We hypothesize that this is problematic and results in models that can reproduce training data but are unable to make spatial predictions beyond the locations of the training samples. We assume that not only spatial validation strategies but also spatial variable selection is essential for reliable spatial predictions. We introduce two case studies that use remote sensing to predict land cover and the leaf area index for the "Marburg Open Forest", an open research and education site of Marburg University, Germany. We use the machine learning algorithm Random Forests to train models using non-spatial and spatial cross-validation strategies to understand how spatial variable selection affects the predictions. Our findings confirm that spatial cross-validation is essential in preventing overoptimistic model performance. We further show that highly autocorrelated predictors (such as geolocation variables, e.g. latitude, longitude) can lead to considerable overfitting and result in models that can reproduce the training data but fail in making spatial predictions. The problem becomes apparent in the visual assessment of the spatial predictions that show clear artefacts that can be traced back to a misinterpretation of the spatially autocorrelated predictors by the algorithm. Spatial variable selection could automatically detect and remove such variables that lead to overfitting, resulting in reliable spatial prediction patterns and improved statistical spatial model performance. We conclude that in addition to spatial validation, a spatial variable selection must be considered in spatial predictions of ecological data to produce reliable predictions.
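The difference between random and spatial cross-validation can be illustrated with a small fold-construction sketch. The abstract does not say how the spatial folds were built, so the grouping by a location label below is an assumption, and the function name and example labels are hypothetical.

```python
from collections import defaultdict


def leave_location_out_folds(location_ids):
    """Build (train_indices, test_indices) pairs that hold out one whole location at a time."""
    by_location = defaultdict(list)
    for index, location in enumerate(location_ids):
        by_location[location].append(index)

    folds = []
    for test_indices in by_location.values():
        held_out = set(test_indices)
        train_indices = [i for i in range(len(location_ids)) if i not in held_out]
        folds.append((train_indices, test_indices))
    return folds


# Hypothetical samples, each labelled with the spatial unit it was collected in.
locations = ["A", "A", "B", "B", "B", "C"]
folds = leave_location_out_folds(locations)
for train_indices, test_indices in folds:
    print("train:", train_indices, "test:", test_indices)
```

Because every sample from a held-out location is excluded from training, spatially autocorrelated neighbours cannot leak into the test set, which is what makes the resulting error estimate less optimistic than a random split.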
Submitted 21 August, 2019;
originally announced August 2019.
-
Learning Recurrent Binary/Ternary Weights
Authors:
Arash Ardakani,
Zhengyun Ji,
Sean C. Smithson,
Brett H. Meyer,
Warren J. Gross
Abstract:
Recurrent neural networks (RNNs) have shown excellent performance in processing sequence data. However, they are both complex and memory intensive due to their recursive nature. These limitations make RNNs difficult to embed on mobile devices requiring real-time processes with limited hardware resources. To address the above issues, we introduce a method that can learn binary and ternary weights during the training phase to facilitate hardware implementations of RNNs. As a result, using this approach replaces all multiply-accumulate operations by simple accumulations, bringing significant benefits to custom hardware in terms of silicon area and power consumption. On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. On the hardware side, we present custom hardware for accelerating the recurrent computations of LSTMs with binary/ternary weights. Ultimately, we show that LSTMs with binary/ternary weights can achieve up to 12x memory saving and 10x inference speedup compared to the full-precision implementation on an ASIC platform.
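The claim that ternary weights turn multiply-accumulate operations into simple accumulations can be illustrated with a dot product. This sketch only shows the inference-time arithmetic for weights in {-1, 0, +1}; it does not reproduce the training method from the paper, and the function name and example values are assumptions.

```python
def ternary_dot(weights, inputs):
    """Dot product for weights in {-1, 0, +1} using only additions and subtractions."""
    acc = 0.0
    for w, x in zip(weights, inputs):
        if w == 1:
            acc += x  # +1 weight: accumulate the input
        elif w == -1:
            acc -= x  # -1 weight: accumulate the negated input
        # 0 weight: the input is skipped entirely
    return acc


# Hypothetical ternary weights and activations.
weights = [1, 0, -1, 1]
inputs = [0.5, 2.0, 1.5, -1.0]

result = ternary_dot(weights, inputs)
reference = sum(w * x for w, x in zip(weights, inputs))  # ordinary multiply-accumulate
print(result, reference)
```

Both computations give the same value, but the ternary version never multiplies, which is the property that benefits custom hardware in terms of silicon area and power consumption.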
Submitted 24 January, 2019; v1 submitted 28 September, 2018;
originally announced September 2018.
-
Hyperspectral Data Analysis in R: the hsdar Package
Authors:
Lukas W. Lehnert,
Hanna Meyer,
Wolfgang A. Obermeier,
Brenner Silva,
Bianca Regeling,
Jörg Bendix
Abstract:
Hyperspectral remote sensing is a promising tool for a variety of applications including ecology, geology, analytical chemistry and medical research. This article presents the new hsdar package for R statistical software, which performs a variety of analysis steps taken during a typical hyperspectral remote sensing approach. The package introduces a new class for efficiently storing large hyperspectral datasets such as hyperspectral cubes within R. The package includes several important hyperspectral analysis tools, such as continuum removal and normalized ratio indices, and integrates two widely used radiation transfer models. In addition, the package provides methods to directly use the functionality of the caret package for machine learning tasks. Two case studies demonstrate the package's range of functionality: first, plant leaf chlorophyll content is estimated, and second, cancer in the human larynx is detected from hyperspectral data.
Submitted 14 May, 2018;
originally announced May 2018.
-
Three-dimensional Cardiovascular Imaging-Genetics: A Mass Univariate Framework
Authors:
Carlo Biffi,
Antonio de Marvao,
Mark I. Attard,
Timothy J. W. Dawes,
Nicola Whiffin,
Wenjia Bai,
Wenzhe Shi,
Catherine Francis,
Hannah Meyer,
Rachel Buchan,
Stuart A. Cook,
Daniel Rueckert,
Declan P. O'Regan
Abstract:
MOTIVATION: Left ventricular (LV) hypertrophy is a strong predictor of cardiovascular outcomes, but its genetic regulation remains largely unexplained. Conventional phenotyping relies on manual calculation of LV mass and wall thickness, but advanced cardiac image analysis presents an opportunity for high-throughput mapping of genotype-phenotype associations in three dimensions (3D). RESULTS: High-resolution cardiac magnetic resonance images were automatically segmented in 1,124 healthy volunteers to create a 3D shape model of the heart. Mass univariate regression was used to plot a 3D effect-size map for the association between wall thickness and a set of predictors at each vertex in the mesh. The vertices where a significant effect exists were determined by applying threshold-free cluster enhancement to boost areas of signal with spatial contiguity. Experiments on simulated phenotypic signals and SNP replication show that this approach offers a substantial gain in statistical power for cardiac genotype-phenotype associations while providing good control of the false discovery rate. This framework models the effects of genetic variation throughout the heart and can be automatically applied to large population cohorts. AVAILABILITY: The proposed approach has been coded in an R package freely available at https://doi.org/10.5281/zenodo.834610 together with the clinical data used in this work.
Submitted 13 September, 2017; v1 submitted 22 June, 2017;
originally announced June 2017.