-
Accounting for Skill in Trend, Variability, and Autocorrelation Facilitates Better Multi-Model Projections: Application to the AMOC and Temperature Time Series
Authors:
Roman Olson,
Soon-Il An,
Yanan Fan,
Jason P. Evans
Abstract:
We present a novel quasi-Bayesian method to weight multiple dynamical models by their skill at capturing both potentially non-linear trends and first-order autocorrelated variability of the underlying process, and to make weighted probabilistic projections. We validate the method using a suite of one-at-a-time cross-validation experiments involving Atlantic meridional overturning circulation (AMOC), its temperature-based index, as well as Korean summer mean maximum temperature. In these experiments the method tends to exhibit superior skill over a trend-only Bayesian model averaging weighting method in terms of weight assignment and probabilistic forecasts. Specifically, the mean credible interval width and the mean absolute error of the projections tend to improve. We apply the method to the problem of projecting summer mean maximum temperature change over Korea by the end of the 21st century using a multi-model ensemble. Compared to the trend-only method, the new method appreciably sharpens the probability distribution function (pdf) and increases the most likely, median, and mean future warming in Korea. The method is flexible, with a potential to improve forecasts in geosciences and other fields.
Submitted 17 April, 2019; v1 submitted 7 November, 2018;
originally announced November 2018.
-
Relief-Based Feature Selection: Introduction and Review
Authors:
Ryan J. Urbanowicz,
Melissa Meeker,
William LaCava,
Randal S. Olson,
Jason H. Moore
Abstract:
Feature selection plays a critical role in biomedical data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that have gained appeal by striking an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.
Submitted 2 April, 2018; v1 submitted 22 November, 2017;
originally announced November 2017.
-
Considerations of automated machine learning in clinical metabolic profiling: Altered homocysteine plasma concentration associated with metformin exposure
Authors:
Alena Orlenko,
Jason H. Moore,
Patryk Orzechowski,
Randal S. Olson,
Junmei Cairns,
Pedro J. Caraballo,
Richard M. Weinshilboum,
Liewei Wang,
Matthew K. Breitenstein
Abstract:
With the maturation of metabolomics science and proliferation of biobanks, clinical metabolic profiling is an increasingly opportunistic frontier for advancing translational clinical research. Automated Machine Learning (AutoML) approaches provide an exciting opportunity to guide feature selection in agnostic metabolic profiling endeavors, where potentially thousands of independent data points must be evaluated. In previous research, AutoML using high-dimensional data of varying types has been demonstrably robust, outperforming traditional approaches. However, considerations for application in clinical metabolic profiling remain to be evaluated, particularly regarding the robustness of AutoML in identifying and adjusting for common clinical confounders. In this study, we present a focused case study regarding AutoML considerations for using the Tree-based Pipeline Optimization Tool (TPOT) in metabolic profiling of exposure to metformin in a biobank cohort. First, we propose a tandem rank-accuracy measure to guide agnostic feature selection and corresponding threshold determination in clinical metabolic profiling endeavors. Second, while AutoML with default parameters showed a potential lack of sensitivity to low-effect confounding clinical covariates, we demonstrated residual training and adjustment of metabolite features as an easily applicable approach to ensure AutoML adjustment for potential confounding characteristics. Finally, we present increased homocysteine with long-term exposure to metformin as a potentially novel, non-replicated metabolite association suggested by TPOT, an association not identified in parallel clinical metabolic profiling endeavors. While considerations are recommended, including adjustment approaches for clinical confounders, AutoML presents an exciting tool to enhance clinical metabolic profiling and advance translational research endeavors.
Submitted 9 October, 2017;
originally announced October 2017.
-
Data-driven Advice for Applying Machine Learning to Bioinformatics Problems
Authors:
Randal S. Olson,
William La Cava,
Zairah Mustahsan,
Akshay Varik,
Jason H. Moore
Abstract:
As the bioinformatics field grows, it must keep pace not only with new data but with new algorithms. Here we contribute a thorough analysis of 13 state-of-the-art, commonly used machine learning algorithms on a set of 165 publicly available classification problems in order to provide data-driven algorithm recommendations to current researchers. We present a number of statistical and visual comparisons of algorithm performance and quantify the effect of model selection and algorithm tuning for each algorithm and dataset. The analysis culminates in the recommendation of five algorithms with hyperparameters that maximize classifier performance across the tested problems, as well as general guidelines for applying machine learning to supervised classification problems.
Submitted 7 January, 2018; v1 submitted 8 August, 2017;
originally announced August 2017.
-
Toward the automated analysis of complex diseases in genome-wide association studies using genetic programming
Authors:
Andrew Sohn,
Randal S. Olson,
Jason H. Moore
Abstract:
Machine learning has been gaining traction in recent years to meet the demand for tools that can efficiently analyze and make sense of the ever-growing databases of biomedical data in health care systems around the world. However, effectively using machine learning methods requires considerable domain expertise, which can be a barrier to entry for bioinformaticians new to computational data science methods. Therefore, off-the-shelf tools that make machine learning more accessible can prove invaluable for bioinformaticians. To this end, we have developed an open source pipeline optimization tool (TPOT-MDR) that uses genetic programming to automatically design machine learning pipelines for bioinformatics studies. In TPOT-MDR, we implement Multifactor Dimensionality Reduction (MDR) as a feature construction method for modeling higher-order feature interactions, and combine it with a new expert knowledge-guided feature selector for large biomedical data sets. We demonstrate TPOT-MDR's capabilities using a combination of simulated and real-world data sets from human genetics and find that TPOT-MDR significantly outperforms modern machine learning methods such as logistic regression and eXtreme Gradient Boosting (XGBoost). We further analyze the best pipeline discovered by TPOT-MDR for a real-world problem and highlight TPOT-MDR's ability to produce a high-accuracy solution that is also easily interpretable.
Submitted 6 February, 2017;
originally announced February 2017.
-
A composite likelihood approach to computer model calibration using high-dimensional spatial data
Authors:
Won Chang,
Murali Haran,
Roman Olson,
Klaus Keller
Abstract:
Computer models are used to model complex processes in various disciplines. Often, a key source of uncertainty in the behavior of complex computer models is uncertainty due to unknown model input parameters. Statistical computer model calibration is the process of inferring model parameter values, along with associated uncertainties, from observations of the physical process and from model outputs at various parameter settings. Observations and model outputs are often in the form of high-dimensional spatial fields, especially in the environmental sciences. Sound statistical inference may be computationally challenging in such situations. Here we introduce a composite likelihood-based approach to perform computer model calibration with high-dimensional spatial data. While composite likelihood has been studied extensively in the context of spatial statistics, computer model calibration using composite likelihood poses several new challenges. We propose a computationally efficient approach for Bayesian computer model calibration using composite likelihood. We also develop a methodology based on asymptotic theory for adjusting the composite likelihood posterior distribution so that it accurately represents posterior uncertainties. We study the application of our new approach in the context of calibration for a climate model.
Submitted 31 July, 2013;
originally announced August 2013.
-
Fast dimension-reduced climate model calibration and the effect of data aggregation
Authors:
Won Chang,
Murali Haran,
Roman Olson,
Klaus Keller
Abstract:
How will the climate system respond to anthropogenic forcings? One approach to this question relies on climate model projections. Current climate projections are considerably uncertain. Characterizing and, if possible, reducing this uncertainty is an area of ongoing research. We consider the problem of making projections of the North Atlantic meridional overturning circulation (AMOC). Uncertainties about climate model parameters play a key role in uncertainties in AMOC projections. When the observational data and the climate model output are high-dimensional spatial data sets, the data are typically aggregated due to computational constraints. The effects of aggregation are unclear because statistically rigorous approaches for model parameter inference have been infeasible for high-resolution data. Here we develop a flexible and computationally efficient approach using principal components and basis expansions to study the effect of spatial data aggregation on parametric and projection uncertainties. Our Bayesian reduced-dimensional calibration approach allows us to study the effect of complicated error structures and data-model discrepancies on our ability to learn about climate model parameters from high-dimensional data. Considering high-dimensional spatial observations reduces the effect of deep uncertainty associated with prior specifications for the data-model discrepancy. Also, using the unaggregated data results in sharper projections based on our climate model. Our computationally efficient approach may be widely applicable to a variety of high-dimensional computer model calibration problems.
Submitted 31 July, 2014; v1 submitted 6 March, 2013;
originally announced March 2013.