-
Climate & BCG: Effects on COVID-19 Death Growth Rates
Authors:
Chris Finlay,
Bruce A. Bassett
Abstract:
Multiple studies have suggested the spread of COVID-19 is affected by factors such as climate, BCG vaccinations, pollution and blood type. We perform a joint study of these factors using the death growth rates of 40 regions worldwide with both machine learning and Bayesian methods. We find weak, non-significant (< 3$σ$) evidence for temperature and relative humidity as factors in the spread of COVID-19 but little or no evidence for BCG vaccination prevalence or $\text{PM}_{2.5}$ pollution. The only variable detected at a statistically significant level (>3$σ$) is the rate of positive COVID-19 tests, with higher positive rates correlating with higher daily growth of deaths.
Submitted 10 July, 2020;
originally announced July 2020.
-
A Flexible Framework for Anomaly Detection via Dimensionality Reduction
Authors:
Alireza Vafaei Sadr,
Bruce A. Bassett,
Martin Kunz
Abstract:
Anomaly detection is challenging, especially for large datasets in high dimensions. Here we explore a general anomaly detection framework based on dimensionality reduction and unsupervised clustering. We release DRAMA, a general python package that implements the general framework with a wide range of built-in options. We test DRAMA on a wide variety of simulated and real datasets, in up to 3000 dimensions, and find it robust and highly competitive with commonly-used anomaly detection algorithms, especially in high dimensions. The flexibility of the DRAMA framework allows for significant optimization once some examples of anomalies are available, making it ideal for online anomaly detection, active learning and highly unbalanced datasets.
Submitted 9 September, 2019;
originally announced September 2019.
-
Bayesian Anomaly Detection and Classification
Authors:
Ethan Roberts,
Bruce A. Bassett,
Michelle Lochner
Abstract:
Statistical uncertainties are rarely incorporated in machine learning algorithms, especially for anomaly detection. Here we present the Bayesian Anomaly Detection And Classification (BADAC) formalism, which provides a unified statistical approach to classification and anomaly detection within a hierarchical Bayesian framework. BADAC deals with uncertainties by marginalising over the unknown, true, value of the data. Using simulated data with Gaussian noise, BADAC is shown to be superior to standard algorithms in both classification and anomaly detection performance in the presence of uncertainties, though with significantly increased computational cost. Additionally, BADAC provides well-calibrated classification probabilities, valuable for use in scientific pipelines. We show that BADAC can work in online mode and is fairly robust to model errors, which can be diagnosed through model-selection methods. In addition it can perform unsupervised new class detection and can naturally be extended to search for anomalous subsets of data. BADAC is therefore ideal where computational cost is not a limiting factor and statistical rigour is important. We discuss approximations to speed up BADAC, such as the use of Gaussian processes, and finally introduce a new metric, the Rank-Weighted Score (RWS), that is particularly suited to evaluating the ability of algorithms to detect anomalies.
Submitted 22 February, 2019;
originally announced February 2019.
-
DeepSource: Point Source Detection using Deep Learning
Authors:
A. Vafaei Sadr,
Etienne E. Vos,
Bruce A. Bassett,
Zafiirah Hosenie,
N. Oozeer,
Michelle Lochner
Abstract:
Point source detection at low signal-to-noise is challenging for astronomical surveys, particularly in radio interferometry images where the noise is correlated. Machine learning is a promising solution, allowing the development of algorithms tailored to specific telescope arrays and science cases. We present DeepSource - a deep learning solution - that uses convolutional neural networks to achieve these goals. DeepSource enhances the Signal-to-Noise Ratio (SNR) of the original map and then uses dynamic blob detection to detect sources. Trained and tested on two sets of 500 simulated 1 deg x 1 deg MeerKAT images with a total of 300,000 sources, DeepSource is essentially perfect in both purity and completeness down to SNR = 4 and outperforms PyBDSF in all metrics. For uniformly-weighted images it achieves a Purity x Completeness (PC) score at SNR = 3 of 0.73, compared to 0.31 for the best PyBDSF model. For natural-weighting we find a smaller improvement of ~40% in the PC score at SNR = 3. If instead we ask where either of the purity or completeness first drop to 90%, we find that DeepSource reaches this value at SNR = 3.6 compared to the 4.3 of PyBDSF (natural-weighting). A key advantage of DeepSource is that it can learn to optimally trade off purity and completeness for any science case under consideration. Our results show that deep learning is a promising approach to point source detection in astronomical images.
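The Purity x Completeness (PC) score quoted above can be illustrated with a minimal sketch. It assumes the usual definitions (purity as the fraction of reported detections that are real, completeness as the fraction of real sources that are detected); the function name and the counts are hypothetical and are not taken from the DeepSource code or results.

```python
def purity_completeness(true_pos, false_pos, false_neg):
    """Return (purity, completeness, PC score) from detection counts.

    Assumed definitions (not quoted from the paper):
      purity       = TP / (TP + FP)
      completeness = TP / (TP + FN)
    """
    detected = true_pos + false_pos   # everything the detector reported
    real = true_pos + false_neg       # every source truly present
    purity = true_pos / detected if detected else 0.0
    completeness = true_pos / real if real else 0.0
    return purity, completeness, purity * completeness


# Illustrative counts only; these are not results from the paper.
p, c, pc = purity_completeness(true_pos=90, false_pos=10, false_neg=30)
print(f"purity={p:.2f} completeness={c:.2f} PC={pc:.3f}")
```

Multiplying the two quantities penalises a detector that buys purity by reporting very few sources, or completeness by reporting many spurious ones.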
Submitted 7 July, 2018;
originally announced July 2018.
-
Automated Classification of Text Sentiment
Authors:
Emmanuel Dufourq,
Bruce A. Bassett
Abstract:
The ability to identify sentiment in text, referred to as sentiment analysis, is one which is natural to adult humans. This task is, however, not one which a computer can perform by default. Identifying sentiments in an automated, algorithmic manner will be a useful capability for business and research in their search to understand what consumers think about their products or services and to understand human sociology. Here we propose two new Genetic Algorithms (GAs) for the task of automated text sentiment analysis. The GAs learn whether words occurring in a text corpus are either sentiment or amplifier words, and their corresponding magnitude. Sentiment words, such as 'horrible', add linearly to the final sentiment. Amplifier words in contrast, which are typically adjectives/adverbs like 'very', multiply the sentiment of the following word. This increases, decreases or negates the sentiment of the following word. The sentiment of the full text is then the sum of these terms. This approach grows both a sentiment and amplifier dictionary which can be reused for other purposes and fed into other machine learning algorithms. We report the results of multiple experiments conducted on large Amazon data sets. The results reveal that our proposed approach was able to outperform several public and/or commercial sentiment analysis algorithms.
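The scoring rule described in this abstract (sentiment words add linearly, amplifier words multiply the sentiment of the following word, and the text score is the sum) might be sketched as follows. The dictionaries and their values are hypothetical, and the handling of chained amplifiers (multiplying them together) is an assumption; in the paper the dictionaries are learned by the Genetic Algorithms, which this sketch does not implement.

```python
def score_text(tokens, sentiment, amplifier):
    """Sum sentiment values, scaling each by any amplifiers directly before it.

    sentiment: dict word -> additive sentiment value (hypothetical values)
    amplifier: dict word -> multiplicative factor (hypothetical values)
    """
    total = 0.0
    factor = 1.0  # product of the amplifiers seen since the last other word
    for tok in tokens:
        if tok in amplifier:
            # Assumption: consecutive amplifiers compound multiplicatively.
            factor *= amplifier[tok]
        elif tok in sentiment:
            total += factor * sentiment[tok]
            factor = 1.0
        else:
            # A neutral word has zero sentiment, so the amplifier is spent.
            factor = 1.0
    return total


# Hypothetical dictionaries for illustration.
sentiment = {"good": 1.0, "horrible": -2.0}
amplifier = {"very": 2.0, "not": -1.0}

print(score_text("the film was not very good".split(), sentiment, amplifier))
print(score_text("a very horrible ending".split(), sentiment, amplifier))
```

A negative factor such as the one assigned to 'not' negates the following word, while factors above or below one increase or decrease it, matching the three cases named in the abstract.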
Submitted 5 April, 2018;
originally announced April 2018.
-
EDEN: Evolutionary Deep Networks for Efficient Machine Learning
Authors:
Emmanuel Dufourq,
Bruce A. Bassett
Abstract:
Deep neural networks continue to show improved performance with increasing depth, an encouraging trend that implies an explosion in the possible permutations of network architectures and hyperparameters for which there is little intuitive guidance. To address this increasing complexity, we propose Evolutionary DEep Networks (EDEN), a computationally efficient neuro-evolutionary algorithm which interfaces to any deep neural network platform, such as TensorFlow. We show that EDEN evolves simple yet successful architectures built from embedding, 1D and 2D convolutional, max pooling and fully connected layers along with their hyperparameters. Evaluation of EDEN across seven image and sentiment classification datasets shows that it reliably finds good networks -- and in three cases achieves state-of-the-art results -- even on a single GPU, in just 6-24 hours. Our study provides a first attempt at applying neuro-evolution to the creation of 1D convolutional networks for sentiment analysis including the optimisation of the embedding layer.
Submitted 26 September, 2017;
originally announced September 2017.
-
Text Compression for Sentiment Analysis via Evolutionary Algorithms
Authors:
Emmanuel Dufourq,
Bruce A. Bassett
Abstract:
Can textual data be compressed intelligently without losing accuracy in evaluating sentiment? In this study, we propose a novel evolutionary compression algorithm, PARSEC (PARts-of-Speech for sEntiment Compression), which makes use of Parts-of-Speech tags to compress text in a way that sacrifices minimal classification accuracy when used in conjunction with sentiment analysis algorithms. An analysis of PARSEC with eight commercial and non-commercial sentiment analysis algorithms on twelve English sentiment data sets reveals that accurate compression is possible with (0%, 1.3%, 3.3%) loss in sentiment classification accuracy for (20%, 50%, 75%) data compression with PARSEC using LingPipe, the most accurate of the sentiment algorithms. Other sentiment analysis algorithms are more severely affected by compression. We conclude that significant compression of text data is possible for sentiment analysis depending on the accuracy demands of the specific application and the specific sentiment analysis algorithm used.
Submitted 20 September, 2017;
originally announced September 2017.
-
Automated Problem Identification: Regression vs Classification via Evolutionary Deep Networks
Authors:
Emmanuel Dufourq,
Bruce A. Bassett
Abstract:
Regression or classification? This is perhaps the most basic question faced when tackling a new supervised learning problem. We present an Evolutionary Deep Learning (EDL) algorithm that automatically solves this by identifying the question type with high accuracy, along with a proposed deep architecture. Typically, a significant amount of human insight and preparation is required prior to executing machine learning algorithms. For example, when creating deep neural networks, the number of parameters must be selected in advance and furthermore, a lot of these choices are made based upon pre-existing knowledge of the data such as the use of a categorical cross entropy loss function. Humans are able to study a dataset and decide whether it represents a classification or a regression problem, and consequently make decisions which will be applied to the execution of the neural network. We propose the Automated Problem Identification (API) algorithm, which uses an evolutionary algorithm interface to TensorFlow to manipulate a deep neural network to decide if a dataset represents a classification or a regression problem. We test API on 16 different classification, regression and sentiment analysis datasets with up to 10,000 features and up to 17,000 unique target values. API achieves an average accuracy of $96.3\%$ in identifying the problem type without hardcoding any insights about the general characteristics of regression or classification problems. For example, API successfully identifies classification problems even with 1000 target values. Furthermore, the algorithm recommends which loss function to use and also recommends a neural network architecture. Our work is therefore a step towards fully automated machine learning.
Submitted 3 July, 2017;
originally announced July 2017.
-
Bayes Factors via Savage-Dickey Supermodels
Authors:
A. Mootoovaloo,
Bruce A. Bassett,
M. Kunz
Abstract:
We outline a new method to compute the Bayes Factor for model selection which bypasses the Bayesian Evidence. Our method combines multiple models into a single, nested, Supermodel using one or more hyperparameters. Since the models are now nested the Bayes Factors between the models can be efficiently computed using the Savage-Dickey Density Ratio (SDDR). In this way model selection becomes a problem of parameter estimation. We consider two ways of constructing the supermodel in detail: one based on combined models, and a second based on combined likelihoods. We report on these two approaches for a Gaussian linear model for which the Bayesian evidence can be calculated analytically and a toy nonlinear problem. Unlike the combined model approach, where a standard Monte Carlo Markov Chain (MCMC) struggles, the combined-likelihood approach fares much better in providing a reliable estimate of the log-Bayes Factor. This scheme potentially opens the way to computationally efficient ways to compute Bayes Factors in high dimensions that exploit the good scaling properties of MCMC, as compared to methods such as nested sampling that fail for high dimensions.
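As a small illustration of the Savage-Dickey Density Ratio on its own (not of the supermodel construction), the sketch below uses a toy nested pair chosen for this example: Gaussian data with known noise, where the simpler model fixes the mean at zero and the larger model gives it a zero-centred Gaussian prior. The SDDR is then the posterior density divided by the prior density at the nested value, and for this toy case it can be checked against the ratio of analytic evidences. All names and numbers are assumptions made for the illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def sddr_bayes_factor(ybar, n, sigma, tau):
    """B_01 via Savage-Dickey: posterior over prior density at mu = 0.

    Toy setup: n data points with sample mean ybar and known noise sigma;
    M1 has prior mu ~ N(0, tau^2); M0 is the nested model mu = 0.
    """
    v = sigma ** 2 / n                       # variance of the sample mean
    post_var = 1.0 / (1.0 / v + 1.0 / tau ** 2)
    post_mean = post_var * ybar / v
    return normal_pdf(0.0, post_mean, post_var) / normal_pdf(0.0, 0.0, tau ** 2)


def evidence_bayes_factor(ybar, n, sigma, tau):
    """B_01 from the analytic evidences, for comparison."""
    v = sigma ** 2 / n
    return normal_pdf(ybar, 0.0, v) / normal_pdf(ybar, 0.0, v + tau ** 2)


# Illustrative numbers only.
args = dict(ybar=0.3, n=25, sigma=1.0, tau=2.0)
print(sddr_bayes_factor(**args), evidence_bayes_factor(**args))
```

The two functions agree for this conjugate case, which is the sense in which the density ratio turns model selection into parameter estimation: only the posterior of the larger model is needed.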
Submitted 7 September, 2016;
originally announced September 2016.
-
Generalised Fisher Matrices
Authors:
A. F. Heavens,
M. Seikel,
B. D. Nord,
M. Aich,
Y. Bouffanais,
B. A. Bassett,
M. P. Hobson
Abstract:
The Fisher Information Matrix formalism is extended to cases where the data is divided into two parts (X,Y), where the expectation value of Y depends on X according to some theoretical model, and X and Y both have errors with arbitrary covariance. In the simplest case, (X,Y) represent data pairs of abscissa and ordinate, in which case the analysis deals with the case of data pairs with errors in both coordinates, but X can be any measured quantities on which Y depends. The analysis applies for arbitrary covariance, provided all errors are gaussian, and provided the errors in X are small, both in comparison with the scale over which the expected signal Y changes, and with the width of the prior distribution. This generalises the Fisher Matrix approach, which normally only considers errors in the `ordinate' Y. In this work, we include errors in X by marginalising over latent variables, effectively employing a Bayesian hierarchical model, and deriving the Fisher Matrix for this more general case. The methods here also extend to likelihood surfaces which are not gaussian in the parameter space, and so techniques such as DALI (Derivative Approximation for Likelihoods) can be generalised straightforwardly to include arbitrary gaussian data error covariances. For simple mock data and theoretical models, we compare to Markov Chain Monte Carlo experiments, illustrating the method with cosmological supernova data. We also include the new method in the Fisher4Cast software.
Submitted 14 September, 2014; v1 submitted 10 April, 2014;
originally announced April 2014.
-
BEAMS: separating the wheat from the chaff in supernova analysis
Authors:
Martin Kunz,
Renée Hlozek,
Bruce A. Bassett,
Mathew Smith,
James Newling,
Melvin Varughese
Abstract:
We introduce Bayesian Estimation Applied to Multiple Species (BEAMS), an algorithm designed to deal with parameter estimation when using contaminated data. We present the algorithm and demonstrate how it works with the help of a Gaussian simulation. We then apply it to supernova data from the Sloan Digital Sky Survey (SDSS), showing how the resulting confidence contours of the cosmological parameters shrink significantly.
Submitted 29 October, 2012;
originally announced October 2012.
-
Parameter Estimation with BEAMS in the presence of biases and correlations
Authors:
James Newling,
Bruce A. Bassett,
Renée Hlozek,
Martin Kunz,
Mathew Smith,
Melvin Varughese
Abstract:
The original formulation of BEAMS - Bayesian Estimation Applied to Multiple Species - showed how to use a dataset contaminated by points of multiple underlying types to perform unbiased parameter estimation. An example is cosmological parameter estimation from a photometric supernova sample contaminated by unknown Type Ibc and II supernovae. Where other methods require data cuts to increase purity, BEAMS uses all of the data points in conjunction with their probabilities of being each type. Here we extend the BEAMS formalism to allow for correlations between the data and the type probabilities of the objects as can occur in realistic cases. We show with simple simulations that this extension can be crucial, providing a 50% reduction in parameter estimation variance when such correlations do exist. We then go on to perform tests to quantify the importance of the type probabilities, one of which illustrates the effect of biasing the probabilities in various ways. Finally, a general presentation of the selection bias problem is given, and discussed in the context of future photometric supernova surveys and BEAMS, which lead to specific recommendations for future supernova surveys.
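The idea of using every data point together with its type probability, rather than cutting the sample, can be sketched with a two-population mixture likelihood. This is a simplified illustration under stated assumptions: both populations are taken to be Gaussian, the probabilities are treated as independent of the data (the uncorrelated case that this paper goes beyond), and all function names and numbers are hypothetical.

```python
import math


def gauss_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_log_like(data, probs, mu_a, sigma_a, mu_b, sigma_b):
    """Log-likelihood in which point i is type A with probability probs[i].

    Each point contributes p_i * L_A(x_i) + (1 - p_i) * L_B(x_i), so no
    point is discarded; unlikely members are simply down-weighted.
    """
    total = 0.0
    for x, p in zip(data, probs):
        total += math.log(
            p * gauss_pdf(x, mu_a, sigma_a) + (1 - p) * gauss_pdf(x, mu_b, sigma_b)
        )
    return total


# Illustrative sample: three likely type-A points near 0, two contaminants near 3.
data = [0.1, -0.2, 0.05, 3.1, 2.9]
probs = [0.95, 0.95, 0.95, 0.05, 0.05]

naive_mean = sum(data) / len(data)  # biased upward by the contaminants
print(mixture_log_like(data, probs, 0.0, 0.5, 3.0, 0.5))
print(mixture_log_like(data, probs, naive_mean, 0.5, 3.0, 0.5))
```

In this toy sample the mixture likelihood prefers a type-A mean near zero over the naive sample mean, which is the kind of bias the type probabilities are meant to remove.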
Submitted 27 October, 2011;
originally announced October 2011.
-
Statistical Classification Techniques for Photometric Supernova Typing
Authors:
James Newling,
Melvin Varughese,
Bruce A. Bassett,
Heather Campbell,
Renée Hlozek,
Martin Kunz,
Hubert Lampeitl,
Bryony Martin,
Robert Nichol,
David Parkinson,
Mathew Smith
Abstract:
Future photometric supernova surveys will produce vastly more candidates than can be followed up spectroscopically, highlighting the need for effective classification methods based on lightcurves alone. Here we introduce boosting and kernel density estimation techniques which have minimal astrophysical input, and compare their performance on 20,000 simulated Dark Energy Survey lightcurves. We demonstrate that these methods are comparable to the best template fitting methods currently used, and in particular do not require the redshift of the host galaxy or candidate. However both methods require a training sample that is representative of the full population, so typical spectroscopic supernova subsamples will lead to poor performance. To enable the full potential of such blind methods, we recommend that representative training samples should be used and so specific attention should be given to their creation in the design phase of future photometric surveys.
Submitted 8 October, 2010; v1 submitted 5 October, 2010;
originally announced October 2010.