-
Detection of evolutionary shifts in variance under an Ornstein-Uhlenbeck model
Authors:
Wensha Zhang,
Lam Si Tung Ho,
Toby Kenney
Abstract:
Abrupt environmental changes can lead to evolutionary shifts in not only the optimal trait value, but also the rate of adaptation and the diffusion variance in trait evolution. While several methods exist for detecting shifts in optimal values, few explicitly model shifts in both evolutionary variance and adaptation rates. We use a multi-optima and multi-variance Ornstein-Uhlenbeck (OU) process model to describe trait evolution with shifts in both optimal value and diffusion variance and analyze how covariance between species is affected when shifts in variance occur along the phylogeny. We propose a new method that simultaneously detects shifts in both variance and optimal values by formulating the problem as a variable selection task using an L1-penalized loss function. Our method is implemented in the R package ShiVa (Detection of evolutionary Shifts in Variance). Through simulations, we compare ShiVa with methods that only consider shifts in optimal values (l1ou; PhylogeneticEM), and with PCMFit. Our method demonstrates improved predictive ability and significantly reduces false positives in detecting optimal value shifts when variance shifts are present. When only shifts in optimal value occur, our method performs comparably to existing approaches. Applying ShiVa to empirical data from cordylid lizards, we find that it outperforms l1ou and PhylogeneticEM, achieving the highest log-likelihood and lowest BIC.
Submitted 31 March, 2025; v1 submitted 29 December, 2023;
originally announced December 2023.
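
To make concrete how a variance shift changes the covariance between species, here is a minimal sketch for an OU process with a fixed root value, a constant adaptation rate, and a diffusion variance that is piecewise constant along the lineage shared by two tips. The function name, the segment representation, and the fixed-root assumption are illustrative choices made here; they are not ShiVa's API and may differ from the paper's parametrization.

```python
import math

def ou_tip_covariance(alpha, shared_segments, t_i, t_j):
    """Covariance of two tip values under an OU process with a fixed root.

    alpha: adaptation rate (> 0), assumed constant over the tree.
    shared_segments: (start, end, sigma2) tuples covering the path from the
        root (time 0) to the most recent common ancestor, with a constant
        diffusion variance sigma2 on each segment.
    t_i, t_j: root-to-tip times of the two species.
    """
    total = 0.0
    for start, end, sigma2 in shared_segments:
        # Integral of sigma2 * exp(2 * alpha * s) over [start, end].
        total += sigma2 * (math.exp(2 * alpha * end)
                           - math.exp(2 * alpha * start)) / (2 * alpha)
    return math.exp(-alpha * (t_i + t_j)) * total

# Two species that split at time 0.6 on an ultrametric tree of height 1.
no_shift = ou_tip_covariance(1.0, [(0.0, 0.6, 1.0)], 1.0, 1.0)
# Same tree, but the variance doubles on the shared lineage from time 0.3.
with_shift = ou_tip_covariance(1.0, [(0.0, 0.3, 1.0), (0.3, 0.6, 2.0)], 1.0, 1.0)
print(no_shift, with_shift)
```

In this sketch only segments on the shared lineage enter the covariance; a variance shift on a branch leading to a single tip would change that tip's variance but not its covariance with other species.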
-
Rank Selection for Non-negative Matrix Factorization
Authors:
Yun Cai,
Hong Gu,
Toby Kenney
Abstract:
Non-Negative Matrix Factorization (NMF) is a widely used dimension reduction method that factorizes a non-negative data matrix into two lower dimensional non-negative matrices: one is the basis or feature matrix, which consists of the variables, and the other is the coefficients matrix, which contains the projections of data points onto the new basis. The features can be interpreted as sub-structures of the data. The number of sub-structures in the feature matrix is also called the rank, which is the only tuning parameter in NMF. An appropriate rank will extract the key latent features while minimizing the noise from the original data. In this paper, we develop a novel rank selection method based on hypothesis testing, using a deconvolved bootstrap distribution to assess the significance level accurately despite the large amount of optimization error. In the simulation section, we compare our method with a rank selection method based on hypothesis testing using bootstrap distribution without deconvolution, and with a cross-validated imputation method. Through simulations, we demonstrate that our method is not only accurate at estimating the true ranks for NMF, especially when the features are hard to distinguish, but also computationally efficient. When applied to real microbiome data (e.g., OTU data and functional metagenomic data), our method also shows the ability to extract interpretable sub-communities in the data.
Submitted 1 November, 2022;
originally announced November 2022.
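
For readers unfamiliar with the factorization itself, the sketch below fits an NMF for several candidate ranks using the classical multiplicative updates for squared error. This is a stand-in fitting routine chosen for brevity; it does not implement the paper's hypothesis-testing or deconvolved-bootstrap rank selection, and the helper names and toy matrix are assumptions made for illustration.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iters=200, seed=0, eps=1e-12):
    """Factorize non-negative v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # Update the coefficients h with the features w held fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # Update the features w with the coefficients h held fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

# Toy data with an exact non-negative factorization of rank 2.
true_w = [[1, 0], [0, 1], [1, 1], [2, 1]]
true_h = [[1, 2, 0, 1], [0, 1, 3, 1]]
v = matmul(true_w, true_h)
errors = {r: frobenius_error(v, *nmf(v, r)) for r in (1, 2, 3)}
print(errors)
```

Comparing the reconstruction error across ranks shows why the rank is the tuning parameter of interest: the error can only be judged relative to optimization noise, which is the difficulty the abstract says the deconvolved bootstrap is meant to address.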
-
Evolutionary shift detection with ensemble variable selection
Authors:
Wensha Zhang,
Toby Kenney,
Lam Si Tung Ho
Abstract:
1. Abrupt environmental changes can lead to evolutionary shifts in trait evolution. Identifying these shifts is an important step in understanding the evolutionary history of phenotypes.
2. We propose an ensemble variable selection method (R package ELPASO) for the evolutionary shift detection task and compare it with existing methods (R packages l1ou and PhylogeneticEM) under several scenarios.
3. The performances of methods are highly dependent on the selection criterion. When the signal sizes are small, the methods using the Bayesian information criterion (BIC) have better performances. And when the signal sizes are large enough, the methods using the phylogenetic Bayesian information criterion (pBIC) (Khabbazian et al., 2016) have better performance. Moreover, the performance is heavily impacted by measurement error and tree reconstruction error.
4. Ensemble method + pBIC tends to perform less conservatively than l1ou + pBIC, and Ensemble method + BIC is more conservative than l1ou + BIC. PhylogeneticEM is even more conservative with small signal sizes and falls between l1ou + pBIC and Ensemble method + BIC with large signal sizes. The results can differ between the methods, but none clearly outperforms the others. By applying multiple methods to a single dataset, we can assess the robustness of each detected shift, based on the agreement among methods.
Submitted 12 April, 2022;
originally announced April 2022.
-
Deconvolution density estimation with penalised MLE
Authors:
Yun Cai,
Hong Gu,
Toby Kenney
Abstract:
Deconvolution is the important problem of estimating the distribution of a quantity of interest from a sample with additive measurement error. Nearly all methods in the literature are based on Fourier transformation because it is mathematically a very neat solution. However, in practice these methods are unstable, and produce bad estimates when signal-noise ratio or sample size are low. In this paper, we develop a new deconvolution method based on maximum likelihood with a smoothness penalty. We show that our new method has much better performance than existing methods, particularly for small sample size or signal-noise ratio.
Submitted 24 May, 2021;
originally announced May 2021.
-
Stochastic Generalized Lotka-Volterra Model with An Application to Learning Microbial Community Structures
Authors:
Libai Xu,
Ximing Xu,
Dehan Kong,
Hong Gu,
Toby Kenney
Abstract:
Inferring microbial community structure based on temporal metagenomics data is an important goal in microbiome studies. The deterministic generalized Lotka-Volterra differential (GLV) equations have been used to model the dynamics of microbial data. However, these approaches fail to take random environmental fluctuations into account, which may negatively impact the estimates. We propose a new stochastic GLV (SGLV) differential equation model, where the random perturbations of Brownian motion in the model can naturally account for the external environmental effects on the microbial community. We establish new conditions and show various mathematical properties of the solutions including general existence and uniqueness, stationary distribution, and ergodicity. We further develop approximate maximum likelihood estimators based on discrete observations and systematically investigate the consistency and asymptotic normality of the proposed estimators. Our method is demonstrated through simulation studies and an application to the well-known "moving picture" temporal microbial dataset.
Submitted 22 September, 2020;
originally announced September 2020.
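
As an illustration of what a stochastic GLV trajectory looks like, the sketch below simulates one under the assumption that the noise is multiplicative, i.e. dX_i = X_i (r_i + sum_j a_ij X_j) dt + sigma_i X_i dW_i. That noise form, the parameter names, and the two-species example are assumptions made here and may differ from the paper's model. The scheme applies an Euler step to the log-abundances, which keeps the simulated abundances positive.

```python
import math
import random

def simulate_sglv(x0, growth, interactions, sigma, dt=0.01, steps=1000, seed=0):
    """Simulate abundances under a GLV model with multiplicative noise."""
    rng = random.Random(seed)
    x = list(x0)
    path = [list(x)]
    for _ in range(steps):
        new_x = []
        for i in range(len(x)):
            # Per-capita growth rate: r_i + sum_j a_ij * x_j.
            drift = growth[i] + sum(a * xj for a, xj in zip(interactions[i], x))
            # Ito correction for the log-abundance: subtract sigma_i^2 / 2.
            log_step = ((drift - 0.5 * sigma[i] ** 2) * dt
                        + sigma[i] * math.sqrt(dt) * rng.gauss(0.0, 1.0))
            new_x.append(x[i] * math.exp(log_step))
        x = new_x
        path.append(list(x))
    return path

# Two competing taxa; with sigma = 0 the deterministic equilibrium is (2/3, 2/3).
growth = [1.0, 1.0]
interactions = [[-1.0, -0.5], [-0.5, -1.0]]
path = simulate_sglv([0.5, 0.2], growth, interactions, sigma=[0.2, 0.2])
print(path[-1])
```

Setting `sigma` to zero recovers the deterministic GLV dynamics, which makes it easy to see how much of a trajectory's variation is attributed to environmental fluctuation in this kind of model.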
-
SuRF: a New Method for Sparse Variable Selection, with Application in Microbiome Data Analysis
Authors:
Lihui Liu,
Hong Gu,
Johan Van Limbergen,
Toby Kenney
Abstract:
In this paper, we present a new variable selection method for regression and classification purposes. Our method, called Subsampling Ranking Forward selection (SuRF), is based on LASSO penalised regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that can implement our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach performs better than several other popularly used existing approaches in recovering the true variables. We apply SuRF to two microbiome data sets: one about prediction of pouchitis and another for identifying samples from two healthy individuals. We find that SuRF can provide a better or comparable prediction with other methods while controlling the false positive rate of variable selection.
Submitted 13 September, 2019;
originally announced September 2019.
-
Poisson PCA: Poisson Measurement Error corrected PCA, with Application to Microbiome Data
Authors:
Toby Kenney,
Tianshu Huang,
Hong Gu
Abstract:
In this paper, we study the problem of computing a Principal Component Analysis of data affected by Poisson noise. We assume samples are drawn from independent Poisson distributions. We want to estimate principal components of a fixed transformation of the latent Poisson means. Our motivating example is microbiome data, though the methods apply to many other situations. We develop a semiparametric approach to correct the bias of variance estimators, both for untransformed and transformed (with particular attention to log-transformation) Poisson means. Furthermore, we incorporate methods for correcting different exposure or sequencing depth in the data. In addition to identifying the principal components, we also address the non-trivial problem of computing the principal scores in this semiparametric framework. Most previous approaches tend to take a more parametric line, for example the Poisson-log-normal (PLN) model approach. We compare our method with the PLN approach and find that our method is better at identifying the main principal components of the latent log-transformed Poisson means, and as a further major advantage, takes far less time to compute. Comparing methods on real data, we see that our method also appears to be more robust to outliers than the parametric method.
Submitted 26 April, 2019;
originally announced April 2019.
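
The bias being corrected can be seen in the simplest, untransformed case. If each count is Poisson given its latent mean, the law of total variance gives Var(X) = Var(Λ) + E(Λ), while covariances between conditionally independent counts are unaffected. The sketch below therefore subtracts the sample mean from the diagonal of the sample covariance. It covers only this untransformed case and is not the paper's semiparametric estimator for transformed means; the function names and simulated data are illustrative assumptions.

```python
import math
import random

def poisson(rng, lam):
    # Knuth's multiplication method; adequate for moderate lam.
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def corrected_covariance(counts):
    """Sample covariance of counts with the Poisson noise removed from the diagonal."""
    n, d = len(counts), len(counts[0])
    means = [sum(row[j] for row in counts) / n for j in range(d)]
    cov = [[sum((row[a] - means[a]) * (row[b] - means[b]) for row in counts) / (n - 1)
            for b in range(d)] for a in range(d)]
    for j in range(d):
        cov[j][j] -= means[j]  # Var(X_j) = Var(Lambda_j) + E(Lambda_j)
    return cov

rng = random.Random(1)
# Latent means: Lambda_1 uniform on [2, 10], Lambda_2 = 0.5 * Lambda_1 + 1.
latent = [rng.uniform(2.0, 10.0) for _ in range(5000)]
counts = [[poisson(rng, lam), poisson(rng, 0.5 * lam + 1.0)] for lam in latent]
corrected = corrected_covariance(counts)
true_var = (10.0 - 2.0) ** 2 / 12.0  # variance of Lambda_1
print(corrected[0][0], true_var)
```

Principal components of the latent means would then be taken from the corrected matrix rather than from the raw sample covariance, whose diagonal is inflated by the Poisson noise.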
-
Prior Distributions for Ranking Problems
Authors:
Toby Kenney,
Hao He,
Hong Gu
Abstract:
The ranking problem is to order a collection of units by some unobserved parameter, based on observations from the associated distribution. This problem arises naturally in a number of contexts, such as business, where we may want to rank potential projects by profitability; or science, where we may want to rank variables potentially associated with some trait by the strength of the association. Most approaches to this problem are empirical Bayesian, where we use the data to estimate the hyperparameters of the prior distribution, then use that distribution to estimate the unobserved parameter values. There are a number of different approaches to this problem, based on different loss functions for mis-ranking units. However, little has been done on the choice of prior distribution. Typical approaches involve choosing a conjugate prior for convenience, and estimating the hyperparameters by MLE from the whole dataset. In this paper, we look in more detail at the effect of choice of prior distribution on Bayesian ranking. We focus on the use of posterior mean for ranking, but many of our conclusions should apply to other ranking criteria, and it is not too difficult to adapt our methods to other choices of prior distributions.
Submitted 27 October, 2016;
originally announced October 2016.
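
The "typical approach" described in the abstract can be sketched with a conjugate normal prior: each observation x_i is assumed normal around an unobserved θ_i with known standard error s_i, and θ_i is assumed normal with mean μ and variance τ². For brevity the hyperparameters below are estimated by simple moment matching rather than by MLE, and the function name and data are made up for illustration.

```python
def posterior_mean_ranking(obs, se):
    """Rank units by posterior mean under a normal prior (standard errors must be > 0)."""
    n = len(obs)
    mu = sum(obs) / n
    total_var = sum((x - mu) ** 2 for x in obs) / (n - 1)
    # Moment estimate of the prior variance: total variance minus average noise.
    tau2 = max(0.0, total_var - sum(s ** 2 for s in se) / n)
    # Each observation is shrunk towards mu, more strongly when its se is large.
    post = [mu + tau2 / (tau2 + s ** 2) * (x - mu) for x, s in zip(obs, se)]
    order = sorted(range(n), key=lambda i: post[i], reverse=True)
    return post, order

obs = [3.0, 2.5, 0.0, -1.0, 0.5, 1.0]
se = [2.0, 0.3, 0.5, 0.5, 0.5, 0.5]
post, order = posterior_mean_ranking(obs, se)
print(order)  # unit 1 ranks above unit 0 despite its smaller observed value
```

Unit 0 has the largest observation but also the largest standard error, so it is shrunk further towards the overall mean and drops below unit 1. How strongly such shrinkage reorders units depends on the assumed prior, which is the choice the paper examines.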
-
The Adequate Bootstrap
Authors:
Toby Kenney,
Hong Gu
Abstract:
There is a fundamental disconnect between what is tested in a model adequacy test, and what we would like to test. The usual approach is to test the null hypothesis "Model M is the true model." However, Model M is never the true model. A model might still be useful even if we have enough data to reject it. In this paper, we present a technique to assess the adequacy of a model from the philosophical standpoint that we know the model is not true, but we want to know if it is useful.
Our solution to this problem is to measure the parameter uncertainty in our estimates caused by the model uncertainty. We use bootstrap inference on samples of a smaller size, for which the model cannot be rejected. We use a model adequacy test to choose a bootstrap size with limited probability of rejecting the model and perform inference for samples of this size based on a nonparametric bootstrap. Our idea is that if we base our inference on a sample size at which we do not reject the model, then we should be happy with this inference, because we would have been confident in it if our original dataset had been this size.
Submitted 21 August, 2016;
originally announced August 2016.
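
The inference step described above, a nonparametric bootstrap on samples smaller than the original dataset, can be sketched as follows. The bootstrap size is simply passed in here; the paper's procedure for choosing it with a model adequacy test is not implemented, and the function name, percentile interval, and simulated data are assumptions made for illustration.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_interval(data, size, stat=mean, reps=2000, level=0.95, seed=0):
    """Percentile interval for stat, based on resamples of the given size."""
    rng = random.Random(seed)
    # Resample with replacement, using `size` draws instead of len(data).
    estimates = sorted(stat(rng.choices(data, k=size)) for _ in range(reps))
    lo = estimates[int((1 - level) / 2 * reps)]
    hi = estimates[int((1 + level) / 2 * reps) - 1]
    return lo, hi

data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(400)]
full = bootstrap_interval(data, len(data))  # usual bootstrap, size n
reduced = bootstrap_interval(data, 25)      # reduced bootstrap size
print(full, reduced)
```

The interval from the smaller bootstrap size is wider, which reflects the idea in the abstract: the extra width stands in for uncertainty due to the model being only approximately right.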