-
High-dimensional multi-trait GWAS by reverse prediction of genotypes
Authors:
Muhammad Ammar Malik,
Adriaan-Alexander Ludl,
Tom Michoel
Abstract:
Multi-trait genome-wide association studies (GWAS) use multivariate statistical methods to identify associations between genetic variants and multiple correlated traits simultaneously, and have higher statistical power than independent univariate analyses of traits. Reverse regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a promising approach to perform multi-trait GWAS in high-dimensional settings where the number of traits exceeds the number of samples. We analyzed different machine learning methods (ridge regression, naive Bayes/independent univariate, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate methods. We found that genotype prediction performance, in terms of root mean squared error (RMSE), made it possible to distinguish between genomic regions with high and low transcriptional activity. Moreover, model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-eQTL target genes, with complementary findings across methods. Code to reproduce the analysis is available at https://github.com/michoel-lab/Reverse-Pred-GWAS
Submitted 9 February, 2022; v1 submitted 29 October, 2021;
originally announced November 2021.
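The reverse-regression idea in the abstract above (regress one variant's genotype on all traits at once, then read off prediction error and feature coefficients) can be illustrated with a minimal sketch. This is not the authors' code, which is in the linked repository. The simulated data, the function names `simulate`, `ridge_fit` and `reverse_regression_rmse`, and the values of `alpha`, `effect` and `train_frac` are all assumptions chosen for illustration, and ridge regression is used here only as one of the four methods the abstract names.

```python
import numpy as np

def simulate(n_samples=200, n_traits=300, n_affected=20, effect=2.0, seed=0):
    """Hypothetical toy data: one binary variant shifts the first n_affected traits."""
    rng = np.random.default_rng(seed)
    genotype = rng.integers(0, 2, size=n_samples).astype(float)
    traits = rng.normal(size=(n_samples, n_traits))
    traits[:, :n_affected] += effect * genotype[:, None]
    # A second variant, drawn independently of all traits, serves as a null.
    null_genotype = rng.integers(0, 2, size=n_samples).astype(float)
    return traits, genotype, null_genotype

def ridge_fit(X, y, alpha):
    """Closed-form ridge weights for centred X and y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def reverse_regression_rmse(traits, genotype, alpha=10.0, train_frac=0.8, seed=0):
    """Regress genotype on all traits; return held-out RMSE and trait coefficients."""
    rng = np.random.default_rng(seed)
    n = traits.shape[0]
    perm = rng.permutation(n)
    n_train = int(train_frac * n)
    train, test = perm[:n_train], perm[n_train:]
    # Centre with training-set means only, to avoid leaking test data.
    x_mean = traits[train].mean(axis=0)
    y_mean = genotype[train].mean()
    w = ridge_fit(traits[train] - x_mean, genotype[train] - y_mean, alpha)
    pred = (traits[test] - x_mean) @ w + y_mean
    rmse = float(np.sqrt(np.mean((pred - genotype[test]) ** 2)))
    return rmse, w

if __name__ == "__main__":
    traits, g, g_null = simulate()
    rmse_signal, w = reverse_regression_rmse(traits, g)
    rmse_null, _ = reverse_regression_rmse(traits, g_null)
    print("RMSE, variant with trait effects:", rmse_signal)
    print("RMSE, null variant:", rmse_null)
    print("mean |coef|, affected traits:", np.abs(w[:20]).mean())
    print("mean |coef|, other traits:", np.abs(w[20:]).mean())
```

In this toy setting the number of traits (300) exceeds the number of samples (200), which is the high-dimensional regime the abstract refers to; the ridge penalty keeps the normal equations solvable there.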
-
Restricted maximum-likelihood method for learning latent variance components in gene expression data with known and unknown confounders
Authors:
Muhammad Ammar Malik,
Tom Michoel
Abstract:
Random effect models are popular statistical models for detecting and correcting spurious sample correlations due to hidden confounders in genome-wide gene expression data. In applications where some confounding factors are known, estimating simultaneously the contribution of known and latent variance components in random effect models is a challenge that has so far relied on numerical gradient-based optimizers to maximize the likelihood function. This is unsatisfactory because the resulting solution is poorly characterized and the efficiency of the method may be suboptimal. Here we prove analytically that maximum-likelihood latent variables can always be chosen orthogonal to the known confounding factors, in other words, that maximum-likelihood latent variables explain sample covariances not already explained by known factors. Based on this result we propose a restricted maximum-likelihood method which estimates the latent variables by maximizing the likelihood on the restricted subspace orthogonal to the known confounding factors, and show that this reduces to probabilistic PCA on that subspace. The method then estimates the variance-covariance parameters by maximizing the remaining terms in the likelihood function given the latent variables, using a newly derived analytic solution for this problem. Compared to gradient-based optimizers, our method attains greater or equal likelihood values, can be computed using standard matrix operations, results in latent factors that don't overlap with any known factors, and has a runtime reduced by several orders of magnitude. Hence the restricted maximum-likelihood method facilitates the application of random effect modelling strategies for learning latent variance components to much larger gene expression datasets than possible with current methods.
Submitted 4 November, 2021; v1 submitted 6 May, 2020;
originally announced May 2020.
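The first step described in the abstract above, estimating latent variables on the subspace orthogonal to the known confounders via probabilistic PCA, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `latent_factors_orthogonal_to_known`, the choice of sample covariance, and the assumption that `Z` has full column rank are all introduced here. The second step of the method (the analytic solution for the variance-covariance parameters) is not reproduced.

```python
import numpy as np

def latent_factors_orthogonal_to_known(Y, Z, n_latent):
    """
    Y: (n_samples, n_genes) expression matrix.
    Z: (n_samples, n_known) known confounders, assumed full column rank.
    Returns orthonormal latent factors X (n_samples, n_latent) with Z^T X = 0,
    the retained eigenvalues, and the probabilistic-PCA noise variance estimate.
    """
    n_known = Z.shape[1]
    # Centre each gene and form the sample-by-sample covariance matrix.
    Yc = Y - Y.mean(axis=0)
    C = Yc @ Yc.T / Y.shape[1]
    # Complete QR: the first n_known columns of Q span col(Z),
    # the remaining columns span its orthogonal complement.
    Q, _ = np.linalg.qr(Z, mode="complete")
    V = Q[:, n_known:]
    # Restrict the covariance to the complement and diagonalise it.
    evals, evecs = np.linalg.eigh(V.T @ C @ V)
    order = np.argsort(evals)[::-1]          # eigh returns ascending order
    evals, evecs = evals[order], evecs[:, order]
    # PPCA maximum likelihood: leading eigenvectors give the latent directions,
    # and the noise variance is the mean of the discarded eigenvalues.
    X = V @ evecs[:, :n_latent]
    sigma2 = float(evals[n_latent:].mean())
    return X, evals[:n_latent], sigma2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.normal(size=(30, 100))           # hypothetical toy data
    Z = rng.normal(size=(30, 3))
    X, evals, sigma2 = latent_factors_orthogonal_to_known(Y, Z, n_latent=4)
    print("max |Z^T X|:", np.abs(Z.T @ X).max())
```

Because the factors are built from a basis of the orthogonal complement of the columns of `Z`, they cannot overlap with the known confounders, which mirrors the property the abstract states for the maximum-likelihood latent variables.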
-
On Lindley-Exponential Distribution: Properties and Application
Authors:
Deepesh Bhati,
Mohd. Aamir Malik
Abstract:
In this paper, we introduce a new distribution generated by a Lindley random variable, which offers a more flexible model for lifetime data. Various statistical properties, such as the distribution function, survival function, moments, entropy, and limiting distribution of extreme order statistics, are established. Inference for a random sample from the proposed distribution is investigated, and the maximum likelihood method is used to estimate the parameters of this distribution. The applicability of the proposed distribution is shown through real data sets.
Submitted 11 June, 2014;
originally announced June 2014.