-
Prediction of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated instrumental variables
Authors:
Mariyam Khan,
Adriaan-Alexander Ludl,
Sean Bankier,
Johan Bjorkegren,
Tom Michoel
Abstract:
Multivariate Mendelian randomization (MVMR) is a statistical technique that uses sets of genetic instruments to estimate the direct causal effects of multiple exposures on an outcome of interest. At genomic loci with pleiotropic gene regulatory effects, that is, loci where the same genetic variants are associated with multiple nearby genes, MVMR can potentially be used to predict candidate causal genes. However, consensus in the field dictates that the genetic instruments in MVMR must be independent, which is usually not possible when considering a group of candidate genes from the same locus. We used causal inference theory to show that MVMR with correlated instruments satisfies the instrumental set condition. This is a classical result by Brito and Pearl (2002) for structural equation models that guarantees the identifiability of causal effects in situations where multiple exposures collectively, but not individually, separate a set of instrumental variables from an outcome variable. Extensive simulations confirmed the validity and usefulness of these theoretical results even at modest sample sizes. Importantly, the causal effect estimates remain unbiased and their variance small when instruments are highly correlated. We applied MVMR with correlated instrumental variable sets at risk loci from genome-wide association studies (GWAS) for coronary artery disease using eQTL data from the STARNET study. Our method predicts causal genes at twelve loci, each associated with multiple colocated genes in multiple tissues. However, the extensive degree of regulatory pleiotropy across tissues and the limited number of causal variants in each locus still require that MVMR be run on a tissue-by-tissue basis; testing all gene-tissue pairs at a given locus in a single model to predict causal gene-tissue combinations remains infeasible.
Submitted 20 September, 2024; v1 submitted 11 January, 2024;
originally announced January 2024.
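The idea in the abstract above, that several exposures can be estimated jointly from a set of correlated instruments, can be illustrated with a small simulation. This is a minimal sketch using generic two-stage least squares on synthetic data, not the authors' estimator or code. The sample size, effect sizes, instrument correlation of 0.8, and the use of Gaussian variables as stand-ins for genotypes are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20_000, 4  # samples and instruments (hypothetical sizes)

# Correlated instruments: pairwise correlation 0.8 (Gaussian stand-ins for genotypes).
sigma = np.full((k, k), 0.8) + 0.2 * np.eye(k)
G = rng.standard_normal((n, k)) @ np.linalg.cholesky(sigma).T

# Instrument-to-exposure effects (assumed values); the matrix has rank 2,
# so the two exposures are jointly identifiable from the instrument set.
A = np.array([[1.0, 0.2], [0.5, 0.8], [0.1, 1.0], [0.7, 0.3]])
U = rng.standard_normal(n)  # unobserved confounder of exposures and outcome
X = G @ A + U[:, None] + rng.standard_normal((n, 2))

beta_true = np.array([0.5, 0.0])  # only the first exposure is causal
y = X @ beta_true + 2.0 * U + rng.standard_normal(n)


def center(a):
    """Subtract column means so that no intercept is needed."""
    return a - a.mean(axis=0)


def fit(design, response):
    """Ordinary least-squares coefficients."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef


G, X, y = center(G), center(X), center(y)

# Stage 1: predict both exposures from all instruments jointly.
X_hat = G @ fit(G, X)
# Stage 2: regress the outcome on the predicted exposures.
beta_2sls = fit(X_hat, y)
# Naive regression on the observed exposures, biased by the confounder.
beta_ols = fit(X, y)

print("true:", beta_true)
print("2SLS:", beta_2sls.round(3))
print("OLS: ", beta_ols.round(3))
```

In this toy setting the two-stage estimate should recover the causal effect of the first exposure and a near-zero effect for the second, whereas the naive regression is pulled away from the truth by the confounder.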
-
High-dimensional multi-trait GWAS by reverse prediction of genotypes
Authors:
Muhammad Ammar Malik,
Adriaan-Alexander Ludl,
Tom Michoel
Abstract:
Multi-trait genome-wide association studies (GWAS) use multivariate statistical methods to identify associations between genetic variants and multiple correlated traits simultaneously, and have higher statistical power than independent univariate analyses of traits. Reverse regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a promising approach to perform multi-trait GWAS in high-dimensional settings where the number of traits exceeds the number of samples. We analyzed different machine learning methods (ridge regression, naive Bayes/independent univariate, random forests and support vector machines) for reverse regression in multi-trait GWAS, using genotypes, gene expression data and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains to evaluate methods. We found that genotype prediction performance, in terms of root mean squared error (RMSE), made it possible to distinguish between genomic regions with high and low transcriptional activity. Moreover, model feature coefficients correlated with the strength of association between variants and individual traits, and were predictive of true trans-eQTL target genes, with complementary findings across methods. Code to reproduce the analysis is available at https://github.com/michoel-lab/Reverse-Pred-GWAS
Submitted 9 February, 2022; v1 submitted 29 October, 2021;
originally announced November 2021.
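Reverse regression as described in the abstract above can be sketched with ridge regression on synthetic data in which traits outnumber samples. This is an illustrative toy example and not the code from the linked repository. The dimensions, the effect size, the penalty `lam`, and the choice of the first ten traits as true targets are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_traits, n_targets = 100, 500, 10  # more traits than samples

# One binary variant; only the first n_targets traits respond to it (assumed).
genotype = rng.integers(0, 2, size=n_samples).astype(float)
traits = rng.standard_normal((n_samples, n_traits))
traits[:, :n_targets] += 3.0 * genotype[:, None]

# Center the genotype and every trait so that no intercept is needed.
g = genotype - genotype.mean()
T = traits - traits.mean(axis=0)

# Ridge regression of the genotype on all traits at once, solved in its dual
# form so that only an n_samples x n_samples system has to be inverted.
lam = 500.0  # ridge penalty (hypothetical value)
alpha = np.linalg.solve(T @ T.T + lam * np.eye(n_samples), g)
weights = T.T @ alpha  # one coefficient per trait

# Traits with the largest absolute coefficients are the candidate targets.
top = np.argsort(-np.abs(weights))[:n_targets]
# In-sample RMSE of the predicted genotype (a held-out estimate would be
# preferable in a real analysis).
rmse = np.sqrt(np.mean((T @ weights - g) ** 2))

print("top-ranked traits:", sorted(top.tolist()))
print("in-sample RMSE:", round(float(rmse), 3))
```

The dual form is used here because the system it solves scales with the number of samples rather than the number of traits, which suits the high-dimensional setting the abstract describes.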
-
Comparison between instrumental variable and mediation-based methods for reconstructing causal gene networks in yeast
Authors:
Adriaan-Alexander Ludl,
Tom Michoel
Abstract:
Causal gene networks model the flow of information within a cell, but reconstructing them from omics data is challenging because correlation does not imply causation. Combining genomics and transcriptomics data from a segregating population makes it possible to orient the direction of causality between gene expression traits using genomic variants. Instrumental-variable (IV) methods use a local expression quantitative trait locus (eQTL) as a randomized instrument for a gene's expression level, and assign target genes based on distal eQTL associations. Mediation-based (ME) methods additionally require that distal eQTL associations are mediated by the source gene. Here we used Findr, a software providing uniform implementations of IV, ME, and coexpression-based methods, a recent dataset of 1,012 segregants from a cross between two budding yeast strains, and the YEASTRACT database of known transcriptional interactions to compare causal gene network inference methods. We found that causal inference methods result in a significant overlap with the ground-truth, whereas coexpression did not perform better than random. A subsampling analysis revealed that the performance of ME decreases at large sample sizes, due to a loss of sensitivity when residual correlations become significant. IV methods contain false positive predictions, due to genomic linkage between eQTL instruments. IV and ME methods also have complementary roles for identifying causal genes underlying transcriptional hotspots. IV methods correctly predicted STB5 targets for a hotspot centred on the transcription factor STB5, whereas ME failed due to Stb5p regulating its own expression. ME suggests a new candidate gene, DNM1, for a hotspot on Chr XII, where IV methods could not distinguish between multiple genes located within the hotspot.
Submitted 18 November, 2020; v1 submitted 14 October, 2020;
originally announced October 2020.
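The contrast between IV and ME methods in the abstract above can be illustrated with a toy simulation of one eQTL `E`, a source gene `A`, and a target gene `B`. This is a sketch under stated assumptions and does not use or reproduce Findr's tests: the IV criterion is approximated by the correlation between `E` and `B`, the mediation criterion by their partial correlation given `A`, and the effect sizes and hidden confounder `H` are hypothetical.

```python
import numpy as np


def simulate(n, confounding, rng):
    """Simulate E -> A -> B with an optional hidden confounder H of A and B."""
    E = rng.integers(0, 2, size=n).astype(float)  # haploid genotype, 0 or 1
    H = rng.standard_normal(n)                    # hidden confounder
    A = 1.0 * E + confounding * H + rng.standard_normal(n)
    B = 0.5 * A + confounding * H + rng.standard_normal(n)
    return E, A, B


def residual(v, given):
    """Residual of v after linear regression on `given` with an intercept."""
    design = np.column_stack([np.ones_like(given), given])
    coef, *_ = np.linalg.lstsq(design, v, rcond=None)
    return v - design @ coef


def iv_and_mediation_stats(E, A, B):
    # IV-style evidence: the eQTL of A is also associated with B.
    iv = np.corrcoef(E, B)[0, 1]
    # Mediation-style check: E and B should be uncorrelated once A is
    # regressed out; a clearly non-zero value rejects the mediation model.
    me = np.corrcoef(residual(E, A), residual(B, A))[0, 1]
    return iv, me


rng = np.random.default_rng(2)
results = {
    strength: iv_and_mediation_stats(*simulate(50_000, strength, rng))
    for strength in (0.0, 1.0)
}
for strength, (iv, me) in results.items():
    print(f"confounding={strength}: corr(E,B)={iv:.3f}, "
          f"partial corr(E,B|A)={me:.3f}")
```

Without confounding, the partial correlation is close to zero and both criteria accept the true edge from `A` to `B`. With hidden confounding, conditioning on `A` induces a residual correlation between `E` and `B`, which becomes statistically detectable at large sample sizes and leads a mediation test to reject a true interaction, consistent with the loss of sensitivity the abstract reports.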