Search | arXiv e-print repository

Multivariate regression with missing response data for modelling regional DNA methylation QTLs

Authors: Shomoita Alam, Yixiao Zeng, Sasha Bernatsky, Marie Hudson, Inés Colmegna, David A. Stephens, Celia M. T. Greenwood, Archer Y. Yang

Abstract: Identifying genetic regulators of DNA methylation (mQTLs) with multivariate models enhances statistical power, but is challenged by missing data from bisulfite sequencing. Standard imputation-based methods can introduce bias, limiting reliable inference. We propose missoNet, a novel convex estimation framework that jointly estimates regression coefficients and the precision matrix from data with missing responses. By using unbiased surrogate estimators, our three-stage procedure avoids imputation while simultaneously performing variable selection and learning the conditional dependence structure among responses. We establish theoretical error bounds, and our simulations demonstrate that missoNet consistently outperforms existing methods in both prediction and sparsity recovery. In a real-world mQTL analysis of the CARTaGENE cohort, missoNet achieved superior predictive accuracy and false-discovery control on a held-out validation set, identifying known and credible novel genetic associations. The method offers a robust, efficient, and theoretically grounded tool for genomic analyses, and is available as an R package.

Submitted 8 July, 2025; originally announced July 2025.

arXiv:2410.10082 [pdf, other]

fastHDMI: Fast Mutual Information Estimation for High-Dimensional Data

Authors: Kai Yang, Masoud Asgharian, Nikhil Bhagwat, Jean-Baptiste Poline, Celia M. T. Greenwood

Abstract: In this paper, we introduce fastHDMI, a Python package designed for efficient variable screening in high-dimensional datasets, particularly neuroimaging data. This work pioneers the application of three mutual information estimation methods for neuroimaging variable selection, a novel approach implemented via fastHDMI. These advancements enhance our ability to analyze the complex structures of neuroimaging datasets, providing improved tools for variable selection in high-dimensional spaces. Using the preprocessed ABIDE dataset, we evaluate the performance of these methods through extensive simulations. The tests cover a range of conditions, including linear and nonlinear associations, as well as continuous and binary outcomes. Our results highlight the superiority of the FFTKDE-based mutual information estimation for feature screening in continuous nonlinear outcomes, while binning-based methods outperform others for binary outcomes with nonlinear probability preimages. For linear simulations, both Pearson correlation and FFTKDE-based methods show comparable performance for continuous outcomes, while Pearson excels in binary outcomes with linear probability preimages. A comprehensive case study using the ABIDE dataset further demonstrates fastHDMI's practical utility, showcasing the predictive power of models built from variables selected using our screening techniques. This research affirms the computational efficiency and methodological strength of fastHDMI, significantly enriching the toolkit available for neuroimaging analysis.

Submitted 13 October, 2024; originally announced October 2024.

Comments: 31 pages, 5 figures
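
The binning-based screening idea mentioned in the abstract above can be illustrated with a small sketch. This is not the fastHDMI API: it is a plain NumPy plug-in estimator, and the function names `binned_mutual_information` and `screen_by_mi`, the equal-width binning, and the default of 10 bins are all assumptions made for illustration.

```python
import numpy as np


def binned_mutual_information(x, y, bins=10):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram.

    Hypothetical helper for illustration; equal-width bins are an assumption.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())


def screen_by_mi(X, y, top_k, bins=10):
    """Rank the columns of X by estimated mutual information with y.

    Returns the indices of the top_k columns (best first) and all scores.
    """
    scores = np.array(
        [binned_mutual_information(X[:, j], y, bins) for j in range(X.shape[1])]
    )
    order = np.argsort(scores)[::-1][:top_k]
    return order, scores
```

With an outcome such as `y = X[:, 4] ** 2 + noise`, the Pearson correlation with column 4 is close to zero while its mutual information score is large, which is the kind of nonlinear association the abstract says mutual information screening is meant to pick up.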

arXiv:2105.12286 [pdf, other]

An algorithm-based multiple detection influence measure for high dimensional regression using expectile

Authors: Amadou Barry, Nikhil Bhagwat, Bratislav Misic, Jean-Baptiste Poline, Celia M. T. Greenwood

Abstract: The identification of influential observations is an important part of data analysis that can prevent erroneous conclusions drawn from biased estimators. However, in high dimensional data, this identification is challenging. Classical and recently-developed methods often perform poorly when there are multiple influential observations in the same dataset. In particular, current methods can fail when there is masking, where several influential observations have similar characteristics, or swamping, where the influential observations are near the boundary of the space spanned by well-behaved observations. Therefore, we propose an algorithm-based, multi-step, multiple detection procedure to identify influential observations that addresses current limitations. Our three-step algorithm to identify and capture undesirable variability in the data, asymMIP, is based on two complementary statistics, inspired by asymmetric correlations, and built on expectiles. Simulations demonstrate higher detection power than competing methods. Use of the resulting asymptotic distribution leads to detection of influential observations without the need for computationally demanding procedures such as the bootstrap. The application of our method to the Autism Brain Imaging Data Exchange neuroimaging dataset resulted in a more balanced and accurate prediction of brain maturity based on cortical thickness. See our GitHub for a free R package that implements our algorithm: asymMIP (github.com/AmBarry/hidetify).

Submitted 25 May, 2021; originally announced May 2021.

Comments: 38 pages, 11 figures

arXiv:2101.07374 [pdf, other]

Detecting differentially methylated regions in bisulfite sequencing data using quasi-binomial mixed models with smooth covariate effect estimates

Authors: Kaiqiong Zhao, Karim Oualkacha, Lajmi Lakhal-Chaieb, Aurélie Labbe, Kathleen Klein, Sasha Bernatsky, Marie Hudson, Inés Colmegna, Celia M. T. Greenwood

Abstract: Identifying disease-associated changes in DNA methylation can help to gain a better understanding of disease etiology. Bisulfite sequencing technology allows the generation of methylation profiles at single-base resolution. We previously developed a method for estimating smooth covariate effects and identifying differentially methylated regions (DMRs) from bisulfite sequencing data, which copes with experimental errors and variable read depths; this method utilizes the binomial distribution to characterize the variability in the methylated counts. However, bisulfite sequencing data frequently include low-count integers and can exhibit over- or under-dispersion relative to the binomial distribution. We present a substantial improvement to our previous work by proposing a quasi-likelihood-based regional testing approach which accounts for multiplicative and additive sources of dispersion. We demonstrate the theoretical properties of the resulting tests, as well as their marginal and conditional interpretations. Simulations show that the proposed method provides correct inference for smooth covariate effects and captures the major methylation patterns with excellent power.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:1811.07356 [pdf, other]

A Tracy-Widom Empirical Estimator For Valid P-values With High-Dimensional Datasets

Authors: Maxime Turgeon, Celia MT Greenwood, Aurelie Labbe

Abstract: Recent technological advances in many domains including both genomics and brain imaging have led to an abundance of high-dimensional and correlated data being routinely collected. Classical multivariate approaches like Multivariate Analysis of Variance (MANOVA) and Canonical Correlation Analysis (CCA) can be used to study relationships between such multivariate datasets. Yet, special care is required with high-dimensional data, as the test statistics may be ill-defined and classical inference procedures break down. In this work, we explain how valid p-values can be derived for these multivariate methods even in high dimensional datasets. Our main contribution is an empirical estimator for the largest root distribution of a singular double Wishart problem; this general framework underlies many common multivariate analysis approaches. From a small number of permutations of the data, we estimate the location and scale parameters of a parametric Tracy-Widom family that provides a good approximation of this distribution. Through simulations, we show that this estimated distribution also leads to valid p-values that can be used for high-dimensional inference. We then apply our approach to a pathway-based analysis of the association between DNA methylation and disease type in patients with systemic auto-immune rheumatic diseases.

Submitted 18 November, 2018; originally announced November 2018.

arXiv:1712.04058 [pdf]

Distinguishing differential susceptibility, diathesis-stress and vantage sensitivity: beyond the single gene and environment model

Authors: Alexia Jolicoeur-Martineau, Jay Belsky, Eszter Szekely, Keith F. Widaman, Michael Pluess, Celia Greenwood, Ashley Wazana

Abstract: Currently, two main approaches exist to distinguish differential susceptibility from diathesis-stress and vantage sensitivity in genotype x environment interaction (GxE) research: Regions of significance (RoS) and competitive-confirmatory approaches. Each is limited by its single-gene/single-environment focus, given that most phenotypes are the product of multiple interacting genetic and environmental factors. We thus addressed these two concerns in a recently developed R package (LEGIT) for constructing GxE interaction models with latent genetic and environmental scores using alternating optimization. Herein we test, by means of computer simulation, diverse GxE models in the context of both single and multiple genes and environments. Results indicate that the RoS and competitive-confirmatory approaches were highly accurate when the sample size was large, whereas the latter performed better in small samples and for small effect sizes. The confirmatory approach generally had good accuracy (a) when effect size was moderate and N >= 500 and (b) when effect size was large and N >= 250, whereas RoS performed poorly. Computational tools to determine the type of GxE of multiple genes and environments are provided as extensions in our LEGIT R package.

Submitted 21 August, 2018; v1 submitted 11 December, 2017; originally announced December 2017.

arXiv:1703.08111 [pdf]

Alternating optimization for GxE modelling with weighted genetic and environmental scores: examples from the MAVAN study

Authors: Alexia Jolicoeur-Martineau, Ashley Wazana, Eszter Szekely, Meir Steiner, Alison S. Fleming, James L. Kennedy, Michael J. Meaney, Celia M. T. Greenwood

Abstract: Motivated by the goal of expanding currently existing genotype x environment interaction (GxE) models to simultaneously include multiple genetic variants and environmental exposures in a parsimonious way, we developed a novel method to estimate the parameters in a GxE model, where G is a weighted sum of genetic variants (genetic score) and E is a weighted sum of environments (environmental score). The approach uses alternating optimization to estimate the parameters of the GxE model. This is an iterative process where the genetic score weights, the environmental score weights, and the main model parameters are estimated in turn assuming the other parameters to be constant. This technique can be used to construct relatively complex interaction models that are constrained to a particular structure, and hence contain fewer parameters. We present the model as a two-way interaction longitudinal mixed model, for which ordinary linear regression is a special case, but it can easily be extended to be compatible with k-way interaction models and generalized linear mixed models. The model is implemented in R (LEGIT package) and using SAS macros (LEGIT_SAS). Here we present examples from the Maternal Adversity, Vulnerability, and Neurodevelopment (MAVAN) study where we improve significantly upon already existing models using alternating optimization. Furthermore, through simulations, we demonstrate the power and validity of this approach even with small sample sizes.

Submitted 31 August, 2017; v1 submitted 23 March, 2017; originally announced March 2017.
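
The alternating scheme described in the abstract above (estimate the main model parameters, the genetic score weights, and the environmental score weights in turn, holding the others fixed) can be sketched for the ordinary linear regression special case. This is a minimal illustration and not the LEGIT implementation: the function name `fit_gxe_alternating`, the model form y = b0 + b1*G + b2*E + b3*G*E, the equal-weight starting values, and the rescaling of each weight vector to unit absolute sum are assumptions.

```python
import numpy as np


def fit_gxe_alternating(y, genes, envs, n_iter=50):
    """Alternating least squares for y = b0 + b1*G + b2*E + b3*G*E,
    with G = genes @ a and E = envs @ c.

    Hypothetical sketch: each block (b, a, c) is refit by least squares
    while the other two are held constant.
    """
    n, p = genes.shape
    q = envs.shape[1]
    a = np.full(p, 1.0 / p)   # genetic score weights (assumed equal start)
    c = np.full(q, 1.0 / q)   # environmental score weights (assumed equal start)
    rss_path = []
    for _ in range(n_iter):
        G, E = genes @ a, envs @ c
        # Step 1: main model parameters b, given both scores.
        X = np.column_stack([np.ones(n), G, E, G * E])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Step 2: genetic weights a, given b and E.
        # y - b0 - b2*E = sum_j a_j * genes_j * (b1 + b3*E)
        a, *_ = np.linalg.lstsq(
            genes * (b[1] + b[3] * E)[:, None], y - b[0] - b[2] * E, rcond=None
        )
        s = np.abs(a).sum()
        # Rescale a to unit absolute sum; compensate in b so the fit is unchanged.
        a, b[1], b[3] = a / s, b[1] * s, b[3] * s
        G = genes @ a
        # Step 3: environmental weights c, given b and G.
        c, *_ = np.linalg.lstsq(
            envs * (b[2] + b[3] * G)[:, None], y - b[0] - b[1] * G, rcond=None
        )
        s = np.abs(c).sum()
        c, b[2], b[3] = c / s, b[2] * s, b[3] * s
        E = envs @ c
        resid = y - (b[0] + b[1] * G + b[2] * E + b[3] * G * E)
        rss_path.append(float(resid @ resid))
    return a, c, b, rss_path
```

Because each step is an exact least-squares fit over one block of parameters and the rescaling leaves the fitted values unchanged, the residual sum of squares in `rss_path` should not increase from one iteration to the next. The sketch makes no claim about reaching a global optimum, and the longitudinal mixed model described in the abstract is outside its scope.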

Showing 1–7 of 7 results for author: Greenwood, C