-
Assumption-Lean Post-Integrated Inference with Negative Control Outcomes
Authors:
Jin-Hong Du,
Kathryn Roeder,
Larry Wasserman
Abstract:
Data integration methods aim to extract low-dimensional embeddings from high-dimensional outcomes to remove unwanted variations, such as batch effects and unmeasured covariates, across heterogeneous datasets. However, multiple hypothesis testing after integration can be biased due to data-dependent processes. We introduce a robust post-integrated inference (PII) method that adjusts for latent heterogeneity using negative control outcomes. Leveraging causal interpretations, we derive nonparametric identifiability of the direct effects, which motivates our semiparametric inference method. Our method extends to projected direct effect estimands, accounting for hidden mediators, confounders, and moderators. These estimands remain statistically meaningful under model misspecifications and with error-prone embeddings. We provide bias quantifications and finite-sample linear expansions with uniform concentration bounds. The proposed doubly robust estimators are consistent and efficient under minimal assumptions and potential misspecification, facilitating data-adaptive estimation with machine learning algorithms. Our proposal is evaluated with random forests through simulations and analysis of single-cell CRISPR perturbed datasets with potential unmeasured confounders.
Submitted 24 November, 2024; v1 submitted 7 October, 2024;
originally announced October 2024.
-
Simultaneous inference for generalized linear models with unmeasured confounders
Authors:
Jin-Hong Du,
Larry Wasserman,
Kathryn Roeder
Abstract:
Tens of thousands of simultaneous hypothesis tests are routinely performed in genomic studies to identify differentially expressed genes. However, due to unmeasured confounders, many standard statistical approaches may be substantially biased. This paper investigates the large-scale hypothesis testing problem for multivariate generalized linear models in the presence of confounding effects. Under arbitrary confounding mechanisms, we propose a unified statistical estimation and inference framework that harnesses orthogonal structures and integrates linear projections into three key stages. It begins by disentangling marginal and uncorrelated confounding effects to recover the latent coefficients. Subsequently, latent factors and primary effects are jointly estimated through lasso-type optimization. Finally, we incorporate projected and weighted bias-correction steps for hypothesis testing. Theoretically, we establish the identification conditions of various effects and non-asymptotic error bounds. We show effective Type-I error control of asymptotic $z$-tests as sample and response sizes approach infinity. Numerical experiments demonstrate that the proposed method controls the false discovery rate by the Benjamini-Hochberg procedure and is more powerful than alternative methods. By comparing single-cell RNA-seq counts from two groups of samples, we demonstrate the suitability of adjusting confounding effects when significant covariates are absent from the model.
Submitted 15 March, 2025; v1 submitted 13 September, 2023;
originally announced September 2023.
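Note: the abstract above mentions asymptotic $z$-tests followed by the Benjamini-Hochberg procedure for false discovery rate control. As a rough illustration of that final step only (not the authors' estimation or bias-correction pipeline), the following minimal sketch converts z-statistics to two-sided p-values and applies the standard Benjamini-Hochberg step-up rule. The function names `two_sided_p` and `benjamini_hochberg` and the input values are hypothetical, chosen for this example.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard step-up rule: find the largest rank k such that the k-th
    smallest p-value is <= alpha * k / m, then reject the k smallest.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    reject = [False] * m
    for idx in order[:k]:
        reject[idx] = True
    return reject

# Hypothetical z-statistics, for illustration only.
z_stats = [4.1, 0.3, -3.2, 1.1, -0.5]
p_vals = [two_sided_p(z) for z in z_stats]
rejected = benjamini_hochberg(p_vals, alpha=0.05)
```

Because the rule is step-up, a hypothesis whose p-value misses its own threshold can still be rejected when a larger p-value passes a later threshold.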
-
Energy landscapes and dynamics of xylo-nucleic acids
Authors:
Daniel J. Sharpe,
Konstantin Roeder,
David J. Wales
Abstract:
Artificial analogues of the natural nucleic acids have attracted recent interest as a diverse class of information storage molecules capable of self-replication. In the present study, we use the computational potential energy landscape framework to investigate the structural and dynamical properties of xylo- and deoxyxylo-nucleic acids (XyNA and dXyNA), which are derived from their respective RNA and DNA analogues by an inversion of configuration at a single chiral center in the sugar moiety of the nucleotide unit. The free energy landscapes of an octameric XyNA sequence and its dXyNA analogue demonstrate the existence of a facile conformational transition between a left-handed helix that is the global free energy minimum, and a closely competing ladder-type structure with approximately zero helicity. The separation of the competing conformational ensembles is better-defined for the dXyNA system, whereas the XyNA analogue is inherently more flexible. The former therefore appear more suitable candidates for a molecular switch. The landscapes differ qualitatively from those reported in previous studies for evolved biomolecules: they are significantly more frustrated, so that XyNAs provide an example of an unnatural system for which the conditions constituting the principle of minimal frustration are, as may be expected, violated.
Submitted 3 June, 2019;
originally announced June 2019.
-
Improving power in genome-wide association studies: weights tip the scale
Authors:
Kathryn Roeder,
Bernie Devlin,
Larry Wasserman
Abstract:
Genome-wide association analysis has generated much discussion about how to preserve power to detect signals despite the detrimental effect of multiple testing on power. We develop a weighted multiple testing procedure that facilitates the input of prior information in the form of groupings of tests. For each group a weight is estimated from the observed test statistics within the group. Differentially weighting groups improves the power to detect signals in likely groupings. The advantage of the grouped-weighting concept, over fixed weights based on prior information, is that it often leads to an increase in power even if many of the groupings are not correlated with the signal. Being data dependent, the procedure is remarkably robust to poor choices in groupings. Power is typically improved if one (or more) of the groups clusters multiple tests with signals, yet little power is lost when the groupings are totally random. If there is no apparent signal in a group, relative to a group that appears to have several tests with signals, the former group will be down-weighted relative to the latter. If no groups show apparent signals, then the weights will be approximately equal. The only restriction on the procedure is that the number of groups be small, relative to the total number of tests performed.
Submitted 3 January, 2007;
originally announced January 2007.