Search | arXiv e-print repository

The permuted score test for robust differential expression analysis

Authors: Timothy Barry, Ziang Niu, Eugene Katsevich, Xihong Lin

Abstract: Negative binomial (NB) regression is a popular method for identifying differentially expressed genes in genomics data, such as bulk and single-cell RNA sequencing data. However, NB regression makes stringent parametric and asymptotic assumptions, which can fail to hold in practice, leading to excess false positive and false negative results. We propose the permuted score test, a new strategy for robust regression based on permuting score test statistics. The permuted score test provably controls type-I error across a much broader range of settings than standard NB regression while nevertheless approximately matching standard NB regression with respect to power (when the assumptions of standard NB regression obtain) and computational efficiency. We accelerate the permuted score test by leveraging emerging techniques for sequential Monte-Carlo testing and novel algorithms for efficiently computing GLM score tests. We apply the permuted score test to real and simulated RNA sequencing data, finding that it substantially improves upon the error control of existing NB regression implementations, including DESeq2. The permuted score test could enhance the reliability of differential expression analysis across diverse biological contexts.

Submitted 6 January, 2025; originally announced January 2025.

arXiv:2409.09512 [pdf, other]

Doubly robust and computationally efficient high-dimensional variable selection

Authors: Abhinav Chakraborty, Jeffrey Zhang, Eugene Katsevich

Abstract: The variable selection problem is to discover which of a large set of predictors is associated with an outcome of interest, conditionally on the other predictors. This problem has been widely studied, but existing approaches lack either power against complex alternatives, robustness to model misspecification, computational efficiency, or quantification of evidence against individual hypotheses. We present tower PCM (tPCM), a statistically and computationally efficient solution to the variable selection problem that does not suffer from these shortcomings. tPCM adapts the best aspects of two existing procedures that are based on similar functionals: the holdout randomization test (HRT) and the projected covariance measure (PCM). The former is a model-X test that utilizes many resamples and few machine learning fits, while the latter is an asymptotic doubly-robust style test for a single hypothesis that requires no resamples and many machine learning fits. Theoretically, we demonstrate the validity of tPCM, and perhaps surprisingly, the asymptotic equivalence of HRT, PCM, and tPCM. In so doing, we clarify the relationship between two methods from two separate literatures. An extensive simulation study verifies that tPCM can have significant computational savings compared to HRT and PCM, while maintaining nearly identical power.

Submitted 14 September, 2024; originally announced September 2024.

arXiv:2407.08915 [pdf, other]

The saddlepoint approximation for averages of conditionally independent random variables

Authors: Ziang Niu, Jyotishka Ray Choudhury, Eugene Katsevich

Abstract: Motivated by the application of saddlepoint approximations to resampling-based statistical tests, we prove that the Lugannani-Rice formula has vanishing relative error when applied to approximate conditional tail probabilities of averages of conditionally independent random variables. In a departure from existing work, this result is valid under only sub-exponential assumptions on the summands, and does not require any assumptions on their smoothness or lattice structure. The derived saddlepoint approximation result can be directly applied to resampling-based hypothesis tests, including bootstrap, sign-flipping and conditional randomization tests. We exemplify this by providing the first rigorous justification of a saddlepoint approximation for the sign-flipping test of symmetry about the origin, initially proposed in 1955. On the way to our main result, we establish a conditional Berry-Esseen inequality for sums of conditionally independent random variables, which may be of independent interest.

Submitted 30 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.
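
Illustrative note: the sign-flipping test of symmetry about the origin mentioned in the abstract above can be sketched with plain Monte Carlo resampling. This is only a minimal sketch of the resampling baseline, not the saddlepoint approximation the paper analyzes; the function name `sign_flip_pvalue`, the choice of the sample mean as test statistic, the one-sided alternative, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def sign_flip_pvalue(x, n_resamples=2000, seed=0):
    """One-sided Monte Carlo sign-flipping p-value (hypothetical helper).

    Under symmetry about the origin, multiplying each observation by an
    independent random sign leaves the joint distribution unchanged, so the
    sign-flipped means form a reference distribution for the observed mean.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    t_obs = x.mean()
    # One row of random +/-1 signs per resample.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, x.size))
    t_null = (signs * x).mean(axis=1)
    # The +1 terms count the observed statistic among the resamples.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Simulated data (assumed for illustration): centered vs. shifted samples.
demo_rng = np.random.default_rng(1)
p_symmetric = sign_flip_pvalue(demo_rng.normal(0.0, 1.0, size=50))
p_shifted = sign_flip_pvalue(demo_rng.normal(1.0, 1.0, size=50))
print(p_symmetric, p_shifted)
```

The cost of this baseline grows with the number of resamples, which is the expense a closed-form tail approximation is meant to avoid.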

arXiv:2407.08911 [pdf, other]

Computationally efficient and statistically accurate conditional independence testing with spaCRT

Authors: Ziang Niu, Jyotishka Ray Choudhury, Eugene Katsevich

Abstract: We introduce the saddlepoint approximation-based conditional randomization test (spaCRT), a novel conditional independence test that effectively balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens. Resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computational cost. The spaCRT leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands. We prove that the spaCRT $p$-value approximates the dCRT $p$-value with vanishing relative error, and that these two tests are asymptotically equivalent. Through extensive simulations and real data analysis, we demonstrate that the spaCRT controls Type-I error and maintains high power, outperforming other asymptotic and resampling-based tests. Our method is particularly well-suited for large-scale single-cell CRISPR screen analyses, facilitating the efficient and accurate assessment of perturbation-gene associations.

Submitted 14 September, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

arXiv:2211.14698 [pdf, other]

Reconciling model-X and doubly robust approaches to conditional independence testing

Authors: Ziang Niu, Abhinav Chakraborty, Oliver Dukes, Eugene Katsevich

Abstract: Model-X approaches to testing conditional independence between a predictor and an outcome variable given a vector of covariates usually assume exact knowledge of the conditional distribution of the predictor given the covariates. Nevertheless, model-X methodologies are often deployed with this conditional distribution learned in sample. We investigate the consequences of this choice through the lens of the distilled conditional randomization test (dCRT). We find that Type-I error control is still possible, but only if the mean of the outcome variable given the covariates is estimated well enough. This demonstrates that the dCRT is doubly robust, and motivates a comparison to the generalized covariance measure (GCM) test, another doubly robust conditional independence test. We prove that these two tests are asymptotically equivalent, and show that the GCM test is optimal against (generalized) partially linear alternatives by leveraging semiparametric efficiency theory. In an extensive simulation study, we compare the dCRT to the GCM test. These two tests have broadly similar Type-I error and power, though dCRT can have somewhat better Type-I error control but somewhat worse power in small samples or when the response is discrete. We also find that post-lasso based test statistics (as compared to lasso based statistics) can dramatically improve Type-I error control for both methods.

Submitted 8 February, 2023; v1 submitted 26 November, 2022; originally announced November 2022.

arXiv:2201.01879 [pdf, other]

Exponential family measurement error models for single-cell CRISPR screens

Authors: Timothy Barry, Kathryn Roeder, Eugene Katsevich

Abstract: CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present substantial statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens -- "thresholded regression" -- exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV ("GLM-based errors-in-variables"), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g., Microsoft Azure) and high-performance clusters. Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several novel insights.

Submitted 12 March, 2024; v1 submitted 5 January, 2022; originally announced January 2022.

arXiv:2102.11253 [pdf, other]

Large-scale simultaneous inference under dependence

Authors: Jinjin Tian, Xu Chen, Eugene Katsevich, Jelle Goeman, Aaditya Ramdas

Abstract: Simultaneous inference allows for the exploration of data while deciding on criteria for proclaiming discoveries. It was recently proved that all admissible post-hoc inference methods for true discoveries must employ closed testing. In this paper, we investigate efficient closed testing with local tests of a special form: thresholding a function of sums of test scores for the individual hypotheses. Under this special design, we propose a new statistic that quantifies the cost of multiplicity adjustments, and we develop fast (mostly linear-time) algorithms for post-hoc inference. Paired with recent advances in global null tests based on generalized means, our work instantiates a series of simultaneous inference methods that can handle many dependence structures and signal compositions. We provide guidance on the method choices via theoretical investigation of the conservativeness and sensitivity for different local tests, as well as simulations that find analogous behavior for local tests and full closed testing.

Submitted 22 March, 2022; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: 41 pages

arXiv:2006.08482

The leave-one-covariate-out conditional randomization test

Authors: Eugene Katsevich, Aaditya Ramdas

Abstract: Conditional independence testing is an important problem, yet provably hard without assumptions. One of the assumptions that has become popular of late is called "model-X", where we assume we know the joint distribution of the covariates, but assume nothing about the conditional distribution of the outcome given the covariates. Knockoffs is a popular methodology associated with this framework, but it suffers from two main drawbacks: only one-bit $p$-values are available for inference on each variable, and the method is randomized with significant variability across runs in practice. The conditional randomization test (CRT) is thought to be the "right" solution under model-X, but usually viewed as computationally inefficient. This paper proposes a computationally efficient leave-one-covariate-out (LOCO) CRT that addresses both drawbacks of knockoffs. LOCO CRT produces valid $p$-values that can be used to control the familywise error rate, and has nearly zero algorithmic variability. For L1 regularized M-estimators, we develop an even faster variant called L1ME CRT, which reuses computation by leveraging a novel observation about the stability of the cross-validated lasso to removing inactive variables. Last, for multivariate Gaussian covariates, we present a closed form expression for the LOCO CRT $p$-value, thus completely eliminating resampling in this important special case.

Submitted 13 July, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: This paper has been withdrawn by the authors, because it has now been merged with (and superseded by) a parallel work arXiv:2006.03980 by Molei Liu and Lucas Janson

arXiv:2006.03980 [pdf, other]

Fast and Powerful Conditional Randomization Testing via Distillation

Authors: Molei Liu, Eugene Katsevich, Lucas Janson, Aaditya Ramdas

Abstract: We consider the problem of conditional independence testing: given a response Y and covariates (X,Z), we test the null hypothesis that Y is independent of X given Z. The conditional randomization test (CRT) was recently proposed as a way to use distributional information about X|Z to exactly (non-asymptotically) control Type-I error using any test statistic in any dimensionality without assuming anything about Y|(X,Z). This flexibility in principle allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the CRT is prohibitively computationally expensive, especially with multiple testing, due to the CRT's requirement to recompute the test statistic many times on resampled data. We propose the distilled CRT, a novel approach to using state-of-the-art machine learning algorithms in the CRT while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the CRT's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks like screening and recycling computations to further speed up the CRT without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing CRT implementations but requires orders of magnitude less computation, making it a practical tool even for large data sets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.

Submitted 4 June, 2021; v1 submitted 6 June, 2020; originally announced June 2020.

Comments: This paper has been merged with a parallel work arXiv:2006.08482 by Eugene Katsevich and Aaditya Ramdas
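
Illustrative note: the basic conditional randomization test described in the abstract above (resample X from its known distribution given Z, recompute the statistic, compare with the observed value) can be sketched as follows. This is a plain CRT with a simple fixed statistic, not the distilled CRT proposed in the paper. The Gaussian model X | Z ~ N(Z beta, 1) with known beta, the function name `crt_pvalue`, the residual-outcome inner product used as the statistic, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def crt_pvalue(x, y, z, beta, n_resamples=1000, seed=0):
    """Monte Carlo CRT p-value (hypothetical helper).

    Assumes, for illustration only, that X | Z ~ N(Z @ beta, 1) is known.
    The statistic is |sum_i (x_i - E[x_i | z_i]) * y_i|.
    """
    rng = np.random.default_rng(seed)
    mu = z @ beta                       # known conditional mean of X given Z
    t_obs = abs((x - mu) @ y)
    # Resampled X minus mu is fresh N(0, 1) noise under the assumed model.
    noise = rng.normal(size=(n_resamples, x.size))
    t_null = np.abs(noise @ y)
    # The +1 terms count the observed statistic among the resamples.
    return (1 + np.sum(t_null >= t_obs)) / (n_resamples + 1)

# Simulated data (assumed for illustration).
demo_rng = np.random.default_rng(1)
n = 200
z = demo_rng.normal(size=(n, 3))
beta = np.array([1.0, 0.0, 0.0])
x = z @ beta + demo_rng.normal(size=n)
y_null = z.sum(axis=1) + demo_rng.normal(size=n)            # Y depends on Z only
y_alt = 2.0 * x + z.sum(axis=1) + demo_rng.normal(size=n)   # Y also depends on X
p_null = crt_pvalue(x, y_null, z, beta)
p_alt = crt_pvalue(x, y_alt, z, beta)
print(p_null, p_alt)
```

With a statistic this cheap the resampling loop is inexpensive; the computational burden the abstract refers to arises when each resample requires refitting a machine learning model.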

arXiv:2005.05506 [pdf, other]

On the power of conditional independence testing under model-X

Authors: Eugene Katsevich, Aaditya Ramdas

Abstract: For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative insights into the role of machine learning and providing evidence in favor of using likelihood-based statistics in practice. Focusing on the conditional randomization test (CRT), we find that its conditional mode of inference allows us to reformulate it as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that a likelihood-based statistic yields the most powerful CRT against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.

Submitted 29 October, 2022; v1 submitted 11 May, 2020; originally announced May 2020.

arXiv:2003.07236 [pdf, other]

doi 10.2140/paa.2021.3.595

Analysis of a fourth order exponential PDE arising from a crystal surface jump process with Metropolis-type transition rates

Authors: Yuan Gao, Anya E. Katsevich, Jian-Guo Liu, Jianfeng Lu, Jeremy L. Marzuola

Abstract: We analytically and numerically study a fourth order PDE modeling rough crystal surface diffusion on the macroscopic level. We discuss existence of solutions globally in time and long time dynamics for the PDE model. The PDE, originally derived by the second author, is the continuum limit of a microscopic model of the surface dynamics, given by a Markov jump process with Metropolis type transition rates. We outline the convergence argument, which depends on a simplifying assumption on the local equilibrium measure that is valid in the high temperature regime. We provide numerical evidence for the convergence of the microscopic model to the PDE in this regime.

Submitted 19 November, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

Comments: 14 pages, 4 figures, comments welcome! Revised significantly thanks to very thorough referee reports. Some previous discussions have been removed and will be reported in a separate result by one of the authors

Journal ref: Pure Appl. Analysis 3 (2021) 595-612

arXiv:1809.01792 [pdf, other]

Filtering the rejection set while preserving false discovery rate control

Authors: Eugene Katsevich, Chiara Sabatti, Marina Bogomolov

Abstract: Scientific hypotheses in a variety of applications have domain-specific structures, such as the tree structure of the International Classification of Diseases (ICD), the directed acyclic graph structure of the Gene Ontology (GO), or the spatial structure in genome-wide association studies. In the context of multiple testing, the resulting relationships among hypotheses can create redundancies among rejections that hinder interpretability. This leads to the practice of filtering rejection sets obtained from multiple testing procedures, which may in turn invalidate their inferential guarantees. We propose Focused BH, a simple, flexible, and principled methodology to adjust for the application of any pre-specified filter. We prove that Focused BH controls the false discovery rate under various conditions, including when the filter satisfies an intuitive monotonicity property and the p-values are positively dependent. We demonstrate in simulations that Focused BH performs well across a variety of settings, and illustrate this method's practical utility via analyses of real datasets based on ICD and GO.

Submitted 10 April, 2020; v1 submitted 5 September, 2018; originally announced September 2018.

arXiv:1803.06790 [pdf, other]

doi 10.1214/19-AOS1938

Simultaneous high-probability bounds on the false discovery proportion in structured, regression, and online settings

Authors: Eugene Katsevich, Aaditya Ramdas

Abstract: While traditional multiple testing procedures prohibit adaptive analysis choices made by users, Goeman and Solari (2011) proposed a simultaneous inference framework that allows users such flexibility while preserving high-probability bounds on the false discovery proportion (FDP) of the chosen set. In this paper, we propose a new class of such simultaneous FDP bounds, tailored for nested sequences of rejection sets. While most existing simultaneous FDP bounds are based on closed testing using global null tests based on sorted p-values, we additionally consider the setting where side information can be leveraged to boost power, the variable selection setting where knockoff statistics can be used to order variables, and the online setting where decisions about rejections must be made as data arrives. Our finite-sample, closed form bounds are based on repurposing the FDP estimates from false discovery rate (FDR) controlling procedures designed for each of the above settings. These results establish a novel connection between the parallel literatures of simultaneous FDP bounds and FDR control methods, and use proof techniques employing martingales and filtrations that are new to both these literatures. We demonstrate the utility of our results by augmenting a recent knockoffs analysis of the UK Biobank dataset.

Submitted 1 December, 2019; v1 submitted 18 March, 2018; originally announced March 2018.

Journal ref: Annals of Statistics 2020, Vol. 48, No. 6, 3465-3487

arXiv:1706.09375 [pdf, other]

Multilayer Knockoff Filter: Controlled variable selection at multiple resolutions

Authors: Eugene Katsevich, Chiara Sabatti

Abstract: We tackle the problem of selecting from among a large number of variables those that are 'important' for an outcome. We consider situations where groups of variables are also of interest in their own right. For example, each variable might be a genetic polymorphism and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. Or, variables might quantify various aspects of the functioning of individual internet servers owned by a company, and we might be interested in assessing the importance of each server as a whole on the average download speed for the company's customers. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful and reproducible results, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candes (2015) and the multilayer testing framework of Barber and Ramdas (2016), we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.

Submitted 9 August, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

arXiv:1412.0985 [pdf, other]

doi 10.1109/ISBI.2015.7163849

Covariance estimation using conjugate gradient for 3D classification in Cryo-EM

Authors: Joakim Andén, Eugene Katsevich, Amit Singer

Abstract: Classifying structural variability in noisy projections of biological macromolecules is a central problem in Cryo-EM. In this work, we build on a previous method for estimating the covariance matrix of the three-dimensional structure present in the molecules being imaged. Our proposed method allows for incorporation of contrast transfer function and non-uniform distribution of viewing angles, making it more suitable for real-world data. We evaluate its performance on a synthetic dataset and an experimental dataset obtained by imaging a 70S ribosome complex.

Submitted 11 February, 2015; v1 submitted 2 December, 2014; originally announced December 2014.

Showing 1–15 of 15 results for author: Katsevich, E