-
Nonparametric methods controlling the median of the false discovery proportion
Authors:
Jesse Hemerik
Abstract:
When testing many hypotheses, we often do not have strong expectations about the directions of the effects. In some situations, however, the alternative hypotheses are that the parameters lie in a certain direction or interval, and it is in fact expected that most hypotheses are false. This is often the case when researchers perform multiple noninferiority or equivalence tests, e.g. when testing food safety with metabolite data. The goal is then to use data to corroborate the expectation that most hypotheses are false. We propose a nonparametric multiple testing approach that is powerful in such situations. If the user's expectations are wrong, our approach will still be valid but have low power. Of course, all multiple testing methods become more powerful when appropriate one-sided instead of two-sided tests are used, but our approach has superior power in that setting. The methods in this paper control the median of the false discovery proportion (FDP), which is the fraction of false discoveries among the rejected hypotheses. This approach is comparable to false discovery rate control, where one ensures that the mean rather than the median of the FDP is small. Our procedures make use of a symmetry property of the test statistics, do not require independence and are valid for finite samples.
Submitted 29 January, 2025; v1 submitted 28 January, 2025;
originally announced January 2025.
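The distinction between the median and the mean of the FDP can be made concrete with a small sketch. The helper name `false_discovery_proportion` and the five FDP values are hypothetical illustrations, not taken from the paper; the convention that the FDP is 0 when nothing is rejected is also an assumption of this sketch.

```python
from statistics import mean, median

def false_discovery_proportion(rejected, true_nulls):
    """Fraction of rejected hypotheses that are true nulls (0 if nothing is rejected)."""
    rejected = set(rejected)
    if not rejected:
        return 0.0
    return len(rejected & set(true_nulls)) / len(rejected)

# Hypothetical FDP values from five repetitions of the same experiment.
fdps = [0.0, 0.0, 0.05, 0.10, 0.60]

print(median(fdps))  # the quantity that median-FDP methods aim to keep small
print(mean(fdps))    # the quantity that false discovery rate control targets
```

In this made-up example one unlucky repetition pulls the mean well above the median, which is why the two criteria can lead to different guarantees.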
-
Choosing alpha post hoc: the danger of multiple standard significance thresholds
Authors:
Jesse Hemerik,
Nick W Koning
Abstract:
A fundamental assumption of classical hypothesis testing is that the significance threshold $α$ is chosen independently from the data. The validity of confidence intervals likewise relies on choosing $α$ beforehand. We point out that the independence of $α$ is guaranteed in practice because, in most fields, there exists one standard $α$ that everyone uses -- so that $α$ is automatically independent of everything. However, there have been recent calls to decrease $α$ from $0.05$ to $0.005$. We note that this may lead to multiple accepted standard thresholds within one scientific field. For example, different journals may require different significance thresholds. As a consequence, some researchers may be tempted to conveniently choose their $α$ based on their p-value. We use examples to illustrate that this severely invalidates hypothesis tests, and mention some potential solutions.
Submitted 10 March, 2025; v1 submitted 3 October, 2024;
originally announced October 2024.
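A toy simulation can illustrate why choosing the threshold after seeing the p-value is problematic. It is not one of the paper's examples: the function name `post_hoc_alpha_simulation`, the two thresholds and the assumed researcher behaviour (report the strictest standard threshold that the p-value satisfies) are assumptions made for this sketch.

```python
import numpy as np

def post_hoc_alpha_simulation(thresholds=(0.005, 0.05), n_studies=200_000, seed=0):
    """Simulate null studies where alpha is picked as the strictest threshold p satisfies."""
    rng = np.random.default_rng(seed)
    p = rng.uniform(size=n_studies)         # p-values when the null hypothesis is true
    rejected = p <= max(thresholds)         # some accepted threshold can be satisfied
    claimed_strict = p <= min(thresholds)   # rejection is reported at the strict level
    rejection_rate = rejected.mean()
    strict_share = claimed_strict.sum() / max(rejected.sum(), 1)
    return rejection_rate, strict_share

rate, share = post_hoc_alpha_simulation()
print(f"false rejection rate: {rate:.4f}")
print(f"share of false rejections reported at the strict level: {share:.3f}")
```

Under these assumptions the overall false rejection rate is governed by the most lenient threshold, even though a portion of the rejections is reported as if the strict threshold had been fixed in advance.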
-
Robust Inference for Generalized Linear Mixed Models: An Approach Based on Score Sign Flipping
Authors:
Angela Andreella,
Jelle Goeman,
Jesse Hemerik,
Livio Finos
Abstract:
Despite the versatility of generalized linear mixed models in handling complex experimental designs, they often suffer from misspecification and convergence problems. This makes inference on the values of coefficients problematic. To address these challenges, we propose a robust extension of the score-based statistical test using sign-flipping transformations. Our approach efficiently handles within-variance structure and heteroscedasticity, ensuring accurate regression coefficient testing. The approach is illustrated by analyzing the reduction of health issues over time for newly adopted children. The model is characterized by a binomial response with unbalanced frequencies and several categorical and continuous predictors. The proposed approach efficiently deals with critical problems related to longitudinal nonlinear models, surpassing common statistical approaches such as generalized estimating equations and generalized linear mixed models.
Submitted 27 March, 2025; v1 submitted 31 January, 2024;
originally announced January 2024.
-
On the term "randomization test"
Authors:
Jesse Hemerik
Abstract:
There exists no consensus on the meaning of the term "randomization test". Contradictory uses of the term are leading to confusion, misunderstandings and indeed invalid data analyses. As we point out, a main source of the confusion is that the term was not explicitly defined when it was first used in the 1930s. Later authors made clear proposals to reach a consensus regarding the term. This resulted in some level of agreement around the 1970s. However, in the last few decades, the term has often been used in ways that contradict these proposals. This paper provides an overview of the history of the term per se, for the first time tracing it back to 1937. This will hopefully lead to more agreement on terminology and less confusion on the related fundamental concepts.
Submitted 13 June, 2023;
originally announced June 2023.
-
Inference in generalized linear models with robustness to misspecified variances
Authors:
Riccardo De Santis,
Jelle J. Goeman,
Jesse Hemerik,
Samuel Davenport,
Livio Finos
Abstract:
Generalized linear models usually assume a common dispersion parameter, an assumption that is seldom true in practice. Consequently, standard parametric methods may suffer appreciable loss of type I error control. As an alternative, we present a semi-parametric group-invariance method based on sign flipping of score contributions. Our method requires only the correct specification of the mean model, but is robust against any misspecification of the variance. We present tests for single as well as multiple regression coefficients. The test is asymptotically valid but shows excellent performance in small samples. We illustrate the method using RNA sequencing count data, for which it is difficult to model the overdispersion correctly. The method is available in the R library flipscores.
Submitted 13 September, 2024; v1 submitted 28 September, 2022;
originally announced September 2022.
-
Flexible control of the median of the false discovery proportion
Authors:
Jesse Hemerik,
Aldo Solari,
Jelle J Goeman
Abstract:
We introduce a multiple testing procedure that controls the median of the proportion of false discoveries (FDP) in a flexible way. The procedure only requires a vector of p-values as input and is comparable to the Benjamini-Hochberg method, which controls the mean of the FDP. Our method allows freely choosing one or several values of alpha after seeing the data -- unlike Benjamini-Hochberg, which can be very liberal when alpha is chosen post hoc. We prove these claims and illustrate them with simulations. Our procedure is inspired by a popular estimator of the total number of true hypotheses. We adapt this estimator to provide simultaneously median unbiased estimators of the FDP, valid for finite samples. This simultaneity allows for the claimed flexibility. Our approach does not assume independence. The time complexity of our method is linear in the number of hypotheses, after sorting the p-values.
Submitted 13 March, 2024; v1 submitted 24 August, 2022;
originally announced August 2022.
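For reference, the Benjamini-Hochberg comparator mentioned in the abstract can be sketched as follows. This is the standard step-up procedure with alpha fixed in advance, not the authors' median-FDP method; the function name `benjamini_hochberg` is chosen here for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean array marking hypotheses rejected by the BH step-up procedure."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, smallest first
    thresholds = alpha * np.arange(1, m + 1) / m   # i * alpha / m for rank i
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        rejected[order[: k + 1]] = True            # reject everything up to that rank
    return rejected

print(benjamini_hochberg([0.02, 0.02, 0.03, 0.04]))
```

Like the procedure described in the abstract, this takes only a vector of p-values as input; the difference is that the BH guarantee on the mean of the FDP presumes alpha was not chosen after looking at the data.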
-
More Efficient Exact Group-Invariance Testing: using a Representative Subgroup
Authors:
Nick W. Koning,
Jesse Hemerik
Abstract:
Non-parametric tests based on permutation, rotation or sign-flipping are examples of group-invariance tests. These tests test invariance of the null distribution under a set of transformations that has a group structure, in the algebraic sense. Such groups are often huge, which makes it computationally infeasible to test using the entire group. Hence, it is standard practice to test using a randomly sampled set of transformations from the group. This random sample still needs to be substantial to obtain good power and replicability. We improve upon this standard practice by using a well-designed subgroup of transformations instead of a random sample. The resulting subgroup-invariance test is still exact, as invariance under a group implies invariance under its subgroups.
We illustrate this in a generalized location model and obtain more powerful tests based on the same number of transformations. In particular, we show that a subgroup-invariance test is consistent for lower signal-to-noise ratios than a test based on a random sample. For the special case of a normal location model and a particular design of the subgroup, we show that the power improvement is equivalent to the power difference between a Monte Carlo $Z$-test and a Monte Carlo $t$-test.
Submitted 22 November, 2022; v1 submitted 2 February, 2022;
originally announced February 2022.
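The standard practice that the abstract improves upon, testing with a randomly sampled set of transformations, can be sketched for a two-sample permutation test. This shows the random-sampling baseline only, not the subgroup design proposed in the paper; the function name, the test statistic and the default of 999 permutations are assumptions of the sketch.

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for a difference in means between two samples."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    n_x = x.size
    observed = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)              # one randomly drawn permutation
        stat = abs(perm[:n_x].mean() - perm[n_x:].mean())
        count += int(stat >= observed)
    # The +1 terms count the identity permutation, i.e. the observed data.
    return (count + 1) / (n_perm + 1)

print(permutation_pvalue(np.arange(10, 20), np.arange(0, 10)))
```

Because the transformations are drawn at random, the p-value depends on the seed and has a resolution of 1 / (n_perm + 1), which is one reason the random sample needs to be substantial.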
-
Permutation-based true discovery proportions for functional Magnetic Resonance Imaging cluster analysis
Authors:
Angela Andreella,
Jesse Hemerik,
Wouter Weeda,
Livio Finos,
Jelle Goeman
Abstract:
We propose a permutation-based method for testing a large collection of hypotheses simultaneously. Our method provides lower bounds for the number of true discoveries in any selected subset of hypotheses. These bounds are simultaneously valid with high confidence. The methodology is particularly useful in functional Magnetic Resonance Imaging cluster analysis, where it provides a confidence statement on the percentage of truly activated voxels within clusters of voxels, avoiding the well-known spatial specificity paradox. We offer a user-friendly tool to estimate the percentage of true discoveries for each cluster while controlling the family-wise error rate for multiple testing and taking into account that the cluster was chosen in a data-driven way. The method adapts to the spatial correlation structure that characterizes functional Magnetic Resonance Imaging data, gaining power over parametric approaches.
Submitted 26 January, 2023; v1 submitted 1 December, 2020;
originally announced December 2020.
-
On optimal two-stage testing of multiple mediators
Authors:
Vera Djordjilović,
Jesse Hemerik,
Magne Thoresen
Abstract:
Mediation analysis in high-dimensional settings often involves identifying potential mediators among a large number of measured variables. For this purpose, a two-step familywise error rate procedure called ScreenMin has been recently proposed (Djordjilović et al. 2019). In ScreenMin, variables are first screened and only those that pass the screening are tested. The proposed threshold for selection has been shown to guarantee asymptotic familywise error rate control. In this work, we investigate the impact of the selection threshold on the finite sample familywise error rate. We derive a power-maximizing selection threshold and show that it is well approximated by an adaptive threshold of Wang et al. (2016). We illustrate the investigated procedures on a case-control study examining the effect of fish intake on the risk of colorectal adenoma.
Submitted 6 July, 2020;
originally announced July 2020.
-
Permutation testing in high-dimensional linear models: an empirical investigation
Authors:
Jesse Hemerik,
Magne Thoresen,
Livio Finos
Abstract:
Permutation testing in linear models, where the number of nuisance coefficients is smaller than the sample size, is a well-studied topic. The common approach of such tests is to permute residuals after regressing on the nuisance covariates. Permutation-based tests are valuable in particular because they can be highly robust to violations of the standard linear model, such as non-normality and heteroscedasticity. Moreover, in some cases they can be combined with existing, powerful permutation-based multiple testing methods. Here, we propose permutation tests for models where the number of nuisance coefficients exceeds the sample size. The performance of the novel tests is investigated with simulations. In a wide range of simulation scenarios our proposed permutation methods provided appropriate type I error rate control, unlike some competing tests, while having good power.
Submitted 8 October, 2020; v1 submitted 6 January, 2020;
originally announced January 2020.
-
Another look at the Lady Tasting Tea and differences between permutation tests and randomization tests
Authors:
Jesse Hemerik,
Jelle J. Goeman
Abstract:
The statistical literature is known to be inconsistent in the use of the terms "permutation test" and "randomization test". Several authors successfully argue that these terms should be used to refer to two distinct classes of tests and that there are major conceptual differences between these classes. The present paper explains an important difference in mathematical reasoning between these classes: a permutation test fundamentally requires that the set of permutations has a group structure, in the algebraic sense; the reasoning behind a randomization test is not based on such a group structure and it is possible to use an experimental design that does not correspond to a group. In particular, we can use a randomization scheme where the number of possible treatment patterns is larger than in standard experimental designs. This leads to exact p-values of improved resolution, providing increased power for very small significance levels, at the cost of decreased power for larger significance levels. We discuss applications in randomized trials and elsewhere. Further, we explain that Fisher's famous Lady Tasting Tea experiment, which is commonly referred to as the first permutation test, is in fact a randomization test. This distinction is important to avoid confusion and invalid tests.
Submitted 6 October, 2020; v1 submitted 5 December, 2019;
originally announced December 2019.
-
Optimal two-stage testing of multiple mediators
Authors:
Vera Djordjilović,
Jesse Hemerik,
Magne Thoresen
Abstract:
Mediation analysis in high-dimensional settings often involves identifying potential mediators among a large number of measured variables. For this purpose, a two-step familywise error rate (FWER) procedure called ScreenMin has been recently proposed (Djordjilović et al. 2019). In ScreenMin, variables are first screened and only those that pass the screening are tested. The proposed threshold for selection has been shown to guarantee asymptotic FWER control. In this work, we investigate the impact of the selection threshold on the finite sample FWER. We derive a power-maximizing selection threshold and show that it is well approximated by an adaptive threshold of Wang et al. (2016). We study the performance of the proposed procedures in a simulation study, and apply them to a case-control study examining the effect of fish intake on the risk of colorectal adenoma.
Submitted 3 November, 2019;
originally announced November 2019.
-
Robust testing in generalized linear models by sign-flipping score contributions
Authors:
Jesse Hemerik,
Jelle J Goeman,
Livio Finos
Abstract:
Generalized linear models are often misspecified due to overdispersion, heteroscedasticity and ignored nuisance variables. Existing quasi-likelihood methods for testing in misspecified models often do not provide satisfactory type-I error rate control. We provide a novel semi-parametric test, based on sign-flipping individual score contributions. The tested parameter is allowed to be multi-dimensional and even high-dimensional. Our test is often robust against the mentioned forms of misspecification and provides better type-I error control than its competitors. When nuisance parameters are estimated, our basic test becomes conservative. We show how to take nuisance estimation into account to obtain an asymptotically exact test. Our proposed test is asymptotically equivalent to its parametric counterpart.
Submitted 8 May, 2020; v1 submitted 9 September, 2019;
originally announced September 2019.
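The basic idea of sign-flipping per-observation contributions can be sketched in a much simplified setting. This is not the authors' test: it omits the fitted generalized linear model and the correction for estimated nuisance parameters, and it treats the input `contributions` as hypothetical per-observation values assumed to be symmetric about zero under the null hypothesis.

```python
import numpy as np

def sign_flip_pvalue(contributions, n_flips=999, seed=0):
    """Monte Carlo sign-flipping p-value for H0: contributions are symmetric about zero."""
    rng = np.random.default_rng(seed)
    x = np.asarray(contributions, dtype=float)
    observed = abs(x.sum())
    # Each row holds one random vector of +1/-1 signs, one sign per observation.
    signs = rng.choice([-1.0, 1.0], size=(n_flips, x.size))
    flipped = np.abs(signs @ x)
    # The +1 terms count the unflipped data (the identity transformation).
    return (1 + np.sum(flipped >= observed)) / (n_flips + 1)

print(sign_flip_pvalue(np.arange(1, 21)))
```

When all contributions share the same sign, almost no random flip reproduces a sum as extreme as the observed one, so the p-value is small; when the contributions cancel out, every flip is at least as extreme and the p-value is 1.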
-
Only Closed Testing Procedures are Admissible for Controlling False Discovery Proportions
Authors:
Jelle Goeman,
Jesse Hemerik,
Aldo Solari
Abstract:
We consider the class of all multiple testing methods controlling tail probabilities of the false discovery proportion, either for one random set or simultaneously for many such sets. This class encompasses methods controlling familywise error rate, generalized familywise error rate, false discovery exceedance, joint error rate, simultaneous control of all false discovery proportions, and others, as well as seemingly unrelated methods such as gene set testing in genomics and cluster inference methods in neuroimaging. We show that all such methods are either equivalent to a closed testing method, or are uniformly improved by one. Moreover, we show that a closed testing method is admissible as a method controlling tail probabilities of false discovery proportions if and only if all its local tests are admissible. This implies that, when designing such methods, it is sufficient to restrict attention to closed testing methods only. We demonstrate the practical usefulness of this design principle by constructing a uniform improvement of a recently proposed method.
Submitted 29 April, 2022; v1 submitted 15 January, 2019;
originally announced January 2019.
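The closed testing principle that the abstract refers to can be sketched by brute force for a handful of hypotheses. The choice of Bonferroni local tests is an assumption made for this illustration and is not taken from the paper; with that choice the procedure controls the familywise error rate and agrees with Holm's method.

```python
from itertools import combinations

def closed_testing_bonferroni(pvalues, alpha=0.05):
    """Indices of hypotheses rejected by closed testing with Bonferroni local tests."""
    m = len(pvalues)
    # All non-empty subsets of hypothesis indices (exponential in m: small m only).
    subsets = [s for size in range(1, m + 1) for s in combinations(range(m), size)]

    def local_test_rejects(subset):
        # Bonferroni test of the intersection hypothesis over `subset`.
        return min(pvalues[i] for i in subset) <= alpha / len(subset)

    rejected_intersections = {s for s in subsets if local_test_rejects(s)}
    # H_i is rejected only if every intersection hypothesis containing i is rejected.
    return [
        i for i in range(m)
        if all(s in rejected_intersections for s in subsets if i in s)
    ]

print(closed_testing_bonferroni([0.01, 0.02, 0.5]))  # prints [0, 1]
```

Swapping in a different local test changes the resulting method, which is the design freedom that the admissibility result in the abstract is about.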
-
Permutation-based simultaneous confidence bounds for the false discovery proportion
Authors:
Jesse Hemerik,
Aldo Solari,
Jelle J. Goeman
Abstract:
When multiple hypotheses are tested, interest is often in ensuring that the proportion of false discoveries (FDP) is small with high confidence. In this paper, confidence upper bounds for the FDP are constructed, which are simultaneous over all rejection cut-offs. In particular this allows the user to select a set of hypotheses post hoc such that the FDP lies below some constant with high confidence. Our method uses permutations to account for the dependence structure in the data. So far only Meinshausen provided an exact, permutation-based and computationally feasible method for simultaneous FDP bounds. We provide an exact method, which uniformly improves this procedure. Further, we provide a generalization of this method. It lets the user select the shape of the simultaneous confidence bounds. This gives the user more freedom in determining the power properties of the method. Interestingly, several existing permutation methods, such as Significance Analysis of Microarrays (SAM) and Westfall and Young's maxT method, are obtained as special cases.
Submitted 16 August, 2018;
originally announced August 2018.