-
To tweak or not to tweak. How exploiting flexibilities in gene set analysis leads to over-optimism
Authors:
Milena Wünsch,
Christina Sauer,
Moritz Herrmann,
Ludwig Christian Hinske,
Anne-Laure Boulesteix
Abstract:
Gene set analysis, a popular approach for analysing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibili…
▽ More
Gene set analysis, a popular approach for analysing high-throughput gene expression data, aims to identify sets of genes that show enriched expression patterns between two conditions. In addition to the multitude of methods available for this task, users are typically left with many options when creating the required input and specifying the internal parameters of the chosen method. This flexibility can lead to uncertainty about the 'right' choice, further reinforced by a lack of evidence-based guidance. Especially when their statistical experience is scarce, this uncertainty might entice users to produce preferable results using a 'trial-and-error' approach. While it may seem unproblematic at first glance, this practice can be viewed as a form of 'cherry-picking' and cause an optimistic bias, rendering the results non-replicable on independent data. After this problem has attracted a lot of attention in the context of classical hypothesis testing, we now aim to raise awareness of such over-optimism in the different and more complex context of gene set analyses. We mimic a hypothetical researcher who systematically selects the analysis variants yielding their preferred results, thereby considering three distinct goals they might pursue. Using a selection of popular gene set analysis methods, we tweak the results in this way for two frequently used benchmark gene expression data sets. Our study indicates that the potential for over-optimism is particularly high for a group of methods frequently used despite being commonly criticised. We conclude by providing practical recommendations to counter over-optimism in research findings in gene set analysis and beyond.
△ Less
Submitted 1 February, 2024;
originally announced February 2024.
-
Addressing researcher degrees of freedom through minP adjustment
Authors:
Maximilian M Mandl,
Andrea S Becker-Pennrich,
Ludwig C Hinske,
Sabine Hoffmann,
Anne-Laure Boulesteix
Abstract:
When different researchers study the same research question using the same dataset they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers' analytical choices, an issue also referred to as ''researcher degrees of freedom''. Combined with selective reporting of the smallest p-value or largest effect, researcher degree…
▽ More
When different researchers study the same research question using the same dataset they may obtain different and potentially even conflicting results. This is because there is often substantial flexibility in researchers' analytical choices, an issue also referred to as ''researcher degrees of freedom''. Combined with selective reporting of the smallest p-value or largest effect, researcher degrees of freedom may lead to an increased rate of false positive and overoptimistic results. In this paper, we address this issue by formalizing the multiplicity of analysis strategies as a multiple testing problem. As the test statistics of different analysis strategies are usually highly dependent, a naive approach such as the Bonferroni correction is inappropriate because it leads to an unacceptable loss of power. Instead, we propose using the ''minP'' adjustment method, which takes potential test dependencies into account and approximates the underlying null distribution of the minimal p-value through a permutation-based procedure. This procedure is known to achieve more power than simpler approaches while ensuring a weak control of the family-wise error rate. We illustrate our approach for addressing researcher degrees of freedom by applying it to a study on the impact of perioperative paO2 on post-operative complications after neurosurgery. A total of 48 analysis strategies are considered and adjusted using the minP procedure. This approach allows to selectively report the result of the analysis strategy yielding the most convincing evidence, while controlling the type 1 error -- and thus the risk of publishing false positive results that may not be replicable.
△ Less
Submitted 21 January, 2024;
originally announced January 2024.
-
From RNA sequencing measurements to the final results: a practical guide to navigating the choices and uncertainties of gene set analysis
Authors:
Milena Wünsch,
Christina Sauer,
Patrick Callahan,
Ludwig Christian Hinske,
Anne-Laure Boulesteix
Abstract:
Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In the last years, a multitude of methods and corresponding tools have been developed for this task. However, clear guidance is lacking: choosing the right method is the first…
▽ More
Gene set analysis, a popular approach for analyzing high-throughput gene expression data, aims to identify sets of related genes that show significantly enriched or depleted expression patterns between different conditions. In the last years, a multitude of methods and corresponding tools have been developed for this task. However, clear guidance is lacking: choosing the right method is the first hurdle a researcher is confronted with. No less challenging than overcoming this so-called method uncertainty is the procedure of preprocessing, from knowing which steps are required to selecting a corresponding approach from the plethora of valid options to create the accepted input object (data preprocessing uncertainty), with clear guidance again being scarce. Here, we provide a practical guide through all steps required to conduct gene set analysis, beginning with a concise overview of a selection of established methods, including GSEA and DAVID. We thereby lay a special focus on reviewing and explaining the necessary preprocessing steps for each method under consideration (e.g. the necessity of a transformation of the RNA-Seq data)-an essential aspect that is typically paid only limited attention to in both existing reviews and applications. To raise awareness of the spectrum of uncertainties, our review is accompanied by an extensive overview of the literature on valid approaches for each step and illustrative R code demonstrating the complex analysis pipelines. It ends with a discussion and recommendations to both users and developers to ensure that the results of gene set analysis are, despite the above-mentioned uncertainties, replicable and transparent.
△ Less
Submitted 29 August, 2023;
originally announced August 2023.