-
Choosing a Model, Shaping a Future: Comparing LLM Perspectives on Sustainability and its Relationship with AI
Authors:
Annika Bush,
Meltem Aksoy,
Markus Pauly,
Greta Ontrup
Abstract:
As organizations increasingly rely on AI systems for decision support in sustainability contexts, it becomes critical to understand the inherent biases and perspectives embedded in Large Language Models (LLMs). This study systematically investigates how five state-of-the-art LLMs (Claude, DeepSeek, GPT, LLaMA, and Mistral) conceptualize sustainability and its relationship with AI. We administered validated psychometric sustainability-related questionnaires, each 100 times per model, to capture response patterns and variability. Our findings revealed significant inter-model differences: for example, GPT exhibited skepticism about the compatibility of AI and sustainability, whereas LLaMA demonstrated extreme techno-optimism with perfect scores for several Sustainable Development Goals (SDGs). Models also diverged in attributing institutional responsibility for AI and sustainability integration, a result that holds implications for technology governance approaches. Our results demonstrate that model selection could substantially influence organizational sustainability strategies, highlighting the need for awareness of model-specific biases when deploying LLMs for sustainability-related decision-making.
Submitted 20 May, 2025;
originally announced May 2025.
-
OPTIKS: Optimized Gradient Properties Through Timing in K-Space
Authors:
Matthew A. McCready,
Xiaozhi Cao,
Kawin Setsompop,
John M. Pauly,
Adam B. Kerr
Abstract:
A customizable method (OPTIKS) for designing fast trajectory-constrained gradient waveforms with optimized time domain properties was developed. Given a specified multidimensional k-space trajectory, the method optimizes traversal speed (and therefore timing) with position along the trajectory. OPTIKS facilitates optimization of objectives dependent on the time domain gradient waveform and the arc-length domain k-space speed. OPTIKS is applied to design waveforms which limit peripheral nerve stimulation (PNS), minimize mechanical resonance excitation, and reduce acoustic noise. A variety of trajectory examples are presented including spirals, circular echo-planar-imaging, and rosettes. Design performance is evaluated based on duration, standardized PNS models, field measurements, gradient coil back-EMF measurements, and calibrated acoustic measurements. We show reductions in back-EMF of up to 94% and field oscillations up to 91.1%, acoustic noise decreases of up to 9.22 dB, and with efficient use of PNS models speed increases of up to 11.4%. The design method implementation is made available as an open source Python package through GitHub.
Submitted 11 May, 2025;
originally announced May 2025.
-
A Cautionary Tale About "Neutrally" Informative AI Tools Ahead of the 2025 Federal Elections in Germany
Authors:
Ina Dormuth,
Sven Franke,
Marlies Hafer,
Tim Katzke,
Alexander Marx,
Emmanuel Müller,
Daniel Neider,
Markus Pauly,
Jérôme Rutinowski
Abstract:
In this study, we examine the reliability of AI-based Voting Advice Applications (VAAs) and large language models (LLMs) in providing objective political information. Our analysis is based upon a comparison with party responses to 38 statements of the Wahl-O-Mat, a well-established German online tool that helps inform voters by comparing their views with political party positions. For the LLMs, we identify significant biases. They exhibit a strong alignment (over 75% on average) with left-wing parties and a substantially lower alignment with center-right (below 50%) and right-wing parties (around 30%). Furthermore, for the VAAs, intended to objectively inform voters, we found substantial deviations from the parties' stated positions in Wahl-O-Mat: while one VAA deviated in 25% of cases, another VAA showed deviations in more than 50% of cases. For the latter, we even observed that simple prompt injections led to severe hallucinations, including false claims such as non-existent ties between political parties and right-wing extremism.
Submitted 7 April, 2025; v1 submitted 21 February, 2025;
originally announced February 2025.
-
Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study
Authors:
Jakob Schwerter,
Andrés Romero,
Florian Dumpert,
Markus Pauly
Abstract:
Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.
Submitted 18 December, 2024;
originally announced December 2024.
-
A Central Limit Theorem for the permutation importance measure
Authors:
Nico Föge,
Lena Schmid,
Marc Ditzhaus,
Markus Pauly
Abstract:
Random Forests have become a widely used tool in machine learning since their introduction in 2001, known for their strong performance in classification and regression tasks. One key feature of Random Forests is the Random Forest Permutation Importance Measure (RFPIM), an internal, non-parametric measure of variable importance. While widely used, theoretical work on RFPIM is sparse, and most research has focused on empirical findings. However, recent progress has been made, such as establishing consistency of the RFPIM, although a mathematical analysis of its asymptotic distribution is still missing. In this paper, we provide a formal proof of a Central Limit Theorem for RFPIM using U-Statistics theory. Our approach deviates from the conventional Random Forest model by assuming a random number of trees and imposing conditions on the regression functions and error terms, which must be bounded and additive, respectively. Our result aims at improving the theoretical understanding of RFPIM rather than conducting comprehensive hypothesis testing. However, our contributions provide a solid foundation and demonstrate the potential for future work to extend to practical applications which we also highlight with a small simulation study.
Submitted 17 December, 2024;
originally announced December 2024.
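To make the quantity studied in the abstract above concrete, the following is a minimal sketch of a permutation importance: the increase in mean squared error after one feature column is shuffled. It is not the authors' estimator. The RFPIM is computed per tree on out-of-bag samples, whereas this sketch uses a single fixed predictor as a stand-in for a fitted forest, and all names (`mse`, `permutation_importance`, the toy data) are hypothetical.

```python
import random

def mse(predict, X, y):
    """Mean squared error of a prediction function on rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, j, rng):
    """Increase in MSE after randomly permuting feature column j."""
    base = mse(predict, X, y)
    col = [row[j] for row in X]
    rng.shuffle(col)  # break the link between feature j and the target
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return mse(predict, X_perm, y) - base

# Toy data (assumption): the target depends on feature 0 only.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(500)]
y = [3.0 * row[0] for row in X]

def predict(row):
    """Stand-in for a fitted forest; it ignores feature 1 entirely."""
    return 3.0 * row[0]

imp_signal = permutation_importance(predict, X, y, 0, rng)
imp_noise = permutation_importance(predict, X, y, 1, rng)
```

Here `imp_signal` is positive because shuffling the informative column degrades the predictions, while `imp_noise` is zero because the stand-in predictor never reads the second column.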
-
Quadratic Form based Multiple Contrast Tests for Comparison of Group Means
Authors:
Paavo Sattler,
Markus Pauly,
Merle Munko
Abstract:
Comparing the mean vectors across different groups is a cornerstone in the realm of multivariate statistics, with quadratic forms commonly serving as test statistics. However, when the overall hypothesis is rejected, identifying specific vector components or determining the groups among which differences exist requires additional investigations. Conversely, employing multiple contrast tests (MCT) allows conclusions about which components or groups contribute to these differences. However, they come with a trade-off, as MCT lose some benefits inherent to quadratic forms. In this paper, we combine both approaches to get a quadratic form based multiple contrast test that leverages the advantages of both. To understand its theoretical properties, we investigate its asymptotic distribution in a semiparametric model. We thereby focus on two common quadratic forms - the Wald-type statistic and the Anova-type statistic - although our findings are applicable to any quadratic form.
Furthermore, we employ Monte-Carlo and resampling techniques to enhance the test's performance in small sample scenarios. Through an extensive simulation study, we assess the performance of our proposed tests against existing alternatives, highlighting their advantages.
Submitted 3 June, 2025; v1 submitted 15 November, 2024;
originally announced November 2024.
-
Single CASANOVA? Not in multiple comparisons
Authors:
Ina Dormuth,
Carolin Herrmann,
Frank Konietschke,
Markus Pauly,
Matthias Wirth,
Marc Ditzhaus
Abstract:
When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any groups but also in where these differences are located. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply some corrections or introduce tests that control the FWER by design. In the case of time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated [1] and (ii) the correction generally leads to a lower power, especially when the test statistics are not independent [2]. We propose two new tests based on combined weighted log-rank tests: one as a simple multiple contrast test of weighted log-rank tests and one as an extension of the so-called CASANOVA test [3]. The latter was introduced for factorial designs. We propose a new multiple contrast test based on the CASANOVA approach. Our test promises to be more powerful under crossing hazards and eliminates the need for additional p-value correction. We assess the performance of our tests through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios. They outperform the adjusted approaches in some non-proportional settings in terms of power.
Submitted 7 January, 2025; v1 submitted 28 October, 2024;
originally announced October 2024.
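The Bonferroni correction that the abstract above uses as its reference method can be sketched in a few lines. This is a generic illustration of the correction itself, not of the log-rank or CASANOVA tests; the function names and the example p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject hypothesis i if p_i <= alpha / m, which bounds the FWER by alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def bonferroni_adjust(p_values):
    """Equivalent view: multiply each p-value by m and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

# Three hypothetical pairwise comparisons.
decisions = bonferroni_reject([0.01, 0.02, 0.04], alpha=0.05)
adjusted = bonferroni_adjust([0.01, 0.02, 0.04])
```

With three comparisons the per-test threshold drops to 0.05 / 3, so only the first hypothesis is rejected. That shrinking threshold is the power loss the abstract refers to.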
-
Is GPT-4 Less Politically Biased than GPT-3.5? A Renewed Investigation of ChatGPT's Political Biases
Authors:
Erik Weber,
Jérôme Rutinowski,
Niklas Jost,
Markus Pauly
Abstract:
This work investigates the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5 to GPT-4. In addition, the ability of the models to emulate political viewpoints (e.g., liberal or conservative positions) is analyzed. The Political Compass Test and the Big Five Personality Test were employed 100 times for each scenario, providing statistically significant results and an insight into the correlations among the results. The responses were analyzed by computing averages, standard deviations, and performing significance tests to investigate differences between GPT-3.5 and GPT-4. Correlations were found for traits that have been shown to be interdependent in human studies. Both models showed a progressive and libertarian political bias, with GPT-4's biases being slightly, but negligibly, less pronounced. Specifically, on the Political Compass, GPT-3.5 scored -6.59 on the economic axis and -6.07 on the social axis, whereas GPT-4 scored -5.40 and -4.73. In contrast to GPT-3.5, GPT-4 showed a remarkable capacity to emulate assigned political viewpoints, accurately reflecting the assigned quadrant (libertarian-left, libertarian-right, authoritarian-left, authoritarian-right) in all four tested instances. On the Big Five Personality Test, GPT-3.5 showed highly pronounced Openness and Agreeableness traits (O: 85.9%, A: 84.6%). Such pronounced traits correlate with libertarian views in human studies. While GPT-4 overall exhibited less pronounced Big Five personality traits, it did show a notably higher Neuroticism score. Assigned political orientations influenced Openness, Agreeableness, and Conscientiousness, again reflecting interdependencies observed in human studies. Finally, we observed that test sequencing affected ChatGPT's responses and the observed correlations, indicating a form of contextual memory.
Submitted 28 October, 2024;
originally announced October 2024.
-
AR-Sieve Bootstrap for the Random Forest and a simulation-based comparison with rangerts time series prediction
Authors:
Cabrel Teguemne Fokam,
Carsten Jentsch,
Michel Lang,
Markus Pauly
Abstract:
The Random Forest (RF) algorithm can be applied to a broad spectrum of problems, including time series prediction. However, neither the classical IID (Independent and Identically distributed) bootstrap nor block bootstrapping strategies (as implemented in rangerts) completely account for the nature of the Data Generating Process (DGP) while resampling the observations. We propose the combination of RF with a residual bootstrapping technique where we replace the IID bootstrap with the AR-Sieve Bootstrap (ARSB), which assumes the DGP to be an autoregressive process. To assess the new model's predictive performance, we conduct a simulation study using synthetic data generated from different types of DGPs. It turns out that ARSB provides more variation amongst the trees in the forest. Moreover, RF with ARSB shows greater accuracy compared to RF with other bootstrap strategies. However, these improvements are achieved at some efficiency costs.
Submitted 1 October, 2024;
originally announced October 2024.
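The resampling idea behind the abstract above can be sketched as follows: fit an autoregressive model, resample its centred residuals with replacement, and regenerate a series from the fitted recursion. This is a simplified illustration under stated assumptions, not the authors' implementation. It fixes the order at AR(1), whereas a sieve bootstrap lets the order grow with the sample size, and it omits the forest entirely (each tree would presumably be grown on one such resampled series). All function names are hypothetical.

```python
import random

def fit_ar1(x):
    """Least-squares AR(1) fit on the mean-centred series.

    Returns (mean, phi, centred residuals).
    """
    mu = sum(x) / len(x)
    z = [v - mu for v in x]
    phi = sum(a * b for a, b in zip(z[:-1], z[1:])) / sum(a * a for a in z[:-1])
    resid = [b - phi * a for a, b in zip(z[:-1], z[1:])]
    r_bar = sum(resid) / len(resid)
    return mu, phi, [r - r_bar for r in resid]

def ar_residual_bootstrap(x, rng, burn_in=50):
    """Draw one bootstrap series of the same length as x."""
    mu, phi, resid = fit_ar1(x)
    z, out = 0.0, []
    for t in range(burn_in + len(x)):
        z = phi * z + rng.choice(resid)  # i.i.d. draw from centred residuals
        if t >= burn_in:                 # discard the burn-in period
            out.append(mu + z)
    return out

# Synthetic AR(1) data with coefficient 0.6 (assumption, for illustration only).
rng = random.Random(0)
x, prev = [], 0.0
for _ in range(300):
    prev = 0.6 * prev + rng.gauss(0.0, 1.0)
    x.append(prev)

boot = ar_residual_bootstrap(x, rng)
```

Unlike an IID bootstrap of the observations, the regenerated series keeps the serial dependence implied by the fitted autoregression.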
-
Iterative Trace Minimization for the Reconciliation of Very Short Hierarchical Time Series
Authors:
Louis Steinmeister,
Markus Pauly
Abstract:
Time series often appear in an additive hierarchical structure. In such cases, time series on higher levels are the sums of their subordinate time series. This hierarchical structure places a natural constraint on forecasts. However, univariate forecasting techniques are incapable of ensuring this forecast coherence. An obvious solution is to forecast only bottom time series and obtain higher level forecasts through aggregation. This approach is also known as the bottom-up approach. In their seminal paper, Wickramasuriya et al. (2019) propose an optimal reconciliation approach named MinT. It tries to minimize the trace of the underlying covariance matrix of all forecast errors. The MinT algorithm has demonstrated superior performance to the bottom-up and other approaches and enjoys great popularity. This paper provides a simulation study examining the performance of MinT for very short time series and larger hierarchical structures. This scenario makes the covariance estimation required by MinT difficult. A novel iterative approach, MinTit, is introduced which significantly reduces the number of estimated parameters. This approach is capable of improving forecast accuracy further. The application of MinTit is also demonstrated in a case study using a semiconductor dataset based on data provided by the World Semiconductor Trade Statistics (WSTS), a premier provider of semiconductor market data.
Submitted 19 March, 2025; v1 submitted 27 September, 2024;
originally announced September 2024.
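The coherence constraint and the bottom-up approach described in the abstract above can be illustrated with a small summing matrix. The sketch covers only bottom-up aggregation, not MinT or the iterative variant, and the hierarchy, node names and forecast values are hypothetical.

```python
# Toy hierarchy (assumption): total = A + B, with A = A1 + A2.
# Each row of the summing matrix S states which bottom-level series add up
# to the given node.
BOTTOM = ["A1", "A2", "B"]
S = {
    "total": [1, 1, 1],
    "A":     [1, 1, 0],
    "A1":    [1, 0, 0],
    "A2":    [0, 1, 0],
    "B":     [0, 0, 1],
}

def bottom_up(bottom_forecasts):
    """Aggregate bottom-level forecasts to every level of the hierarchy: y = S b."""
    b = [bottom_forecasts[name] for name in BOTTOM]
    return {node: sum(w * v for w, v in zip(row, b)) for node, row in S.items()}

forecasts = bottom_up({"A1": 2.0, "A2": 3.0, "B": 5.0})
```

Because every higher-level value is computed as a sum of bottom-level forecasts, the result is coherent by construction. Reconciliation methods such as MinT instead start from forecasts at all levels and adjust them so that the same constraint holds.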
-
Early and Late Buzzards: Comparing Different Approaches for Quantile-based Multiple Testing in Heavy-Tailed Wildlife Research Data
Authors:
Marléne Baumeister,
Merle Munko,
Kai-Philipp Gladow,
Marc Ditzhaus,
Nayden Chakarov,
Markus Pauly
Abstract:
In medical, ecological and psychological research, there is a need for methods to handle multiple testing, for example to consider group comparisons with more than two groups. Typical approaches that deal with multiple testing are mean- or variance-based, which can be less effective in the context of heavy-tailed and skewed data. Here, the median is the preferred measure of location and the interquartile range (IQR) is an adequate alternative to the variance. Therefore, it may be fruitful to formulate research questions of interest in terms of the median or the IQR. For this reason, we compare different inference approaches for two-sided and non-inferiority hypotheses formulated in terms of medians or IQRs in an extensive simulation study. We consider multiple contrast testing procedures combined with a bootstrap method as well as testing procedures with Bonferroni correction. As an example of a multiple testing problem based on heavy-tailed data, we analyse an ecological trait variation in early and late breeding in a medium-sized bird of prey.
Submitted 28 April, 2025; v1 submitted 23 September, 2024;
originally announced September 2024.
-
TREE: Tree Regularization for Efficient Execution
Authors:
Lena Schmid,
Daniel Biebert,
Christian Hakert,
Kuan-Hsun Chen,
Michel Lang,
Markus Pauly,
Jian-Jia Chen
Abstract:
The rise of machine learning methods on heavily resource constrained devices requires not only the choice of a suitable model architecture for the target platform, but also the optimization of the chosen model with regard to execution time consumption for inference in order to optimally utilize the available resources. Random forests and decision trees are shown to be a suitable model for such a scenario, since they are not only heavily tunable towards the total model size, but also offer a high potential for optimizing their executions according to the underlying memory architecture.
In addition to the straightforward strategy of enforcing shorter paths through decision trees and hence reducing the execution time for inference, hardware-aware implementations can optimize the execution time in an orthogonal manner. One particular hardware-aware optimization is to lay out the memory of decision trees in such a way that higher-probability paths are less likely to be evicted from system caches. This works particularly well when splits within tree nodes are uneven and have a high probability to visit one of the child nodes.
In this paper, we present a method to reduce path lengths by rewarding uneven probability distributions during the training of decision trees at the cost of a minimal accuracy degradation. Specifically, we regularize the impurity computation of the CART algorithm in order to favor not only low impurity, but also highly asymmetric distributions for the evaluation of split criteria and hence offer a high optimization potential for a memory architecture-aware implementation.
We show that especially for binary classification data sets and data sets with many samples, this form of regularization can lead to a reduction of up to approximately four times in the execution time with a minimal accuracy degradation.
Submitted 18 June, 2024;
originally announced June 2024.
-
Multiple Comparison Procedures for Simultaneous Inference in Functional MANOVA
Authors:
Merle Munko,
Marc Ditzhaus,
Markus Pauly,
Łukasz Smaga
Abstract:
Functional data analysis is becoming increasingly popular to study data from real-valued random functions. Nevertheless, there is a lack of multiple testing procedures for such data. These are particularly important in factorial designs to compare different groups or to infer factor effects. We propose a new class of testing procedures for arbitrary linear hypotheses in general factorial designs with functional data. Our methods allow global as well as multiple inference of both univariate and multivariate mean functions without assuming particular error distributions or homoscedasticity. That is, we allow for different structures of the covariance functions between groups. To this end, we use point-wise quadratic-form-type test functions that take potential heteroscedasticity into account. Taking the supremum over each test function, we define a class of local test statistics. We analyse their (joint) asymptotic behaviour and propose a resampling approach to approximate the limit distributions. The resulting global and multiple testing procedures are asymptotically valid under weak conditions and applicable in general functional MANOVA settings. We evaluate their small-sample performance in extensive simulations and finally illustrate their applicability by analysing a multivariate functional air pollution data set.
Submitted 3 June, 2024;
originally announced June 2024.
-
Human Vs. Machines: Who Wins In Semiconductor Market Forecasting?
Authors:
Louis Steinmeister,
Markus Pauly
Abstract:
"If you ask ten experts, you will get ten different opinions." This common proverb illustrates the common association of expert forecasts with personal bias and lack of consistency. On the other hand, digitization promises consistency and explainability through data-driven forecasts employing machine learning (ML) and statistical models. In the following, we compare such forecasts to expert forecasts from the World Semiconductor Trade Statistics (WSTS), a leading semiconductor market data provider.
Submitted 14 April, 2024;
originally announced April 2024.
-
Even naive trees are consistent
Authors:
Nico Föge,
Markus Pauly,
Lena Schmid,
Marc Ditzhaus
Abstract:
The last decade has shed some light on theoretical properties such as their consistency for regression tasks. In the current paper, we propose a new class of very simple learners based on so-called naive trees. These naive trees partition the feature space completely at random and independent of the data. Although counter-intuitive, we prove these naive trees and ensembles are consistent under fairly general assumptions. However, naive trees appear to be too simple for actual application. We therefore analyze their finite sample properties in a simulation and small benchmark study. We find a slow convergence speed and a rather poor predictive performance. Based on these results, we finally discuss to what extent consistency proofs help to justify the application of complex learning algorithms.
Submitted 17 December, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
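The defining property of the naive trees in the abstract above, a partition chosen completely at random and independently of the data, can be sketched in one dimension. This is a toy illustration only: it assumes a single feature on [0, 1] and a flat set of uniformly drawn cut points, which need not match the construction in the paper, and all function names are hypothetical.

```python
import bisect
import random

def fit_naive_tree(x, y, n_splits, rng):
    """Partition [0, 1] at random, ignoring the data, then average y per cell."""
    cuts = sorted(rng.random() for _ in range(n_splits))
    sums = [0.0] * (n_splits + 1)
    counts = [0] * (n_splits + 1)
    for xi, yi in zip(x, y):
        cell = bisect.bisect_right(cuts, xi)  # index of the cell containing xi
        sums[cell] += yi
        counts[cell] += 1
    overall = sum(y) / len(y)  # fallback prediction for empty cells
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return cuts, means

def predict_naive_tree(tree, xi):
    """Predict with the mean response of the cell that xi falls into."""
    cuts, means = tree
    return means[bisect.bisect_right(cuts, xi)]

def predict_ensemble(trees, xi):
    """Average the predictions of several independently drawn naive trees."""
    return sum(predict_naive_tree(t, xi) for t in trees) / len(trees)

# Synthetic regression data (assumption): y = x^2 plus Gaussian noise.
rng = random.Random(42)
x = [rng.random() for _ in range(1000)]
y = [xi ** 2 + rng.gauss(0.0, 0.1) for xi in x]
trees = [fit_naive_tree(x, y, n_splits=20, rng=rng) for _ in range(25)]
estimate = predict_ensemble(trees, 0.5)
```

Only the cell averages depend on the responses. The cut points never look at `x` or `y`, which is what separates this learner from CART-style trees that choose splits by optimizing an impurity criterion.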
-
Behind the Screen: Investigating ChatGPT's Dark Personality Traits and Conspiracy Beliefs
Authors:
Erik Weber,
Jérôme Rutinowski,
Markus Pauly
Abstract:
ChatGPT is notorious for its non-transparent behavior. This paper tries to shed light on this, providing an in-depth analysis of the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4. Different psychological tests and questionnaires were employed, including the Dark Factor Test, the Mach-IV Scale, the Generic Conspiracy Belief Scale, and the Conspiracy Mentality Scale. The responses were analyzed by computing average scores and standard deviations and by performing significance tests to investigate differences between GPT-3.5 and GPT-4. For traits that have been shown to be interdependent in human studies, correlations were considered. Additionally, system roles corresponding to groups that have shown distinct answering behavior in the corresponding questionnaires were applied to examine the models' ability to reflect characteristics associated with these roles in their responses. Dark personality traits and conspiracy beliefs were not particularly pronounced in either model, with little difference between GPT-3.5 and GPT-4. However, GPT-4 showed a pronounced tendency to believe in information withholding. This is particularly intriguing given that GPT-4 is trained on a significantly larger dataset than GPT-3.5. Apparently, in this case an increased data exposure correlates with a greater belief in the control of information. An assignment of extreme political affiliations increased the belief in conspiracy theories. Test sequencing affected the models' responses and the observed correlations, indicating a form of contextual memory.
Submitted 6 February, 2024;
originally announced February 2024.
-
Adapting tree-based multiple imputation methods for multi-level data? A simulation study
Authors:
Nico Föge,
Jakob Schwerter,
Ketevan Gurtskaia,
Markus Pauly,
Philipp Doebler
Abstract:
When data have a hierarchical structure, such as students nested within classrooms, ignoring dependencies between observations can compromise the validity of imputation procedures. Standard tree-based imputation methods implicitly assume independence between observations, limiting their applicability in multilevel data settings. Although Multivariate Imputation by Chained Equations (MICE) is widely used for hierarchical data, it has limitations, including sensitivity to model specification and computational complexity. Alternative tree-based approaches have shown promise for individual-level data, but remain largely unexplored for hierarchical contexts. In this simulation study, we systematically evaluate the performance of novel tree-based methods--Chained Random Forests and Extreme Gradient Boosting (mixgb)--explicitly adapted for multi-level data by incorporating dummy variables indicating cluster membership. We compare these tree-based methods and their adapted versions with traditional MICE imputation in terms of coefficient estimation bias, type I error rates and statistical power, under different cluster sizes, missingness mechanisms and missingness rates, using both random intercept and random slope data generation models. The results show that MICE provides robust and accurate inference for level 2 variables, especially at low missingness rates. However, the adapted boosting approach (mixgb with cluster dummies) consistently outperforms other methods for Level-1 variables at higher missingness rates (30%, 50%). For level 2 variables, while MICE retains better power at moderate missingness (30%), adapted boosting becomes superior at high missingness (50%), regardless of the missingness mechanism or cluster size. These findings highlight the potential of appropriately adapted tree-based imputation methods as effective alternatives to conventional MICE in multilevel data analyses.
Submitted 19 March, 2025; v1 submitted 25 January, 2024;
originally announced January 2024.
-
Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies
Authors:
Jakob Schwerter,
Ketevan Gurtskaia,
Andrés Romero,
Birgit Zeyer-Gliozzo,
Markus Pauly
Abstract:
Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While the prevailing method of Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered standard in the social science literature, the increase in complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, the performance and validity are not completely understood, particularly compared to the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly with non-true zero coefficients. Our results thus underscore the potential advantages of tree-based imputation methods, albeit with a caveat that all methods perform worse with an increased missingness, particularly missRanger.
Submitted 17 January, 2024;
originally announced January 2024.
-
How to Simulate Realistic Survival Data? A Simulation Study to Compare Realistic Simulation Models
Authors:
Maria Thurow,
Ina Dormuth,
Christina Sauer,
Marc Ditzhaus,
Markus Pauly
Abstract:
In statistics, it is important to have realistic data sets available for a particular context to allow an appropriate and objective method comparison. For many use cases, benchmark data sets for method comparison are already available online. However, in most medical applications and especially for clinical trials in oncology, there is a lack of adequate benchmark data sets, as patient data can be sensitive and therefore cannot be published. Simulation studies are a potential solution. However, it is sometimes not clear which simulation models are suitable for generating realistic data. A challenge is that potentially unrealistic assumptions have to be made about the distributions. Our approach is to use reconstructed benchmark data sets as a basis for the simulations, which has the following advantages: the actual properties are known and more realistic data can be simulated. There are several possibilities to simulate realistic data from benchmark data sets. We investigate simulation models based upon kernel density estimation, fitted distributions, case resampling and conditional bootstrapping. In order to make recommendations on which models are best suited for a specific survival setting, we conducted a comparative simulation study. Since it is not possible to provide recommendations for all possible survival settings in a single paper, we focus on providing realistic simulation models for two-armed phase III lung cancer studies. To this end we reconstructed benchmark data sets from recent studies. We used the runtime and different accuracy measures (effect sizes and p-values) as criteria for comparison.
Submitted 29 May, 2024; v1 submitted 15 August, 2023;
originally announced August 2023.
-
General multiple tests for functional data
Authors:
Merle Munko,
Marc Ditzhaus,
Markus Pauly,
Łukasz Smaga,
Jin-Ting Zhang
Abstract:
While there exist several inferential methods for analyzing functional data in factorial designs, there is a lack of statistical tests that are valid (i) in general designs, (ii) under non-restrictive assumptions on the data generating process and (iii) allow for coherent post-hoc analyses. In particular, most existing methods assume Gaussianity or equal covariance functions across groups (homoscedasticity) and are only applicable for specific study designs that do not allow for evaluation of interactions. Moreover, all available strategies are only designed for testing global hypotheses and do not directly allow a more in-depth analysis of multiple local hypotheses. To address the first two problems (i)-(ii), we propose flexible integral-type test statistics that are applicable in general factorial designs under minimal assumptions on the data generating process. In particular, we neither postulate homoscedasticity nor Gaussianity. To approximate the statistics' null distribution, we adopt a resampling approach and validate it methodologically. Finally, we use our flexible testing framework to (iii) infer several local null hypotheses simultaneously. To allow for powerful data analysis, we thereby take the complex dependencies of the different local test statistics into account. In extensive simulations we confirm that the new methods are flexibly applicable. Two illustrative data analyses complete our study. The new testing procedures are implemented in the R package multiFANOVA, which will be available on CRAN soon.
Submitted 27 June, 2023;
originally announced June 2023.
-
AutoSamp: Autoencoding k-space Sampling via Variational Information Maximization for 3D MRI
Authors:
Cagan Alkan,
Morteza Mardani,
Congyu Liao,
Zhitao Li,
Shreyas S. Vasanawala,
John M. Pauly
Abstract:
Accelerated MRI protocols routinely involve a predefined sampling pattern that undersamples the k-space. Finding an optimal pattern can enhance the reconstruction quality, however this optimization is a challenging task. To address this challenge, we introduce a novel deep learning framework, AutoSamp, based on variational information maximization that enables joint optimization of sampling pattern and reconstruction of MRI scans. We represent the encoder as a non-uniform Fast Fourier Transform that allows continuous optimization of k-space sample locations on a non-Cartesian plane, and the decoder as a deep reconstruction network. Experiments on public 3D acquired MRI datasets show improved reconstruction quality of the proposed AutoSamp method over the prevailing variable density and variable density Poisson disc sampling for both compressed sensing and deep learning reconstructions. We demonstrate that our data-driven sampling optimization method achieves 4.4dB, 2.0dB, 0.75dB, 0.7dB PSNR improvements over reconstruction with Poisson Disc masks for acceleration factors of R = 5, 10, 15, 25, respectively. Prospectively accelerated acquisitions with 3D FSE sequences using our optimized sampling patterns exhibit improved image quality and sharpness. Furthermore, we analyze the characteristics of the learned sampling patterns with respect to changes in acceleration factor, measurement noise, underlying anatomy, and coil sensitivities. We show that all these factors contribute to the optimization result by affecting the sampling density, k-space coverage and point spread functions of the learned sampling patterns.
Submitted 29 August, 2024; v1 submitted 5 June, 2023;
originally announced June 2023.
-
Finding Neurons in a Haystack: Case Studies with Sparse Probing
Authors:
Wes Gurnee,
Neel Nanda,
Matthew Pauly,
Katherine Harvey,
Dmitrii Troitskii,
Dimitris Bertsimas
Abstract:
Despite rapid adoption and deployment of large language models (LLMs), the internal computations of these models remain opaque and poorly understood. In this work, we seek to understand how high-level human-interpretable features are represented within the internal neuron activations of LLMs. We train $k$-sparse linear classifiers (probes) on these internal activations to predict the presence of features in the input; by varying the value of $k$ we study the sparsity of learned representations and how this varies with model scale. With $k=1$, we localize individual neurons which are highly relevant for a particular feature, and perform a number of case studies to illustrate general properties of LLMs. In particular, we show that early layers make use of sparse combinations of neurons to represent many features in superposition, that middle layers have seemingly dedicated neurons to represent higher-level contextual features, and that increasing scale causes representational sparsity to increase on average, but there are multiple types of scaling dynamics. In all, we probe for over 100 unique features comprising 10 different categories in 7 different models spanning 70 million to 6.9 billion parameters.
Submitted 2 June, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
The Self-Perception and Political Biases of ChatGPT
Authors:
Jérôme Rutinowski,
Sven Franke,
Jan Endendyk,
Ina Dormuth,
Markus Pauly
Abstract:
This contribution analyzes the self-perception and political biases of OpenAI's Large Language Model ChatGPT. Taking into account the first small-scale reports and studies that have emerged, claiming that ChatGPT is politically biased towards progressive and libertarian points of view, this contribution aims to provide further clarity on this subject. For this purpose, ChatGPT was asked to answer the questions posed by the political compass test as well as similar questionnaires that are specific to the respective politics of the G7 member states. These eight tests were repeated ten times each and revealed that ChatGPT seems to hold a bias towards progressive views. The political compass test revealed a bias towards progressive and libertarian views, with the average coordinates on the political compass being (-6.48, -5.99) (with (0, 0) the center of the compass, i.e., centrism and the axes ranging from -10 to 10), supporting the claims of prior research. The political questionnaires for the G7 member states indicated a bias towards progressive views but no significant bias between authoritarian and libertarian views, contradicting the findings of prior reports, with the average coordinates being (-3.27, 0.58). In addition, ChatGPT's Big Five personality traits were tested using the OCEAN test and its personality type was queried using the Myers-Briggs Type Indicator (MBTI) test. Finally, the maliciousness of ChatGPT was evaluated using the Dark Factor test. These three tests were also repeated ten times each, revealing that ChatGPT perceives itself as highly open and agreeable, has the Myers-Briggs personality type ENFJ, and is among the 15% of test-takers with the least pronounced dark traits.
Submitted 14 April, 2023;
originally announced April 2023.
-
RODD: Robust Outlier Detection in Data Cubes
Authors:
Lara Kuhlmann,
Daniel Wilmes,
Emmanuel Müller,
Markus Pauly,
Daniel Horn
Abstract:
Data cubes are multidimensional databases, often built from several separate databases, that serve as a flexible basis for data analysis. Surprisingly, outlier detection on data cubes has not yet been treated extensively. In this work, we provide the first framework to evaluate robust outlier detection methods in data cubes (RODD). We introduce a novel random forest-based outlier detection approach (RODD-RF) and compare it with more traditional methods based on robust location estimators. We propose a general type of test data and examine all methods in a simulation study. Moreover, we apply RODD-RF to real-world data. The results show that RODD-RF can lead to improved outlier detection.
Submitted 14 March, 2023;
originally announced March 2023.
-
Comparing statistical and machine learning methods for time series forecasting in data-driven logistics -- A simulation study
Authors:
Lena Schmid,
Moritz Roidl,
Markus Pauly
Abstract:
Many planning and decision activities in logistics and supply chain management are based on forecasts of multiple time dependent factors. Therefore, the quality of planning depends on the quality of the forecasts. We compare various forecasting methods in terms of out of the box forecasting performance on a broad set of simulated time series. We simulate various linear and non-linear time series and look at the one step forecast performance of statistical learning methods.
Submitted 6 June, 2024; v1 submitted 13 March, 2023;
originally announced March 2023.
-
Ultracold plasmas from strongly anti-correlated Rydberg gases in the Kinetic Field Theory formalism
Authors:
Elena Kozlikin,
Robert Lilow,
Martin Pauly,
Alexander Schuckert,
Andre Salzinger,
Matthias Bartelmann,
Matthias Weidemüller
Abstract:
The dynamics of correlated systems is relevant in many fields ranging from cosmology to plasma physics. However, they are challenging to predict and understand even for classical systems due to the typically large numbers of particles involved. Here, we study the evolution of an ultracold, correlated many-body system with repulsive interactions and initial correlations set by the Rydberg blockade using the analytical framework of Kinetic Field Theory (KFT). The KFT formalism is based on the path-integral formulation for classical mechanics and was first developed and successfully used in cosmology to describe structure formation in Dark Matter. The theoretical framework offers a high flexibility regarding the initial configuration and interactions between particles and, in addition, is computationally cheap. More importantly, the analytic approach allows us to gain better insight into the processes which dominate the dynamics. In this work we show that KFT can be applied in a much more general context and study the evolution of a correlated ion plasma. We find good agreement between the analytical KFT results for the evolution of the correlation function and results obtained from numerical simulations. We use the correlation functions obtained with KFT to compute the temperature increase in the ionic system due to disorder-induced heating. For certain choices of parameters we observe that the effect can be reversed, leading to correlation cooling. Due to its numerical efficiency as compared to numerical simulations, a detailed study using KFT can help to constrain parameter spaces where disorder-induced heating is minimal in order to reach the regime of strong coupling.
Submitted 3 February, 2023;
originally announced February 2023.
-
Dataset Bias in Human Activity Recognition
Authors:
Nilah Ravi Nair,
Lena Schmid,
Fernando Moya Rueda,
Markus Pauly,
Gernot A. Fink,
Christopher Reining
Abstract:
When creating multi-channel time-series datasets for Human Activity Recognition (HAR), researchers are faced with the issue of subject selection criteria. It is unknown what physical characteristics and/or soft-biometrics, such as age, height, and weight, need to be taken into account to train a classifier to achieve robustness towards heterogeneous populations in the training and testing data. This contribution statistically curates the training data to assess to what degree the physical characteristics of humans influence HAR performance. We evaluate the performance of a state-of-the-art convolutional neural network on two HAR datasets that vary in the sensors, activities, and recording for time-series HAR. The training data is intentionally biased with respect to human characteristics to determine the features that impact motion behaviour. The evaluations brought forth the impact of the subjects' characteristics on HAR, thus providing insights regarding the robustness of the classifier with respect to heterogeneous populations. The study is a step forward in the direction of fair and trustworthy artificial intelligence by attempting to quantify representation bias in multi-channel time series HAR data.
Submitted 19 January, 2023;
originally announced January 2023.
-
The impact of neglected confounding and interactions in mixed-effects meta-regression
Authors:
Eric S. Knop,
Markus Pauly,
Tim Friede,
Thilo Welz
Abstract:
Analysts seldom include interaction terms in meta-regression models, which can introduce bias if an interaction is present. We illustrate this in the current paper by re-analyzing an example from research on acute heart failure, where neglecting an interaction might have led to erroneous inference and conclusions. Moreover, we perform a brief simulation study based on this example highlighting the effects caused by omitting or unnecessarily including interaction terms. Based on our results, we recommend always including interaction terms in mixed-effects meta-regression models when such interactions are plausible.
Submitted 9 January, 2023;
originally announced January 2023.
-
Using meta-analytic priors to incorporate external information for study evaluation
Authors:
Thilo Welz,
Eric Knop,
Frank Konietschke,
Jan-Hendrik B. Hardenberg,
Markus Pauly,
Christian Röver
Abstract:
Background: The COVID-19 pandemic has had a profound impact on health, everyday life and economics around the world. An important complication that can arise in connection with a COVID-19 infection is acute kidney injury. A recent observational cohort study of COVID-19 patients treated at multiple sites of a tertiary care center in Berlin, Germany identified risk factors for the development of (severe) acute kidney injury. Since inferring results from a single study can be tricky, we validate these findings and potentially adjust results by including external information from other studies on acute kidney injury and COVID-19.
Methods: We synthesize the results of the main study with other trials via a Bayesian meta-analysis. The external information is used to construct a predictive distribution and to derive posterior estimates for the study of interest. We focus on various important potential risk factors for acute kidney injury development such as mechanical ventilation, use of vasopressors, hypertension, obesity, diabetes, gender and smoking.
Results: Our results show that depending on the degree of heterogeneity in the data the estimated effect sizes may be refined considerably with inclusion of external data. Our findings confirm that mechanical ventilation and use of vasopressors are important risk factors for the development of acute kidney injury in COVID-19 patients. Hypertension also appears to be a risk factor that should not be ignored. Shrinkage weights depended to a large extent on the estimated heterogeneity in the model.
Conclusions: Our work shows how external information can be used to adjust the results from a primary study, using a Bayesian meta-analytic approach. How much information is borrowed from external studies will depend on the degree of heterogeneity present in the model.
Submitted 2 December, 2022;
originally announced December 2022.
-
Quantile-based MANOVA: A new tool for inferring multivariate data in factorial designs
Authors:
Marléne Baumeister,
Marc Ditzhaus,
Markus Pauly
Abstract:
Multivariate analysis-of-variance (MANOVA) is a well-established tool to examine multivariate endpoints. While classical approaches depend on restrictive assumptions like normality and homogeneity, there is a recent trend toward more general and flexible procedures. In this paper, we proceed on this path, but do not follow the typical mean-focused perspective. Instead we consider general quantiles, in particular the median, for a more robust multivariate analysis. The resulting methodology is applicable for all kinds of factorial designs and shown to be asymptotically valid. Our theoretical results are complemented by an extensive simulation study for small and moderate sample sizes. An illustrative data analysis is also presented.
Submitted 28 November, 2022;
originally announced November 2022.
-
Automated MRI Field of View Prescription from Region of Interest Prediction by Intra-stack Attention Neural Network
Authors:
Ke Lei,
Ali B. Syed,
Xucheng Zhu,
John M. Pauly,
Shreyas S. Vasanawala
Abstract:
Manual prescription of the field of view (FOV) by MRI technologists is variable and prolongs the scanning process. Often, the FOV is too large or crops critical anatomy. We propose a deep-learning framework, trained by radiologists' supervision, for automating FOV prescription. An intra-stack shared feature extraction network and an attention network are used to process a stack of 2D image inputs to generate output scalars defining the location of a rectangular region of interest (ROI). The attention mechanism is used to make the model focus on the small number of informative slices in a stack. Then the smallest FOV that makes the neural network predicted ROI free of aliasing is calculated by an algebraic operation derived from MR sampling theory. We retrospectively collected 595 cases between February 2018 and February 2022. The framework's performance is examined quantitatively with intersection over union (IoU) and pixel error on position, and qualitatively with a reader study. We use the t-test for comparing quantitative results from all models and a radiologist. The proposed model achieves an average IoU of 0.867 and average ROI position error of 9.06 out of 512 pixels on 80 test cases, significantly better (P<0.05) than two baseline models and not significantly different from a radiologist (P>0.12). Finally, the FOV given by the proposed framework achieves an acceptance rate of 92% from an experienced radiologist.
Submitted 9 November, 2022;
originally announced November 2022.
-
Learning Causal Graphs in Manufacturing Domains using Structural Equation Models
Authors:
Maximilian Kertel,
Stefan Harmeling,
Markus Pauly
Abstract:
Many production processes are characterized by numerous and complex cause-and-effect relationships. Since they are only partially known, they pose a challenge to effective process control. In this work we present how Structural Equation Models can be used for deriving cause-and-effect relationships from the combination of prior knowledge and process data in the manufacturing domain. Compared to existing applications, we do not assume linear relationships, leading to more informative results.
Submitted 26 October, 2022;
originally announced October 2022.
-
A comparative study to alternatives to the log-rank test
Authors:
Ina Dormuth,
Tiantian Liu,
Jin Xu,
Markus Pauly,
Marc Ditzhaus
Abstract:
Studies to compare the survival of two or more groups using time-to-event data are of high importance in medical research. The gold standard is the log-rank test, which is optimal under proportional hazards. As the latter is no simple regularity assumption, we are interested in evaluating the power of various statistical tests under different settings including proportional and non-proportional hazards with a special emphasis on crossing hazards. This challenge has been going on for many years now and multiple methods have already been investigated in extensive simulation studies. However, in recent years new omnibus tests and methods based on the restricted mean survival time appeared that have been strongly recommended in biometric literature. Thus, to give updated recommendations, we perform a vast simulation study to compare tests that showed high power in previous studies with these more recent approaches. We thereby analyze various simulation settings with varying survival and censoring distributions, unequal censoring between groups, small sample sizes and unbalanced group sizes. Overall, omnibus tests are more robust in terms of power against deviations from the proportional hazards assumption.
Submitted 24 October, 2022;
originally announced October 2022.
-
Testing Hypotheses about Correlation Matrices in General MANOVA Designs
Authors:
Paavo Sattler,
Markus Pauly
Abstract:
Correlation matrices are an essential tool for investigating the dependency structures of random vectors or comparing them. We introduce an approach for testing a variety of null hypotheses that can be formulated based upon the correlation matrix. Examples cover MANOVA-type hypothesis of equal correlation matrices as well as testing for special correlation structures such as, e.g., sphericity. Apa…
▽ More
Correlation matrices are an essential tool for investigating the dependency structures of random vectors or comparing them. We introduce an approach for testing a variety of null hypotheses that can be formulated based upon the correlation matrix. Examples cover MANOVA-type hypothesis of equal correlation matrices as well as testing for special correlation structures such as, e.g., sphericity. Apart from existing fourth moments, our approach requires no other assumptions, allowing applications in various settings. To improve the small sample performance, a bootstrap technique is proposed and theoretically justified. Based on this, we also present a procedure to simultaneously test the hypotheses of equal correlation and equal covariance matrices. The performance of all new test statistics is compared with existing procedures through extensive simulations.
Submitted 11 July, 2023; v1 submitted 9 September, 2022;
originally announced September 2022.
-
The nonparametric Behrens-Fisher problem in small samples
Authors:
Claus P. Nowak,
Markus Pauly,
Edgar Brunner
Abstract:
While there appears to be a general consensus in the literature on the definition of the estimand and estimator associated with the Wilcoxon-Mann-Whitney test, it seems somewhat less clear as to how best to estimate the variance. In addition to the Wilcoxon-Mann-Whitney test, we review different proposals of variance estimators consistent under both the null hypothesis and the alternative. Moreover, in the case of small sample sizes, an approximation of the distribution of the test statistic based on the t-distribution, a logit transformation and a permutation approach have been proposed. Focussing also on different estimators of the degrees of freedom for the t-approximation, we carried out simulations for a range of scenarios, with results indicating that the performance of different variance estimators in terms of controlling the type I error rate largely depends on the heteroskedasticity pattern and the sample size allocation ratio, not on the specific type of distributions employed. By and large, a particular t-approximation together with Perme and Manevski's variance estimator best maintains the nominal significance level.
Submitted 1 August, 2022;
originally announced August 2022.
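As an illustrative aside to the abstract above: the estimand commonly associated with the Wilcoxon-Mann-Whitney test in the nonparametric Behrens-Fisher setting is the relative effect p = P(X < Y) + 0.5 P(X = Y). The following is a minimal sketch of its plug-in estimator only. The function name `relative_effect` is hypothetical, and none of the variance estimators or small-sample approximations reviewed in the paper are reproduced here.

```python
from itertools import product


def relative_effect(x, y):
    """Plug-in estimate of p = P(X < Y) + 0.5 * P(X = Y).

    Every pair (x_i, y_j) contributes 1 if x_i < y_j, 0.5 if tied, 0 otherwise;
    the estimate is the average contribution over all pairs.
    """
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    score = sum(
        1.0 if xi < yj else 0.5 if xi == yj else 0.0
        for xi, yj in product(x, y)
    )
    return score / (len(x) * len(y))
```

A value of 0.5 means neither sample tends to produce larger observations than the other; values near 1 indicate that the second sample tends to be larger.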
-
Inference for high-dimensional split-plot designs with different dimensions between groups
Authors:
Paavo Sattler,
Markus Pauly
Abstract:
In repeated measures designs with multiple groups, the primary purpose is to compare different groups in various aspects. For several reasons, the number of measurements, and therefore the dimension of the observation vectors, can depend on the group, making the use of existing approaches impossible. We develop an approach which can be used not only for a possibly increasing number of groups $a$, but also for group-dependent dimensions $d_i$, which are allowed to go to infinity. This is a unique high-dimensional asymptotic framework that covers a wide variety of settings and does without the usual conditions on the relation between sample size and dimension. It especially includes settings with fixed dimensions in some groups and increasing dimensions in others, which can be seen as semi-high-dimensional. To find an appropriate test statistic, new and innovative estimators are developed, which can be used under these diverse settings on $a$, $d_i$ and $n_i$ without any adjustments. We investigated the asymptotic distribution of a quadratic-form-based test statistic and developed an asymptotically correct test. Finally, an extensive simulation study is conducted to investigate the role of the single group's dimension.
Submitted 19 July, 2022;
originally announced July 2022.
-
GLEAM: Greedy Learning for Large-Scale Accelerated MRI Reconstruction
Authors:
Batu Ozturkler,
Arda Sahiner,
Tolga Ergen,
Arjun D Desai,
Christopher M Sandino,
Shreyas Vasanawala,
John M Pauly,
Morteza Mardani,
Mert Pilanci
Abstract:
Unrolled neural networks have recently achieved state-of-the-art accelerated MRI reconstruction. These networks unroll iterative optimization algorithms by alternating between physics-based consistency and neural-network-based regularization. However, they require several iterations of a large neural network to handle high-dimensional imaging tasks such as 3D MRI. This limits traditional training algorithms based on backpropagation due to prohibitively large memory and compute requirements for calculating gradients and storing intermediate activations. To address this challenge, we propose Greedy LEarning for Accelerated MRI (GLEAM) reconstruction, an efficient training strategy for high-dimensional imaging settings. GLEAM splits the end-to-end network into decoupled network modules. Each module is optimized in a greedy manner with decoupled gradient updates, reducing the memory footprint during training. We show that the decoupled gradient updates can be performed in parallel on multiple graphics processing units (GPUs) to further reduce training time. We present experiments with 2D and 3D datasets including multi-coil knee, brain, and dynamic cardiac cine MRI. We observe that: i) GLEAM generalizes as well as state-of-the-art memory-efficient baselines such as gradient checkpointing and invertible networks with the same memory footprint, but with 1.3x faster training; ii) for the same memory footprint, GLEAM yields 1.1 dB PSNR gain in 2D and 1.8 dB in 3D over end-to-end baselines.
Submitted 18 July, 2022;
originally announced July 2022.
-
Cluster-Robust Estimators for Bivariate Mixed-Effects Meta-Regression
Authors:
Thilo Welz,
Wolfgang Viechtbauer,
Markus Pauly
Abstract:
Meta-analyses frequently include trials that report multiple effect sizes based on a common set of study participants. These effect sizes will generally be correlated. Cluster-robust variance-covariance estimators are a fruitful approach for synthesizing dependent effects. However, when the number of studies is small, state-of-the-art robust estimators can yield inflated Type 1 errors. We present two new cluster-robust estimators in order to improve small sample performance. For both new estimators, the idea is to transform the estimated variances of the residuals using only the diagonal entries of the hat matrix. Our proposals are asymptotically equivalent to previously suggested cluster-robust estimators such as the bias-reduced linearization approach. We apply the methods to real-world data and compare and contrast their performance in an extensive simulation study. We focus on bivariate meta-regression, although the approaches can be applied more generally.
Submitted 4 March, 2022;
originally announced March 2022.
-
Estimating Gaussian Copulas with Missing Data
Authors:
Maximilian Kertel,
Markus Pauly
Abstract:
In this work we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modelling. The joint distribution learned through this algorithm is considerably closer to the underlying distribution than that obtained by existing methods.
Submitted 14 January, 2022;
originally announced January 2022.
-
Robust Confidence Intervals for Meta-Regression with Interaction Effects
Authors:
Thilo Welz,
Eric S. Knop,
Tim Friede,
Markus Pauly
Abstract:
Meta-analysis is an important statistical technique for synthesizing the results of multiple studies regarding the same or closely related research questions. So-called meta-regression extends meta-analysis models by accounting for study-level covariates. Mixed-effects meta-regression models provide a powerful tool for evidence synthesis by appropriately accounting for between-study heterogeneity. In fact, modelling the study effect in terms of random effects and moderators not only allows one to examine the impact of the moderators, but often leads to more accurate estimates of the involved parameters. Nevertheless, due to the often small number of studies on a specific research topic, interactions are often neglected in meta-regression. In this work, we consider the research questions (i) how moderator interactions influence inference in mixed-effects meta-regression models and (ii) whether some inference methods are more reliable than others. Here, we review robust methods for confidence intervals in meta-regression models including interaction effects. These methods are based on the application of robust sandwich estimators for estimating the variance-covariance matrix of the vector of model coefficients. Furthermore, we compare different versions of these robust estimators in an extensive simulation study. We thereby investigate coverage and length of seven different confidence intervals under varying conditions. We conclude with some practical recommendations.
Submitted 21 February, 2023; v1 submitted 14 January, 2022;
originally announced January 2022.
-
Machine Learning for Multi-Output Regression: When should a holistic multivariate approach be preferred over separate univariate ones?
Authors:
Lena Schmid,
Alexander Gerharz,
Andreas Groll,
Markus Pauly
Abstract:
Tree-based ensembles such as the Random Forest are modern classics among statistical learning methods. In particular, they are used for predicting univariate responses. In the case of multiple outputs, the question arises whether to fit separate univariate models or to directly follow a multivariate approach. For the latter, several possibilities exist that are, e.g., based on modified splitting or stopping rules for multi-output regression. In this work we compare these methods in extensive simulations to help answer the primary question of when to use multivariate ensemble techniques.
Submitted 14 January, 2022;
originally announced January 2022.
-
Using Sequential Statistical Tests for Efficient Hyperparameter Tuning
Authors:
Philip Buczak,
Andreas Groll,
Markus Pauly,
Jakob Rehof,
Daniel Horn
Abstract:
Hyperparameter tuning is one of the most time-consuming parts in machine learning. Despite the existence of modern optimization algorithms that minimize the number of evaluations needed, evaluations of a single setting may still be expensive. Usually a resampling technique is used, where the machine learning method has to be fitted a fixed number of k times on different training datasets. The respective mean performance of the k fits is then used as performance estimator. Many hyperparameter settings could be discarded after fewer than k resampling iterations if they are clearly inferior to high-performing settings. However, resampling is often performed until the very end, wasting a lot of computational effort. To address this, we propose the Sequential Random Search (SQRS), which extends the regular random search algorithm by a sequential testing procedure aimed at detecting and eliminating inferior parameter configurations early. We compared our SQRS with regular random search using multiple publicly available regression and classification datasets. Our simulation study showed that the SQRS is able to find similarly well-performing parameter settings while requiring noticeably fewer evaluations. Our results underscore the potential for integrating sequential tests into hyperparameter tuning.
Submitted 28 November, 2022; v1 submitted 23 December, 2021;
originally announced December 2021.
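The idea in the abstract above, abandoning a configuration before all k resampling iterations are spent, can be illustrated with a rough sketch. This is not the SQRS procedure itself: the abstract does not specify the sequential test, so the stopping rule below (running mean loss exceeds the best completed mean by a fixed `margin`) is a simple stand-in, and all names and parameters (`sequential_random_search`, `evaluate`, `sample_config`, `min_folds`, `margin`) are hypothetical.

```python
import random


def sequential_random_search(evaluate, sample_config, n_configs=20, k=10,
                             min_folds=3, margin=0.1, seed=0):
    """Random search that may stop evaluating a configuration early.

    evaluate(config, fold) returns a loss (lower is better);
    sample_config(rng) draws one random configuration.
    Returns (best_config, best_mean_loss, total_evaluations).
    """
    rng = random.Random(seed)
    best_config, best_mean = None, float("inf")
    total_evals = 0
    for _ in range(n_configs):
        config = sample_config(rng)
        losses = []
        for fold in range(k):
            losses.append(evaluate(config, fold))
            total_evals += 1
            running_mean = sum(losses) / len(losses)
            # Stand-in stopping rule (an assumption, not the paper's test):
            # discard once the running mean is clearly worse than the best.
            if len(losses) >= min_folds and running_mean > best_mean + margin:
                break
        else:
            # All k folds were evaluated, so this configuration may become the best.
            mean_loss = sum(losses) / k
            if mean_loss < best_mean:
                best_config, best_mean = config, mean_loss
    return best_config, best_mean, total_evals
```

Plain random search would always spend `n_configs * k` evaluations; in this sketch, clearly inferior configurations cost as few as `min_folds` evaluations each.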
-
On the Relation between Prediction and Imputation Accuracy under Missing Covariates
Authors:
Burim Ramosaj,
Justus Tulowietzki,
Markus Pauly
Abstract:
Missing covariates in regression or classification problems can prohibit the direct use of advanced tools for further analysis. Recent research has seen an increasing trend towards the usage of modern Machine Learning algorithms for imputation. This trend originates from their capability of showing favourable prediction accuracy in different learning problems. In this work, we analyze through simulation the interaction between imputation accuracy and prediction accuracy in regression learning problems with missing covariates when Machine Learning based methods are used for both imputation and prediction. In addition, we explore imputation performance when using statistical inference procedures in prediction settings, such as coverage rates of (valid) prediction intervals. Our analysis is based on empirical datasets provided by the UCI Machine Learning repository and an extensive simulation study.
Submitted 9 December, 2021;
originally announced December 2021.
-
Artifact- and content-specific quality assessment for MRI with image rulers
Authors:
Ke Lei,
John M. Pauly,
Shreyas S. Vasanawala
Abstract:
In clinical practice, MR images are often first seen by radiologists long after the scan. If image quality is inadequate, either patients have to return for an additional scan, or a suboptimal interpretation is rendered. An automatic image quality assessment (IQA) would enable real-time remediation. Existing IQA works for MRI give only a general quality score, agnostic to the cause of and solution to low-quality scans. Furthermore, radiologists' image quality requirements vary with the scan type and diagnostic task. Therefore, the same score may have different implications for different scans. We propose a framework with a multi-task CNN model trained with calibrated labels and applied at inference with image rulers. Labels calibrated by human inputs follow a well-defined and efficient labeling task. Image rulers address varying quality standards and provide a concrete way of interpreting raw scores from the CNN. The model supports assessments of two of the most common artifacts in MRI: noise and motion. It achieves accuracies of around 90%, 6% better than the best previous method examined, and 3% better than human experts on noise assessment. Our experiments show that label calibration, image rulers, and multi-task training improve the model's performance and generalizability.
Submitted 5 November, 2021;
originally announced November 2021.
-
VORTEX: Physics-Driven Data Augmentations Using Consistency Training for Robust Accelerated MRI Reconstruction
Authors:
Arjun D Desai,
Beliz Gunel,
Batu M Ozturkler,
Harris Beg,
Shreyas Vasanawala,
Brian A Hargreaves,
Christopher Ré,
John M Pauly,
Akshay S Chaudhari
Abstract:
Deep neural networks have enabled improved image quality and fast inference times for various inverse problems, including accelerated magnetic resonance imaging (MRI) reconstruction. However, such models require a large number of fully-sampled ground truth datasets, which are difficult to curate, and are sensitive to distribution drifts. In this work, we propose applying physics-driven data augmentations for consistency training that leverage our domain knowledge of the forward MRI data acquisition process and MRI physics to achieve improved label efficiency and robustness to clinically-relevant distribution drifts. Our approach, termed VORTEX, (1) demonstrates strong improvements over supervised baselines with and without data augmentation in robustness to signal-to-noise ratio change and motion corruption in data-limited regimes; (2) considerably outperforms state-of-the-art purely image-based data augmentation techniques and self-supervised reconstruction methods on both in-distribution and out-of-distribution data; and (3) enables composing heterogeneous image-based and physics-driven data augmentations. Our code is available at https://github.com/ad12/meddlr.
Submitted 17 June, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
Noise2Recon: Enabling Joint MRI Reconstruction and Denoising with Semi-Supervised and Self-Supervised Learning
Authors:
Arjun D Desai,
Batu M Ozturkler,
Christopher M Sandino,
Robert Boutin,
Marc Willis,
Shreyas Vasanawala,
Brian A Hargreaves,
Christopher M Ré,
John M Pauly,
Akshay S Chaudhari
Abstract:
Deep learning (DL) has shown promise for faster, high quality accelerated MRI reconstruction. However, supervised DL methods depend on extensive amounts of fully-sampled (labeled) data and are sensitive to out-of-distribution (OOD) shifts, particularly low signal-to-noise ratio (SNR) acquisitions. To alleviate this challenge, we propose Noise2Recon, a model-agnostic, consistency training method for joint MRI reconstruction and denoising that can use both fully-sampled (labeled) and undersampled (unlabeled) scans in semi-supervised and self-supervised settings. With limited or no labeled training data, Noise2Recon outperforms compressed sensing and deep learning baselines, including supervised networks, augmentation-based training, fine-tuned denoisers, and self-supervised methods, and matches performance of supervised models, which were trained with 14x more fully-sampled scans. Noise2Recon also outperforms all baselines, including state-of-the-art fine-tuning and augmentation techniques, among low-SNR scans and when generalizing to other OOD factors, such as changes in acceleration factors and different datasets. Augmentation extent and loss weighting hyperparameters had negligible impact on Noise2Recon compared to supervised methods, which may indicate increased training stability. Our code is available at https://github.com/ad12/meddlr.
Submitted 7 October, 2022; v1 submitted 30 September, 2021;
originally announced October 2021.
-
On the role of data, statistics and decisions in a pandemic
Authors:
Beate Jahn,
Sarah Friedrich,
Joachim Behnke,
Joachim Engel,
Ursula Garczarek,
Ralf Münnich,
Markus Pauly,
Adalbert Wilhelm,
Olaf Wolkenhauer,
Markus Zwick,
Uwe Siebert,
Tim Friede
Abstract:
A pandemic poses particular challenges to decision-making because of the need to continuously adapt decisions to rapidly changing evidence and available data. For example, which countermeasures are appropriate at a particular stage of the pandemic? How can the severity of the pandemic be measured? What is the effect of vaccination in the population and which groups should be vaccinated first? The process of decision-making starts with data collection and modeling and continues to the dissemination of results and the subsequent decisions taken. The goal of this paper is to give an overview of this process and to provide recommendations for the different steps from a statistical perspective. In particular, we discuss a range of modeling techniques including mathematical, statistical and decision-analytic models along with their applications in the COVID-19 context. With this overview, we aim to foster the understanding of the goals of these modeling approaches and the specific data requirements that are essential for the interpretation of results and for successful interdisciplinary collaborations. A special focus is on the role played by data in these different models, and we incorporate into the discussion the importance of statistical literacy, and of effective dissemination and communication of findings.
Submitted 8 March, 2022; v1 submitted 6 August, 2021;
originally announced August 2021.
-
Towards a Higgs mass determination in asymptotically safe gravity with a dark portal
Authors:
Astrid Eichhorn,
Martin Pauly,
Shouryya Ray
Abstract:
There are indications that an asymptotically safe UV completion of the Standard Model with gravity could constrain the Higgs self-coupling, resulting in a prediction of the Higgs mass close to the vacuum stability bound in the Standard Model. The predicted value depends on the top quark mass and comes out somewhat higher than the experimental value if the current central value for the top quark mass is assumed. Beyond the Standard Model, the predicted value also depends on dark fields coupled through a Higgs portal. Here we study the Higgs self-coupling in a toy model of the Standard Model with quantum gravity that we extend by a dark scalar and fermion. Within the approximations used in arXiv:2005.03661, there is a single free parameter in the asymptotically safe dark sector, as a function of which the predicted (toy model) Higgs mass can be lowered due to mixing effects if the dark sector undergoes spontaneous symmetry breaking.
Submitted 16 July, 2021;
originally announced July 2021.
-
A sprinkling of hybrid-signature discrete spacetimes in real-world networks
Authors:
Astrid Eichhorn,
Martin Pauly
Abstract:
Many real-world networks are embedded into a space or spacetime. The embedding space(time) constrains the properties of these real-world networks. We use the scale-dependent spectral dimension as a tool to probe whether real-world networks encode information on the dimensionality of the embedding space. We find that spacetime networks which are inspired by quantum gravity and based on a hybrid signature, following the Minkowski metric at small spatial distance and the Euclidean metric at large spatial distance, provide a template relevant for real-world networks of small-world type, including a representation of the internet's architecture and biological neural networks.
Submitted 15 July, 2021;
originally announced July 2021.
-
Least Squares Optimal Density Compensation for the Gridding Non-uniform Discrete Fourier Transform
Authors:
Nicholas Dwork,
Daniel O'Connor,
Ethan M. I. Johnson,
Corey A. Baron,
Jeremy W. Gordon,
John M. Pauly,
Peder E. Z. Larson
Abstract:
The Gridding algorithm has shown great utility for reconstructing images from non-uniformly spaced samples in the Fourier domain in several imaging modalities. Due to the non-uniform spacing, some correction for the variable density of the samples must be made. Existing methods for generating density compensation values are either sub-optimal or only consider a finite set of points (a set of measure 0) in the optimization. This manuscript presents the first density compensation algorithm for a general trajectory that takes into account the point spread function over a set of non-zero measure. We show that the images reconstructed with Gridding using the density compensation values of this method are of superior quality when compared to density compensation weights determined in other ways. Results are shown with a numerical phantom and with magnetic resonance images of the abdomen and the knee.
Submitted 16 June, 2021; v1 submitted 11 June, 2021;
originally announced June 2021.