-
Bayesian Nonparametric Sensitivity Analysis of Multiple Test Procedures Under Dependence
Authors:
George Karabatsos
Abstract:
This article introduces a sensitivity analysis method for Multiple Testing Procedures (MTPs) using marginal $p$-values. The method is based on the Dirichlet process (DP) prior distribution, specified to support the entire space of MTPs, where each MTP controls either the family-wise error rate (FWER) or the false discovery rate (FDR) under arbitrary dependence between $p$-values. The DP MTP sensitivity analysis method accounts for uncertainty in the selection of such MTPs and their respective cut-off points and decisions regarding which subset of $p$-values are significant from a given set of hypotheses tested, while measuring each $p$-value's probability of significance over the DP prior predictive distribution of this space of all MTPs, and reducing the possible conservativeness of using only one such MTP for multiple testing. The DP MTP sensitivity analysis method is illustrated through the analysis of twenty-eight thousand $p$-values arising from hypothesis tests performed on a 2022 dataset of a representative sample of three million U.S. high school students observed on 239 variables. They include tests that relate variables about the disruption caused by school closures during the COVID-19 pandemic, with variables on mathematical cognition and academic achievement, and with student background variables. R software code for the DP MTP sensitivity analysis method is provided in the Appendix and in Supplementary Information.
Submitted 10 May, 2025; v1 submitted 10 October, 2024;
originally announced October 2024.
-
Copula Approximate Bayesian Computation Using Distribution Random Forests
Authors:
George Karabatsos
Abstract:
This invited feature article introduces and provides an extensive simulation study of a new Approximate Bayesian Computation (ABC) framework for estimating the posterior distribution and the maximum likelihood estimate (MLE) of the parameters of models defined by intractable likelihoods, which unifies and extends previous ABC methods. This framework, copulaABcdrf, aims to accurately estimate and describe the possibly skewed and high dimensional posterior distribution by a novel multivariate copula-based meta-\textit{t} distribution, based on univariate marginal posterior distributions which can be accurately estimated by Distribution Random Forests (drf), while performing automatic summary statistics (covariates) selection, and robust estimation of copula dependence parameters. The copulaABcdrf framework also provides a novel multivariate mode estimator to perform MLE and posterior mode estimation, and an optional step to perform model selection from a given set of models using posterior probabilities estimated by drf. The posterior distribution estimation accuracy of copulaABcdrf is illustrated and compared to standard ABC methods, through several simulation studies involving low- and high-dimensional models with computable posterior distributions, which are either unimodal, skewed, or multimodal; and exponential random graph and mechanistic network models, each defined by an intractable likelihood from which it is costly to simulate large network datasets. We also study a new solution to the simulation cost problem in ABC. The copulaABcdrf framework and standard ABC methods are further illustrated through analyses of large real-life networks. The results of the simulation and empirical studies, and their implications for future research, are summarized. Keywords: Bayesian analysis, Maximum Likelihood, Intractable likelihood.
Submitted 11 September, 2024; v1 submitted 28 February, 2024;
originally announced February 2024.
-
Dirichlet Process Mixture Models with Shrinkage Prior
Authors:
Dawei Ding,
George Karabatsos
Abstract:
We propose Dirichlet Process Mixture (DPM) models for prediction and cluster-wise variable selection, based on two choices of shrinkage baseline prior distributions for the linear regression coefficients, namely the Horseshoe prior and Normal-Gamma prior. We show in a simulation study that each of the two proposed DPM models tends to outperform the standard DPM model based on the non-shrinkage normal prior, in terms of predictive, variable selection, and clustering accuracy. This is especially true for the Horseshoe model, and when the number of covariates exceeds the within-cluster sample size. A real data set is analyzed to illustrate the proposed modeling methodology, where both proposed DPM models again attained better predictive accuracy.
Submitted 25 February, 2021; v1 submitted 21 October, 2020;
originally announced October 2020.
-
Bayes Calculations from Quantile Implied Likelihood
Authors:
George Karabatsos,
Fabrizio Leisen
Abstract:
In statistical practice, a realistic Bayesian model for a given data set can be defined by a likelihood function that is analytically or computationally intractable, due to large data sample size, high parameter dimensionality, or complex likelihood functional form. This in turn poses challenges to the computation and inference of the posterior distribution of the model parameters. For such a model, a tractable likelihood function is introduced which approximates the exact likelihood through its quantile function. It is defined by an asymptotic chi-square confidence distribution for a pivotal quantity, which is generated by the asymptotic normal distribution of the sample quantiles given model parameters. This Quantile Implied Likelihood (QIL) gives rise to an approximate posterior distribution which can be estimated by using penalized log-likelihood maximization or any suitable Monte Carlo algorithm. The QIL approach to Bayesian Computation is illustrated through the Bayesian analysis of simulated and real data sets having sample sizes that reach the millions. The analyses involve various models for univariate or multivariate iid or non-iid data, with low or high parameter dimensionality, many of which are defined by intractable likelihoods. The probability models include the Student's t, g-and-h, and g-and-k distributions; the Bayesian logit regression model with many covariates; exponential random graph model, a doubly-intractable model for networks; the multivariate skew normal model, for robust inference of the inverse-covariance matrix when it is large relative to the sample size; and the Wallenius distribution model.
Submitted 16 March, 2019; v1 submitted 2 February, 2018;
originally announced February 2018.
-
An Approximate Likelihood Perspective on ABC Methods
Authors:
George Karabatsos,
Fabrizio Leisen
Abstract:
We are living in the big data era, as current technologies and networks allow for the easy and routine collection of data sets in different disciplines. Bayesian Statistics offers a flexible modeling approach which is attractive for describing the complexity of these datasets. These models often exhibit a likelihood function which is intractable due to the large sample size, high number of parameters, or functional complexity. Approximate Bayesian Computational (ABC) methods provide likelihood-free methods for performing statistical inferences with Bayesian models defined by intractable likelihood functions. The vastness of the literature on ABC methods created a need to review and relate all ABC approaches so that scientists can more readily understand and apply them for their own work. This article provides a unifying review, general representation, and classification of all ABC methods from the view of approximate likelihood theory. This clarifies how ABC methods can be characterized, related, combined, improved, and applied for future research. Possible future research in ABC is then suggested.
Submitted 8 May, 2018; v1 submitted 17 August, 2017;
originally announced August 2017.
-
A Dirichlet Process Functional Approach to Heteroscedastic-Consistent Covariance Estimation
Authors:
George Karabatsos
Abstract:
The mixture of Dirichlet process (MDP) defines a flexible prior distribution on the space of probability measures. This study shows that the ordinary least-squares (OLS) estimator, as a functional of the MDP posterior distribution, has posterior mean given by weighted least-squares (WLS), and has posterior covariance matrix given by the (weighted) heteroscedastic-consistent sandwich estimator. This is according to a pairs bootstrap distribution approximation of the posterior, using a Pólya urn scheme. Also, when the MDP prior baseline distribution is specified as a product of independent probability measures, this WLS solution provides a new type of generalized ridge regression estimator which can handle multicollinear or singular design matrices even when the number of covariates exceeds the sample size, and which shrinks the coefficient estimates of irrelevant covariates towards zero, thus useful for nonlinear regressions. Also, this MDP/OLS functional methodology can be extended to methods for analyzing the sensitivity of the heteroscedasticity-consistent causal effect size over a range of hidden biases due to missing covariates omitted from the regression, and more generally extended to a Vibration of Effects analysis. The methodology is illustrated through the analysis of simulated and real data sets. Overall, this study establishes new connections between Dirichlet process functional inference, the bootstrap, consistent sandwich covariance estimation, ridge shrinkage regression, WLS, and sensitivity analysis, to provide regression methodology useful for inferences of the mean dependent response.
Submitted 11 June, 2016; v1 submitted 16 February, 2016;
originally announced February 2016.
-
A Menu-Driven Software Package of Bayesian Nonparametric (and Parametric) Mixed Models for Regression Analysis and Density Estimation
Authors:
George Karabatsos
Abstract:
Most of applied statistics involves regression analysis of data. This paper presents a stand-alone and menu-driven software package, Bayesian Regression: Nonparametric and Parametric Models. Currently, this package gives the user a choice from 83 Bayesian models for data analysis. They include 47 Bayesian nonparametric (BNP) infinite-mixture regression models; 5 BNP infinite-mixture models for density estimation; and 31 normal random effects models (HLMs), including normal linear models. Each of the 78 regression models handles either a continuous, binary, or ordinal dependent variable, and can handle multi-level (grouped) data. All 83 Bayesian models can handle the analysis of weighted observations (e.g., for meta-analysis), and the analysis of left-censored, right-censored, and/or interval-censored data. Each BNP infinite-mixture model has a mixture distribution assigned one of various BNP prior distributions, including priors defined by either the Dirichlet process, Pitman-Yor process (including the normalized stable process), beta (two-parameter) process, normalized inverse-Gaussian process, geometric weights prior, dependent Dirichlet process, or the dependent infinite-probits prior. The software user can mouse-click to select a Bayesian model and perform data analysis via Markov chain Monte Carlo (MCMC) sampling. After the sampling completes, the software automatically opens text output that reports MCMC-based estimates of the model's posterior distribution and model predictive fit to the data. Additional text and/or graphical output can be generated by mouse-clicking other menu options. This includes output of MCMC convergence analyses, and estimates of the model's posterior predictive distribution, for selected functionals and values of covariates. The software, constructed from MATLAB Compiler, is illustrated through the BNP regression analysis of real data.
Submitted 14 July, 2015; v1 submitted 17 June, 2015;
originally announced June 2015.
-
A Bayesian Nonparametric IRT Model
Authors:
George Karabatsos
Abstract:
This paper introduces a flexible Bayesian nonparametric Item Response Theory (IRT) model, which applies to dichotomous or polytomous item responses, and which can apply to either unidimensional or multidimensional scaling. This is an infinite-mixture IRT model, with person ability and item difficulty parameters, and with a random intercept parameter that is assigned a mixing distribution, with mixing weights a probit function of other person and item parameters. As a result of its flexibility, the Bayesian nonparametric IRT model can provide outlier-robust estimation of the person ability parameters and the item difficulty parameters in the posterior distribution. The estimation of the posterior distribution of the model is undertaken by standard Markov chain Monte Carlo (MCMC) methods based on slice sampling. This mixture IRT model is illustrated through the analysis of real data obtained from a teacher preparation questionnaire, consisting of polytomous items, and consisting of other covariates that describe the examinees (teachers). For these data, the model obtains zero outliers and an R-squared of one. The paper concludes with a short discussion of how to apply the IRT model for the analysis of item response data, using menu-driven software that was developed by the author.
Submitted 11 February, 2015;
originally announced February 2015.
-
Fast Marginal Likelihood Estimation of the Ridge Parameter(s) in Ridge Regression and Generalized Ridge Regression for Big Data
Authors:
George Karabatsos
Abstract:
Unlike the ordinary least-squares (OLS) estimator for the linear model, a ridge regression linear model provides coefficient estimates via shrinkage, usually with improved mean-square and prediction error. This is true especially when the observed design matrix is ill-conditioned or singular, either as a result of highly-correlated covariates or the number of covariates exceeding the sample size. This paper introduces novel and fast marginal maximum likelihood (MML) algorithms for estimating the shrinkage parameter(s) for the Bayesian ridge and power ridge regression models, and an automatic plug-in MML estimator for the Bayesian generalized ridge regression model. With the aid of the singular value decomposition of the observed covariate design matrix, these MML estimation methods are quite fast even for data sets where either the sample size (n) or the number of covariates (p) is very large, and even when p>n. On several real data sets varying widely in terms of n and p, the computation times of the MML estimation methods for the three ridge models, respectively, are compared with the times of other methods for estimating the shrinkage parameter in ridge, LASSO and Elastic Net (EN) models, with the other methods based on minimizing prediction error according to cross-validation or information criteria. Also, the ridge, LASSO, and EN models, and their associated estimation methods, are compared in terms of prediction accuracy. Furthermore, a simulation study compares the ridge models under MML estimation, against the LASSO and EN models, in terms of their ability to differentiate between truly-significant covariates (i.e., with non-zero slope coefficients) and truly-insignificant covariates (with zero coefficients).
Submitted 23 June, 2015; v1 submitted 8 September, 2014;
originally announced September 2014.
-
A Bayesian Nonparametric Hypothesis Testing Approach for Regression Discontinuity Designs
Authors:
George Karabatsos,
Stephen G. Walker
Abstract:
The regression discontinuity (RD) design is a popular approach to causal inference in non-randomized studies. This is because it can be used to identify and estimate causal effects under mild conditions. Specifically, for each subject, the RD design assigns a treatment or non-treatment, depending on whether or not an observed value of an assignment variable exceeds a fixed and known cutoff value.
In this paper, we propose a Bayesian nonparametric regression modeling approach to RD designs, which exploits a local randomization feature. In this approach, the assignment variable is treated as a covariate, and a scalar-valued confounding variable is treated as a dependent variable (which may be a multivariate confounder score). Then, over the model's posterior distribution of locally-randomized subjects that cluster around the cutoff of the assignment variable, inferences for causal effects are made within this random cluster, via two-group statistical comparisons of treatment outcomes and non-treatment outcomes.
We illustrate the Bayesian nonparametric approach through the analysis of a real educational data set, to investigate the causal link between basic skills and teaching ability.
Submitted 8 February, 2014;
originally announced February 2014.
-
A Bayesian Nonparametric Causal Model for Regression Discontinuity Designs
Authors:
George Karabatsos,
Stephen G. Walker
Abstract:
For non-randomized studies, the regression discontinuity design (RDD) can be used to identify and estimate causal effects from a "locally-randomized" subgroup of subjects, under relatively mild conditions. However, current models focus causal inferences on the impact of the treatment (versus non-treatment) variable on the mean of the dependent variable, via linear regression. For RDDs, we propose a flexible Bayesian nonparametric regression model that can provide accurate estimates of causal effects, in terms of the predictive mean, variance, quantile, probability density, distribution function, or any other chosen function of the outcome variable. We illustrate the model through the analysis of two real educational data sets, involving (resp.) a sharp RDD and a fuzzy RDD.
Submitted 11 February, 2015; v1 submitted 18 November, 2013;
originally announced November 2013.
-
On Bayesian Nonparametric Continuous Time Series Models
Authors:
George Karabatsos,
Stephen G. Walker
Abstract:
This paper is a note on the use of Bayesian nonparametric mixture models for continuous time series. We identify a key requirement for such models, and then establish that there is a single type of model which meets this requirement. As it turns out, the model is well known in multiple change-point problems.
Submitted 2 March, 2013;
originally announced March 2013.
-
Informant Discrepancies and the Heritability of Antisocial Behavior: A Meta-Analysis
Authors:
Elizabeth Talbott,
George Karabatsos,
Jaime Zurheide
Abstract:
Antisocial behavior, which includes both aggressive and delinquent activities, is the opposite of prosocial behavior. Researchers have studied the heritability of antisocial behavior among twin and non-twin sibling pairs from behavioral ratings made by parents, teachers, observers, and youth. Through a meta-analysis, we examined longitudinal and cross sectional research in the behavioral genetics of antisocial behavior, consisting of 42 studies, of which 38 were studies of twin pairs, 3 were studies of twins and non-twin siblings, and 1 was a study of adoptees. These studies provided n = 89 heritability (h2) effect size estimates from a total of 94,517 sibling pairs who ranged in age from 1.5 to 18 years; studies provided data for 29 moderators (predictors). We employed a random-effects meta-analysis model to achieve three goals: (a) perform statistical inference of the overall heritability distribution in the underlying population of studies, (b) identify significant study level moderators (predictors) of heritability, and (c) examine how the heritability distribution varied as a function of age and type of informant, particularly in longitudinal research. The meta-analysis indicated a bimodal overall heritability distribution, indicating two clusters of moderate and high heritability values, respectively; identified four moderators that predicted significant changes in mean heritability; and indicated differential patterns of median h2 and variance (interquartile ranges) across informants and ages. We argue for a cross-perspective, cross-setting model for selecting informants in behavioral genetic research, that is flexible and sensitive to changes in antisocial behavior over time.
Submitted 7 February, 2013;
originally announced February 2013.
-
A Bayesian Nonparametric Meta-Analysis Model
Authors:
George Karabatsos,
Elizabeth Talbott,
Stephen G. Walker
Abstract:
In a meta-analysis, it is important to specify a model that adequately describes the effect-size distribution of the underlying population of studies. The conventional normal fixed-effect and normal random-effects models assume a normal effect-size population distribution, conditionally on parameters and covariates. For estimating the mean overall effect size, such models may be adequate, but for prediction they surely are not if the effect size distribution exhibits non-normal behavior. To address this issue, we propose a Bayesian nonparametric meta-analysis model, which can describe a wider range of effect-size distributions, including unimodal symmetric distributions, as well as skewed and more multimodal distributions. We demonstrate our model through the analysis of real meta-analytic data arising from behavioral-genetic research. We compare the predictive performance of the Bayesian nonparametric model against various conventional and more modern normal fixed-effects and random-effects models.
Submitted 17 October, 2013; v1 submitted 31 January, 2013;
originally announced January 2013.
-
Dependent Dirichlet Process Rating Model (DDP-RM)
Authors:
Ken Akira Fujimoto,
George Karabatsos
Abstract:
Typical IRT rating-scale models assume that the rating category threshold parameters are the same over examinees. However, it can be argued that many rating data sets violate this assumption. To address this practical psychometric problem, we introduce a novel, Bayesian nonparametric IRT model for rating scale items. The model is an infinite-mixture of Rasch partial credit models, based on a localized Dependent Dirichlet process (DDP). The model treats the rating thresholds as the random parameters that are subject to the mixture, and has (stick-breaking) mixture weights that are covariate-dependent. Thus, the novel model allows the rating category thresholds to vary flexibly across items and examinees, and allows the distribution of the category thresholds to vary flexibly as a function of covariates. We illustrate the new model through the analysis of a simulated data set, and through the analysis of a real rating data set that is well-known in the psychometric literature. The model is shown to have better predictive-fit performance, compared to other commonly used IRT rating models.
Submitted 20 March, 2013; v1 submitted 20 December, 2012;
originally announced December 2012.
-
A Latent-Variable Bayesian Nonparametric Regression Model
Authors:
George Karabatsos,
Stephen G. Walker
Abstract:
We introduce a random partition model for Bayesian nonparametric regression. The model is based on infinitely-many disjoint regions of the range of a latent covariate-dependent Gaussian process. Given a realization of the process, the cluster of dependent variable responses that share a common region are assumed to arise from the same distribution. Also, the latent Gaussian process prior allows for the random partitions (i.e., clusters of the observations) to exhibit dependencies among one another. The model is illustrated through the analysis of a real data set arising from education, and through the analysis of simulated data that were generated from complex data-generating models.
Submitted 2 January, 2013; v1 submitted 15 December, 2012;
originally announced December 2012.