Search | arXiv e-print repository

A dynamic copula model for probabilistic forecasting of non-Gaussian multivariate time series

Abstract: Multivariate time series (MTS) data often include a heterogeneous mix of non-Gaussian distributional features (asymmetry, multimodality, heavy tails) and data types (continuous and discrete variables). Traditional MTS methods based on convenient parametric distributions are typically ill-equipped to model this heterogeneity. Copula models provide an appealing alternative, but present significant obstacles for fully Bayesian inference and probabilistic forecasting. To overcome these challenges, we propose a novel and general strategy for posterior approximation in MTS copula models and apply it to a Gaussian copula built from a dynamic factor model. This framework provides scalable, fully Bayesian inference for cross-sectional and serial dependencies and nonparametrically learns heterogeneous marginal distributions. We validate this approach by establishing posterior consistency and confirm excellent finite-sample performance even under model misspecification using simulated data. We apply our method to crime count and macroeconomic MTS data and find superior probabilistic forecasting performance compared to popular MTS models. These results demonstrate that the proposed method is a versatile, general-purpose utility for probabilistic forecasting of MTS that works well across a range of applications with minimal user input.

Submitted 24 February, 2025; originally announced February 2025.

Comments: 49 pages, 10 figures, 4 tables

arXiv:2412.02970 [pdf, other]

Uncovering dynamics between SARS-CoV-2 wastewater concentrations and community infections via Bayesian spatial functional concurrent regression

Authors: Thomas Y. Sun, Julia C. Schedler, Daniel R. Kowal, Rebecca Schneider, Lauren B. Stadler, Loren Hopkins, Katherine B. Ensor

Abstract: Monitoring wastewater concentrations of SARS-CoV-2 yields a low-cost, noninvasive method for tracking disease prevalence and provides early warning signs of upcoming outbreaks in the serviced communities. There is tremendous clinical and public health interest in understanding the exact dynamics between wastewater viral loads and infection rates in the population. As both data sources may contain substantial noise and missingness, in addition to spatial and temporal dependencies, properly modeling this relationship must address these numerous complexities simultaneously while providing interpretable and clear insights. We propose a novel Bayesian functional concurrent regression model that accounts for both spatial and temporal correlations while estimating the dynamic effects between wastewater concentrations and positivity rates over time. We explicitly model the time lag between the two series and provide full posterior inference on the possible delay between spikes in wastewater concentrations and subsequent outbreaks. We estimate a time lag likely between 5 and 11 days between spikes in wastewater levels and reported clinical positivity rates. Additionally, we find a dynamic relationship between wastewater concentration levels and the strength of its association with positivity rates that fluctuates between outbreaks and non-outbreaks.

Submitted 3 December, 2024; originally announced December 2024.

arXiv:2411.18477 [pdf]

doi 10.1002/adma.202417874

Scaling Up Purcell-Enhanced Self-Assembled Nanoplasmonic Perovskite Scintillators into the Bulk Regime

Authors: Michal Makowski, Wenzheng Ye, Dominik Kowal, Francesco Maddalena, Somnath Mahato, Yudhistira Tirtayasri Amrillah, Weronika Zajac, Marcin Eugeniusz Witkowski, Konrad Jacek Drozdowski, Nathaniel, Cuong Dang, Joanna Cybinska, Winicjusz Drozdowski, Ferry Anggoro Ardy Nugroho, Christophe Dujardin, Liang Jie Wong, Muhammad Danang Birowosuto

Abstract: Scintillators convert high-energy radiation into detectable photons and play a crucial role in medical imaging and security applications. The enhancement of scintillator performance through nanophotonics and nanoplasmonics, specifically using the Purcell effect, has shown promise but has so far been limited to ultrathin scintillator films because of the localized nature of this effect. This study introduces a method to expand the application of nanoplasmonic scintillators to the bulk regime. By integrating 100-nm-sized plasmonic spheroid and cuboid nanoparticles with perovskite scintillator nanocrystals, we enable nanoplasmonic scintillators to function effectively within bulk-scale devices. We experimentally demonstrate power and decay rate enhancements of up to (3.20 $\pm$ 0.20) and (4.20 $\pm$ 0.31) folds for plasmonic spheroid and cuboid nanoparticles, respectively, in a 5-mm thick CsPbBr$_{3}$ nanocrystal-polymer scintillator at RT. Theoretical modeling also predicts similar enhancements of up to (2.26 $\pm$ 0.31) and (3.02 $\pm$ 0.69) folds for the same nanoparticle shapes and dimensions. Moreover, we demonstrate a (2.07 $\pm$ 0.39) fold increase in light yield under $^{241}$Am $\gamma$-excitation. These findings provide a viable pathway for utilizing nanoplasmonics to enhance bulk scintillator devices, advancing radiation detection technology.

Submitted 13 May, 2025; v1 submitted 27 November, 2024; originally announced November 2024.

Comments: 60 pages with 17 figures, split between main text and supporting information. This is a full-length research article (version 5). Updated corrections

Journal ref: Advanced Materials 2025

arXiv:2408.00618 [pdf, other]

Facilitating heterogeneous effect estimation via statistically efficient categorical modifiers

Authors: Daniel R. Kowal

Abstract: Categorical covariates such as race, sex, or group are ubiquitous in regression analysis. While main-only (or ANCOVA) linear models are predominant, cat-modified linear models that include categorical-continuous or categorical-categorical interactions are increasingly important and allow heterogeneous, group-specific effects. However, with standard approaches, the addition of cat-modifiers fundamentally alters the estimates and interpretations of the main effects, often inflates their standard errors, and introduces significant concerns about group (e.g., racial) biases. We advocate an alternative parametrization and estimation scheme using abundance-based constraints (ABCs). ABCs induce a model parametrization that is both interpretable and equitable. Crucially, we show that with ABCs, the addition of cat-modifiers 1) leaves main effect estimates unchanged and 2) enhances their statistical power, under reasonable conditions. Thus, analysts can, and arguably should, include cat-modifiers in linear regression models to discover potential heterogeneous effects--without compromising estimation, inference, and interpretability for the main effects. Using simulated data, we verify these invariance properties for estimation and inference and showcase the capabilities of ABCs to increase statistical power. We apply these tools to study demographic heterogeneities among the effects of social and environmental factors on STEM educational outcomes for children in North Carolina. An R package lmabc is available.

Submitted 1 August, 2024; originally announced August 2024.

arXiv:2406.03463 [pdf, other]

Gaussian Copula Models for Nonignorable Missing Data Using Auxiliary Marginal Quantiles

Authors: Joseph Feldman, Jerome P. Reiter, Daniel R. Kowal

Abstract: We present an approach for modeling and imputation of nonignorable missing data. Our approach uses Bayesian data integration to combine (1) a Gaussian copula model for all study variables and missingness indicators, which allows arbitrary marginal distributions, nonignorable missingness, and other dependencies, and (2) auxiliary information in the form of marginal quantiles for some study variables. We prove that, remarkably, one only needs a small set of accurately-specified quantiles to estimate the copula correlation consistently. The remaining marginal distribution functions are inferred nonparametrically and jointly with the copula parameters using an efficient MCMC algorithm. We also characterize the (additive) nonignorable missingness mechanism implied by the copula model. Simulations confirm the effectiveness of this approach for multivariate imputation with nonignorable missing data. We apply the model to analyze associations between lead exposure and end-of-grade test scores for 170,000 North Carolina students. Lead exposure has nonignorable missingness: children with higher exposure are more likely to be measured. We elicit marginal quantiles for lead exposure using statistics provided by the Centers for Disease Control and Prevention. Multiple imputation inferences under our model support stronger, more adverse associations between lead exposure and educational outcomes relative to complete case and missing-at-random analyses.

Submitted 16 November, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

Comments: 35 pages (main text only), 8 Figures

arXiv:2405.03041 [pdf, other]

Bayesian Functional Graphical Models with Change-Point Detection

Authors: Chunshan Liu, Daniel R. Kowal, James Doss-Gollin, Marina Vannucci

Abstract: Functional data analysis, which models data as realizations of random functions over a continuum, has emerged as a useful tool for time series data. Often, the goal is to infer the dynamic connections (or time-varying conditional dependencies) among multiple functions or time series. For this task, a dynamic and Bayesian functional graphical model is introduced. The proposed modeling approach prioritizes the careful definition of an appropriate graph to identify both time-invariant and time-varying connectivity patterns. A novel block-structured sparsity prior is paired with a finite basis expansion, which together yield effective shrinkage and graph selection with efficient computations via a Gibbs sampling algorithm. Crucially, the model includes (one or more) graph changepoints, which are learned jointly with all model parameters and incorporate graph dynamics. Simulation studies demonstrate excellent graph selection capabilities, with significant improvements over competing methods. The proposed approach is applied to the study of dynamic connectivity patterns of sea surface temperatures in the Pacific Ocean and reveals meaningful edges.

Submitted 7 December, 2024; v1 submitted 5 May, 2024; originally announced May 2024.

Comments: Revised for Computational Statistics and Data Analysis

arXiv:2311.02043 [pdf, other]

Bayesian Quantile Regression with Subset Selection: A Decision Analysis Perspective

Authors: Joseph Feldman, Daniel Kowal

Abstract: Quantile regression is a powerful tool for inferring how covariates affect specific percentiles of the response distribution. Existing methods either estimate conditional quantiles separately for each quantile of interest or estimate the entire conditional distribution using semi- or non-parametric models. The former often produce inadequate models for real data and do not share information across quantiles, while the latter are characterized by complex and constrained models that can be difficult to interpret and computationally inefficient. Neither approach is well-suited for quantile-specific subset selection. Instead, we pose the fundamental problems of linear quantile estimation, uncertainty quantification, and subset selection from a Bayesian decision analysis perspective. For any Bayesian regression model -- including, but not limited to, existing Bayesian quantile regression models -- we derive optimal point estimates, interpretable uncertainty quantification, and scalable subset selection techniques for all model-based conditional quantiles. Our approach introduces a quantile-focused squared error loss that enables efficient, closed-form computing and maintains a close relationship with Wasserstein-based density estimation. In an extensive simulation study, our methods demonstrate substantial gains in quantile estimation accuracy, inference, and variable selection over frequentist and Bayesian competitors. We use these tools to identify and quantify the heterogeneous impacts of multiple social stressors and environmental exposures on educational outcomes across the full spectrum of low-, medium-, and high-achieving students in North Carolina.

Submitted 16 November, 2024; v1 submitted 3 November, 2023; originally announced November 2023.

arXiv:2309.06320 [pdf, other]

doi 10.1002/adma.202309410

The Nanoplasmonic Purcell Effect in Ultrafast and High-Light-Yield Perovskite Scintillators

Authors: Wenzheng Ye, Zhihua Yong, Michael Go, Dominik Kowal, Francesco Maddalena, Liliana Tjahjana, Wang Hong, Arramel Arramel, Christophe Dujardin, Muhammad Danang Birowosuto, Liang Jie Wong

Abstract: The development of X-ray scintillators with ultrahigh light yields and ultrafast response times is a long sought-after goal. In this work, we theoretically predict and experimentally demonstrate a fundamental mechanism that pushes the frontiers of ultrafast X-ray scintillator performance: the use of nanoscale-confined surface plasmon polariton modes to tailor the scintillator response time via the Purcell effect. By incorporating nanoplasmonic materials in scintillator devices, this work predicts over 10-fold enhancement in decay rate and 38% reduction in time resolution even with only a simple planar design. We experimentally demonstrate the nanoplasmonic Purcell effect using perovskite scintillators, enhancing the light yield by over 120% to 88 $\pm$ 11 ph/keV, and the decay rate by over 60% to 2.0 $\pm$ 0.2 ns for the average decay time, and 0.7 $\pm$ 0.1 ns for the ultrafast decay component, in good agreement with the predictions of our theoretical framework. We perform proof-of-concept X-ray imaging experiments using nanoplasmonic scintillators, demonstrating 182% enhancement in the modulation transfer function at 4 line pairs per millimeter spatial frequency. This work highlights the enormous potential of nanoplasmonics in optimizing ultrafast scintillator devices for applications including time-of-flight X-ray imaging and photon-counting computed tomography.

Submitted 12 September, 2023; originally announced September 2023.

Comments: 34 pages, 3 figures

arXiv:2308.14996 [pdf, other]

The projected dynamic linear model for time series on the sphere

Authors: John Zito, Daniel Kowal

Abstract: Time series on the unit n-sphere arise in directional statistics, compositional data analysis, and many scientific fields. There are few models for such data, and the ones that exist suffer from several limitations: they are often computationally challenging to fit, many of them apply only to the circular case of n=2, and they are usually based on families of distributions that are not flexible enough to capture the complexities observed in real data. Furthermore, there is little work on Bayesian methods for spherical time series. To address these shortcomings, we propose a state space model based on the projected normal distribution that can be applied to spherical time series of arbitrary dimension. We describe how to perform fully Bayesian offline inference for this model using a simple and efficient Gibbs sampling algorithm, and we develop a Rao-Blackwellized particle filter to perform online inference for streaming data. In analyses of wind direction and energy market time series, we show that the proposed model outperforms competitors in terms of point, set, and density forecasting.

Submitted 4 August, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: 28 pages, 8 figures

arXiv:2306.07168 [pdf, other]

Ultra-efficient MCMC for Bayesian longitudinal functional data analysis

Authors: Thomas Y. Sun, Daniel R. Kowal

Abstract: Functional mixed models are widely useful for regression analysis with dependent functional data, including longitudinal functional data with scalar predictors. However, existing algorithms for Bayesian inference with these models only provide either scalable computing or accurate approximations to the posterior distribution, but not both. We introduce a new MCMC sampling strategy for highly efficient and fully Bayesian regression with longitudinal functional data. Using a novel blocking structure paired with an orthogonalized basis reparametrization, our algorithm jointly samples the fixed effects regression functions together with all subject- and replicate-specific random effects functions. Crucially, the joint sampler optimizes sampling efficiency for these key parameters while preserving computational scalability. Perhaps surprisingly, our new MCMC sampling algorithm even surpasses state-of-the-art algorithms for frequentist estimation and variational Bayes approximations for functional mixed models -- while also providing accurate posterior uncertainty quantification -- and is orders of magnitude faster than existing Gibbs samplers. Simulation studies show improved point estimation and interval coverage in nearly all simulation settings over competing approaches. We apply our method to a large physical activity dataset to study how various demographic and health factors associate with intraday activity.

Submitted 12 June, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

arXiv:2306.05498 [pdf, other]

doi 10.1080/01621459.2024.2395586

Monte Carlo inference for semiparametric Bayesian regression

Authors: Daniel R. Kowal, Bohan Wu

Abstract: Data transformations are essential for broad applicability of parametric regression models. However, for Bayesian analysis, joint inference of the transformation and model parameters typically involves restrictive parametric transformations or nonparametric representations that are computationally inefficient and cumbersome for implementation and theoretical analysis, which limits their usability in practice. This paper introduces a simple, general, and efficient strategy for joint posterior inference of an unknown transformation and all regression model parameters. The proposed approach directly targets the posterior distribution of the transformation by linking it with the marginal distributions of the independent and dependent variables, and then deploys a Bayesian nonparametric model via the Bayesian bootstrap. Crucially, this approach delivers (1) joint posterior consistency under general conditions, including multiple model misspecifications, and (2) efficient Monte Carlo (not Markov chain Monte Carlo) inference for the transformation and all parameters for important special cases. These tools apply across a variety of data domains, including real-valued, positive, and compactly-supported data. Simulation studies and an empirical application demonstrate the effectiveness and efficiency of this strategy for semiparametric Bayesian analysis with linear models, quantile regression, and Gaussian processes. The R package SeBR is available on CRAN.

Submitted 29 July, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

arXiv:2210.14988 [pdf, other]

Nonparametric Copula Models for Multivariate, Mixed, and Missing Data

Authors: Joseph Feldman, Daniel R. Kowal

Abstract: Modern datasets commonly feature both substantial missingness and many variables of mixed data types, which present significant challenges for estimation and inference. Complete case analysis, which proceeds using only the observations with fully-observed variables, is often severely biased, while model-based imputation of missing values is limited by the ability of the model to capture complex dependencies among (possibly many) variables of mixed data types. To address these challenges, we develop a novel Bayesian mixture copula for joint and nonparametric modeling of multivariate count, continuous, ordinal, and unordered categorical variables, and deploy this model for inference, prediction, and imputation of missing data. Most uniquely, we introduce a new and computationally efficient strategy for marginal distribution estimation that eliminates the need to specify any marginal models yet delivers posterior consistency for each marginal distribution and the copula parameters under missingness-at-random. Extensive simulation studies demonstrate exceptional modeling and imputation capabilities relative to competing methods, especially with mixed data types, complex missingness mechanisms, and nonlinear dependencies. We conclude with a data analysis that highlights how improper treatment of missing data can distort a statistical analysis, and how the proposed approach offers a resolution.

Submitted 7 April, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

Comments: 65 pages, 18 figures, 2 tables

arXiv:2203.00784 [pdf, other]

Bayesian adaptive and interpretable functional regression for exposure profiles

Authors: Yunan Gao, Daniel R. Kowal

Abstract: Pollutant exposure during gestation is a known and adverse factor for birth and health outcomes. However, the links between prenatal air pollution exposures and educational outcomes are less clear, in particular the critical windows of susceptibility during pregnancy. Using a large cohort of students in North Carolina, we study the link between prenatal daily $\mbox{PM}_{2.5}$ exposure and 4th end-of-grade reading scores. We develop and apply a locally adaptive and highly scalable Bayesian regression model for scalar responses with functional and scalar predictors. The proposed model pairs a B-spline basis expansion with dynamic shrinkage priors to capture both smooth and rapidly-changing features in the regression surface. The model is accompanied by a new decision analysis approach for functional regression that extracts the critical windows of susceptibility and guides the model interpretations. These tools help to identify and address broad limitations with the interpretability of functional regression models. Simulation studies demonstrate more accurate point estimation, more precise uncertainty quantification, and far superior window selection than existing approaches. Leveraging the proposed modeling, computational, and decision analysis framework, we conclude that prenatal $\mbox{PM}_{2.5}$ exposure during early and late pregnancy is most adverse for 4th end-of-grade reading scores.

Submitted 10 October, 2022; v1 submitted 1 March, 2022; originally announced March 2022.

Comments: Main paper: 32 pages, 11 figures Supplementary materials: 10 pages, 5 figures

arXiv:2110.14790 [pdf, other]

doi 10.1214/23-BA1394

Warped Dynamic Linear Models for Time Series of Counts

Authors: Brian King, Daniel R. Kowal

Abstract: Dynamic Linear Models (DLMs) are commonly employed for time series analysis due to their versatile structure, simple recursive updating, ability to handle missing data, and probabilistic forecasting. However, the options for count time series are limited: Gaussian DLMs require continuous data, while Poisson-based alternatives often lack sufficient modeling flexibility. We introduce a novel semiparametric methodology for count time series by warping a Gaussian DLM. The warping function has two components: a (nonparametric) transformation operator that provides distributional flexibility and a rounding operator that ensures the correct support for the discrete data-generating process. We develop conjugate inference for the warped DLM, which enables analytic and recursive updates for the state space filtering and smoothing distributions. We leverage these results to produce customized and efficient algorithms for inference and forecasting, including Monte Carlo simulation for offline analysis and an optimal particle filter for online inference. This framework unifies and extends a variety of discrete time series models and is valid for natural counts, rounded values, and multivariate observations. Simulation studies illustrate the excellent forecasting capabilities of the warped DLM. The proposed approach is applied to a multivariate time series of daily overdose counts and demonstrates both modeling and computational successes.

Submitted 6 June, 2023; v1 submitted 27 October, 2021; originally announced October 2021.

arXiv:2110.12316 [pdf, other]

Semiparametric discrete data regression with Monte Carlo inference and prediction

Authors: Daniel R. Kowal, Bohan Wu

Abstract: Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over-/under-dispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximations such as MCMC typically are needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data with Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via direct Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. This novel approach is applied successfully to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.

Submitted 24 February, 2023; v1 submitted 23 October, 2021; originally announced October 2021.

arXiv:2108.02151 [pdf, other]

Semiparametric Functional Factor Models with Bayesian Rank Selection

Authors: Daniel R. Kowal, Antonio Canale

Abstract: Functional data are frequently accompanied by a parametric template that describes the typical shapes of the functions. However, these parametric templates can incur significant bias, which undermines both utility and interpretability. To correct for model misspecification, we augment the parametric template with an infinite-dimensional nonparametric functional basis. The nonparametric basis functions are learned from the data and constrained to be orthogonal to the parametric template, which preserves distinctness between the parametric and nonparametric terms. This distinctness is essential to prevent functional confounding, which otherwise induces severe bias for the parametric terms. The nonparametric factors are regularized with an ordered spike-and-slab prior that provides consistent rank selection and satisfies several appealing theoretical properties. The versatility of the proposed approach is illustrated through applications to synthetic data, human motor control data, and dynamic yield curve data. Relative to parametric and semiparametric alternatives, the proposed semiparametric functional factor model eliminates bias, reduces excessive posterior and predictive uncertainty, and provides reliable inference on the effective number of nonparametric terms--all with minimal additional computational costs.

Submitted 16 May, 2022; v1 submitted 4 August, 2021; originally announced August 2021.

arXiv:2107.12890 [pdf, other]

Subset selection for linear mixed models

Authors: Daniel R. Kowal

Abstract: Linear mixed models (LMMs) are instrumental for regression analysis with structured dependence, such as grouped, clustered, or multilevel data. However, selection among the covariates--while accounting for this structured dependence--remains a challenge. We introduce a Bayesian decision analysis for subset selection with LMMs. Using a Mahalanobis loss function that incorporates the structured dependence, we derive optimal linear coefficients for (i) any given subset of variables and (ii) all subsets of variables that satisfy a cardinality constraint. Crucially, these estimates inherit shrinkage or regularization and uncertainty quantification from the underlying Bayesian model, and apply for any well-specified Bayesian LMM. More broadly, our decision analysis strategy deemphasizes the role of a single "best" subset, which is often unstable and limited in its information content, and instead favors a collection of near-optimal subsets. This collection is summarized by key member subsets and variable-specific importance metrics. Customized subset search and out-of-sample approximation algorithms are provided for more scalable computing. These tools are applied to simulated data and a longitudinal physical activity dataset, and demonstrate excellent prediction, estimation, and selection ability.

Submitted 18 April, 2022; v1 submitted 27 July, 2021; originally announced July 2021.

arXiv:2106.09114 [pdf, other]

Semiparametric count data regression for self-reported mental health

Authors: Daniel R. Kowal, Bohan Wu

Abstract: "For how many days during the past 30 days was your mental health not good?" The responses to this question measure self-reported mental health and can be linked to important covariates in the National Health and Nutrition Examination Survey (NHANES). However, these count variables present major distributional challenges: the data are overdispersed, zero-inflated, bounded by 30, and heaped in five- and seven-day increments. To meet these challenges, we design a semiparametric estimation and inference framework for count data regression. The data-generating process is defined by simultaneously transforming and rounding (STAR) a latent Gaussian regression model. The transformation is estimated nonparametrically and the rounding operator ensures the correct support for the discrete and bounded data. Maximum likelihood estimators are computed using an EM algorithm that is compatible with any continuous data model estimable by least squares. STAR regression includes asymptotic hypothesis testing and confidence intervals, variable selection via information criteria, and customized diagnostics. Simulation studies validate the utility of this framework. STAR is deployed to study the factors associated with self-reported mental health and demonstrates substantial improvements in goodness-of-fit compared to existing count data regression models.

Submitted 13 October, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

arXiv:2104.10150 [pdf, other]

Bayesian subset selection and variable importance for interpretable prediction and classification

Authors: Daniel R. Kowal

Abstract: Subset selection is a valuable tool for interpretable learning, scientific discovery, and data compression. However, classical subset selection is often avoided due to selection instability, lack of regularization, and difficulties with post-selection inference. We address these challenges from a Bayesian perspective. Given any Bayesian predictive model $\mathcal{M}$, we extract a family of near-optimal subsets of variables for linear prediction or classification. This strategy deemphasizes the role of a single "best" subset and instead advances the broader perspective that often many subsets are highly competitive. The acceptable family of subsets offers a new pathway for model interpretation and is neatly summarized by key members such as the smallest acceptable subset, along with new (co-) variable importance metrics based on whether variables (co-) appear in all, some, or no acceptable subsets. More broadly, we apply Bayesian decision analysis to derive the optimal linear coefficients for any subset of variables. These coefficients inherit both regularization and predictive uncertainty quantification via $\mathcal{M}$. For both simulated and real data, the proposed approach exhibits better prediction, interval estimation, and variable selection than competing Bayesian and frequentist selection methods. These tools are applied to a large education dataset with highly correlated covariates. Our analysis provides unique insights into the combination of environmental, socioeconomic, and demographic factors that predict educational outcomes, and identifies over 200 distinct subsets of variables that offer near-optimal out-of-sample predictive accuracy.

Submitted 16 February, 2022; v1 submitted 20 April, 2021; originally announced April 2021.

arXiv:2102.08255 [pdf, other]

Bayesian Data Synthesis and the Utility-Risk Trade-Off for Mixed Epidemiological Data

Authors: Joseph Feldman, Daniel Kowal

Abstract: Much of the micro data used for epidemiological studies contain sensitive measurements on real individuals. As a result, such micro data cannot be published out of privacy concerns, rendering any published statistical analyses on them nearly impossible to reproduce. To promote the dissemination of key datasets for analysis without jeopardizing the privacy of individuals, we introduce a cohesive Bayesian framework for the generation of fully synthetic, high dimensional micro datasets of mixed categorical, binary, count, and continuous variables. This process centers around a joint Bayesian model that is simultaneously compatible with all of these data types, enabling the creation of mixed synthetic datasets through posterior predictive sampling. Furthermore, a focal point of epidemiological data analysis is the study of conditional relationships between various exposures and key outcome variables through regression analysis. We design a modified data synthesis strategy to target and preserve these conditional relationships, including both nonlinearities and interactions. The proposed techniques are deployed to create a synthetic version of a confidential dataset containing dozens of health, cognitive, and social measurements on nearly 20,000 North Carolina children.

Submitted 19 January, 2022; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: 24 pages, 4 figures, 3 tables, accepted at The Annals of Applied Statistics

arXiv:2006.13107 [pdf, other]

doi 10.1080/01621459.2021.1891926

Fast, Optimal, and Targeted Predictions using Parametrized Decision Analysis

Authors: Daniel R. Kowal

Abstract: Prediction is critical for decision-making under uncertainty and lends validity to statistical inference. With targeted prediction, the goal is to optimize predictions for specific decision tasks of interest, which we represent via functionals. Although classical decision analysis extracts predictions from a Bayesian model, these predictions are often difficult to interpret and slow to compute. Instead, we design a class of parametrized actions for Bayesian decision analysis that produce optimal, scalable, and simple targeted predictions. For a wide variety of action parametrizations and loss functions--including linear actions with sparsity constraints for targeted variable selection--we derive a convenient representation of the optimal targeted prediction that yields efficient and interpretable solutions. Customized out-of-sample predictive metrics are developed to evaluate and compare among targeted predictors. Through careful use of the posterior predictive distribution, we introduce a procedure that identifies a set of near-optimal, or acceptable targeted predictors, which provide unique insights into the features and level of complexity needed for accurate targeted prediction. Simulations demonstrate excellent prediction, estimation, and variable selection capabilities. Targeted predictions are constructed for physical activity data from the National Health and Nutrition Examination Survey (NHANES) to better predict and understand the characteristics of intraday physical activity.

Submitted 10 October, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

arXiv:1906.11653 [pdf, other]

Simultaneous Transformation and Rounding (STAR) Models for Integer-Valued Data

Authors: Daniel R. Kowal, Antonio Canale

Abstract: We propose a simple yet powerful framework for modeling integer-valued data, such as counts, scores, and rounded data. The data-generating process is defined by Simultaneously Transforming and Rounding (STAR) a continuous-valued process, which produces a flexible family of integer-valued distributions capable of modeling zero-inflation, bounded or censored data, and over- or underdispersion. The transformation is modeled as unknown for greater distributional flexibility, while the rounding operation ensures a coherent integer-valued data-generating process. An efficient MCMC algorithm is developed for posterior inference and provides a mechanism for adaptation of successful Bayesian models and algorithms for continuous data to the integer-valued data setting. Using the STAR framework, we design a new Bayesian Additive Regression Tree (BART) model for integer-valued data, which demonstrates impressive predictive distribution accuracy for both synthetic data and a large healthcare utilization dataset. For interpretable regression-based inference, we develop a STAR additive model, which offers greater flexibility and scalability than existing integer-valued models. The STAR additive model is applied to study the recent decline in Amazon river dolphins.

Submitted 3 September, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

arXiv:1902.07788 [pdf, other]

doi 10.1111/biom.13110

Integer-Valued Functional Data Analysis for Measles Forecasting

Authors: Daniel R. Kowal

Abstract: Measles presents a unique and imminent challenge for epidemiologists and public health officials: the disease is highly contagious, yet vaccination rates are declining precipitously in many localities. Consequently, the risk of a measles outbreak continues to rise. To improve preparedness, we study historical measles data both pre- and post-vaccine, and design new methodology to forecast measles counts with uncertainty quantification. We propose to model the disease counts as an integer-valued functional time series: measles counts are a function of time-of-year and time-ordered by year. The counts are modeled using a negative-binomial distribution conditional on a real-valued latent process, which accounts for the overdispersion observed in the data. The latent process is decomposed using an unknown basis expansion, which is learned from the data, with dynamic basis coefficients. The resulting framework provides enhanced capability to model complex seasonality, which varies dynamically from year-to-year, and offers improved multi-month ahead point forecasts and substantially tighter forecast intervals (with correct coverage) compared to existing forecasting models. Importantly, the fully Bayesian approach provides well-calibrated and precise uncertainty quantification for epi-relevant features, such as the future value and time of the peak measles count in a given year. An R package is available online.

Submitted 20 February, 2019; originally announced February 2019.

arXiv:1808.06689 [pdf, other]

Bayesian Function-on-Scalars Regression for High Dimensional Data

Authors: Daniel R. Kowal, Daniel C. Bourgeois

Abstract: We develop a fully Bayesian framework for function-on-scalars regression with many predictors. The functional data response is modeled nonparametrically using unknown basis functions, which produces a flexible and data-adaptive functional basis. We incorporate shrinkage priors that effectively remove unimportant scalar covariates from the model and reduce sensitivity to the number of (unknown) basis functions. For variable selection in functional regression, we propose a decision theoretic posterior summarization technique, which identifies a subset of covariates that retains nearly the predictive accuracy of the full model. Our approach is broadly applicable for Bayesian functional regression models, and unlike existing methods provides joint rather than marginal selection of important predictor variables. Computationally scalable posterior inference is achieved using a Gibbs sampler with linear time complexity in the number of predictors. The resulting algorithm is empirically faster than existing frequentist and Bayesian techniques, and provides joint estimation of model parameters, prediction and imputation of functional trajectories, and uncertainty quantification via the posterior distribution. A simulation study demonstrates improvements in estimation accuracy, uncertainty quantification, and variable selection relative to existing alternatives. The methodology is applied to actigraphy data to investigate the association between intraday physical activity and responses to a sleep questionnaire.

Submitted 23 October, 2018; v1 submitted 20 August, 2018; originally announced August 2018.

arXiv:1806.01460 [pdf, other]

Dynamic Function-on-Scalars Regression

Authors: Daniel R. Kowal

Abstract: We develop a modeling framework for dynamic function-on-scalars regression, in which a time series of functional data is regressed on a time series of scalar predictors. The regression coefficient function for each predictor is allowed to be dynamic, which is essential for applications where the association between predictors and a (functional) response is time-varying. For greater modeling flexibility, we design a nonparametric reduced-rank functional data model with an unknown functional basis expansion, which is data-adaptive and, unlike most existing methods, modeled as unknown for appropriate uncertainty quantification. Within a Bayesian framework, we introduce shrinkage priors that simultaneously (i) regularize time-varying regression coefficient functions to be locally static, (ii) effectively remove unimportant predictor variables from the model, and (iii) reduce sensitivity to the dimension of the functional basis. A simulation analysis confirms the importance of these shrinkage priors, with notable improvements over existing alternatives. We develop a novel projection-based Gibbs sampling algorithm, which offers unrivaled computational scalability for fully Bayesian functional regression. We apply the proposed methodology (i) to analyze the time-varying impact of macroeconomic variables on the U.S. yield curve and (ii) to characterize the effects of socioeconomic and demographic predictors on age-specific fertility rates in South and Southeast Asia.

Submitted 23 October, 2018; v1 submitted 4 June, 2018; originally announced June 2018.

arXiv:1707.00763 [pdf, other]

doi 10.1111/rssb.12325

Dynamic Shrinkage Processes

Authors: Daniel R. Kowal, David S. Matteson, David Ruppert

Abstract: We propose a novel class of dynamic shrinkage processes for Bayesian time series and regression analysis. Building upon a global-local framework of prior construction, in which continuous scale mixtures of Gaussian distributions are employed for both desirable shrinkage properties and computational tractability, we model dependence among the local scale parameters. The resulting processes inherit the desirable shrinkage behavior of popular global-local priors, such as the horseshoe prior, but provide additional localized adaptivity, which is important for modeling time series data or regression functions with local features. We construct a computationally efficient Gibbs sampling algorithm based on a Pólya-Gamma scale mixture representation of the proposed process. Using dynamic shrinkage processes, we develop a Bayesian trend filtering model that produces more accurate estimates and tighter posterior credible intervals than competing methods, and apply the model for irregular curve-fitting of minute-by-minute Twitter CPU usage data. In addition, we develop an adaptive time-varying parameter regression model to assess the efficacy of the Fama-French five-factor asset pricing model with momentum added as a sixth factor. Our dynamic analysis of manufacturing and healthcare industry data shows that with the exception of the market risk, no other risk factors are significant except for brief periods.

Submitted 23 February, 2018; v1 submitted 3 July, 2017; originally announced July 2017.

arXiv:1603.02982 [pdf, other]

doi 10.1080/07350015.2017.1279058

Functional Autoregression for Sparsely Sampled Data

Authors: Daniel R. Kowal, David S. Matteson, David Ruppert

Abstract: We develop a hierarchical Gaussian process model for forecasting and inference of functional time series data. Unlike existing methods, our approach is especially suited for sparsely or irregularly sampled curves and for curves sampled with non-negligible measurement error. The latent process is dynamically modeled as a functional autoregression (FAR) with Gaussian process innovations. We propose a fully nonparametric dynamic functional factor model for the dynamic innovation process, with broader applicability and improved computational efficiency over standard Gaussian process models. We prove finite-sample forecasting and interpolation optimality properties of the proposed model, which remain valid with the Gaussian assumption relaxed. An efficient Gibbs sampling algorithm is developed for estimation, inference, and forecasting, with extensions for FAR(p) models with model averaging over the lag p. Extensive simulations demonstrate substantial improvements in forecasting performance and recovery of the autoregressive surface over competing methods, especially under sparse designs. We apply the proposed methods to forecast nominal and real yield curves using daily U.S. data. Real yields are observed more sparsely than nominal yields, yet the proposed methods are highly competitive in both settings.

Submitted 19 October, 2016; v1 submitted 9 March, 2016; originally announced March 2016.

arXiv:1411.0764 [pdf, other]

doi 10.1080/01621459.2016.1165104

A Bayesian Multivariate Functional Dynamic Linear Model

Authors: Daniel R. Kowal, David S. Matteson, David Ruppert

Abstract: We present a Bayesian approach for modeling multivariate, dependent functional data. To account for the three dominant structural features in the data--functional, time dependent, and multivariate components--we extend hierarchical dynamic linear models for multivariate time series to the functional data setting. We also develop Bayesian spline theory in a more general constrained optimization framework. The proposed methods identify a time-invariant functional basis for the functional observations, which is smooth and interpretable, and can be made common across multivariate observations for additional information sharing. The Bayesian framework permits joint estimation of the model parameters, provides exact inference (up to MCMC error) on specific parameters, and allows generalized dependence structures. Sampling from the posterior distribution is accomplished with an efficient Gibbs sampling algorithm. We illustrate the proposed framework with two applications: (1) multi-economy yield curve data from the recent global recession, and (2) local field potential brain signals in rats, for which we develop a multivariate functional time series approach for multivariate time-frequency analysis. Supplementary materials, including R code and the multi-economy yield curve data, are available online.

Submitted 5 August, 2015; v1 submitted 3 November, 2014; originally announced November 2014.

arXiv:0708.2751 [pdf, ps, other]

doi 10.1016/j.physc.2007.07.012

Scale dependent superconductor-insulator transition

Authors: D. Kowal, Z. Ovadyahu

Abstract: We study the disorder driven superconductor to insulator transition in amorphous films of high carrier-concentration indium-oxide. Using thin films with various sizes and aspect ratios we show that the `critical' sheet-resistance $R_{\small \square}$ depends systematically on sample geometry; superconductivity disappears when $R_{\small \square}$ exceeds $\approx 6$ k$\Omega$ in large samples. On the other hand, wide and sufficiently short samples of the same batch exhibit superconductivity (judged by conductivity versus temperature) up to $R_{\small \square}$ which is considerably larger. These results support the inhomogeneous scenario for the superconductor-insulator transition.

Submitted 20 August, 2007; originally announced August 2007.

Comments: Contribution to the proceedings of "Fluctuations and phase transitions in superconductors", Nazareth Ilit, Israel, June 10-14, 2007

Showing 1–29 of 29 results for author: Kowal, D