-
Monotonic warpings for additive and deep Gaussian processes
Authors:
Steven D. Barnett,
Lauren J. Beesley,
Annie S. Booth,
Robert B. Gramacy,
Dave Osthus
Abstract:
Gaussian processes (GPs) are canonical as surrogates for computer experiments because they enjoy a degree of analytic tractability. But that breaks when the response surface is constrained, say to be monotonic. Here, we provide a mono-GP construction for a single input that is highly efficient even though the calculations are non-analytic. Key ingredients include transformation of a reference process and elliptical slice sampling. We then show how mono-GP may be deployed effectively in two ways. One is additive, extending monotonicity to more inputs; the other is as a prior on injective latent warping variables in a deep Gaussian process for (non-monotonic, multi-input) non-stationary surrogate modeling. We provide illustrative and benchmarking examples throughout, showing that our methods yield improved performance over the state-of-the-art on examples from those two classes of problems.
Submitted 10 March, 2025; v1 submitted 2 August, 2024;
originally announced August 2024.
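The abstract above names elliptical slice sampling as a key ingredient. As a rough illustration only, the following sketch implements the generic elliptical slice sampling update for a standard normal prior and applies it to a toy one-dimensional Gaussian likelihood. The function names (`elliptical_slice_step`, `sample_posterior`), the identity prior covariance, and the toy likelihood are assumptions made for this example; this is not the mono-GP construction or the authors' code.

```python
import math
import random


def elliptical_slice_step(f, log_lik, rng):
    """One elliptical slice sampling update, assuming a prior f ~ N(0, I)."""
    # Auxiliary draw from the same prior defines the ellipse through f.
    nu = [rng.gauss(0.0, 1.0) for _ in f]
    # Log-likelihood threshold (slice level); 1 - U lies in (0, 1].
    log_y = log_lik(f) + math.log(1.0 - rng.random())
    # Initial angle and bracket around it.
    theta = rng.uniform(0.0, 2.0 * math.pi)
    theta_min, theta_max = theta - 2.0 * math.pi, theta
    while True:
        proposal = [fi * math.cos(theta) + ni * math.sin(theta)
                    for fi, ni in zip(f, nu)]
        if log_lik(proposal) > log_y:
            return proposal
        # Rejected: shrink the bracket toward theta = 0 (the current state).
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = rng.uniform(theta_min, theta_max)


def sample_posterior(y, noise_var, n_samples, seed=0):
    """Toy problem (hypothetical): prior f ~ N(0, 1), observation y ~ N(f, noise_var)."""
    rng = random.Random(seed)

    def log_lik(f):
        return -0.5 * (y - f[0]) ** 2 / noise_var

    f, draws = [0.0], []
    for _ in range(n_samples):
        f = elliptical_slice_step(f, log_lik, rng)
        draws.append(f[0])
    return draws
```

In this conjugate toy problem the exact posterior is Gaussian with mean y / (1 + noise_var) and variance noise_var / (1 + noise_var), which gives a simple check on the sampler. A correlated Gaussian prior would require drawing `nu` from that prior instead.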
-
Mapping Incidence and Prevalence Peak Data for SIR Forecasting Applications
Authors:
Alexander C. Murph,
G. Casey Gibson,
Lauren J. Beesley,
Nishant Panda,
Lauren A. Castro,
Sara Y. Del Valle,
Dave Osthus
Abstract:
Infectious disease modeling and forecasting have played a key role in helping assess and respond to epidemics and pandemics. Recent work has leveraged data on disease peak infection and peak hospital incidence to fit compartmental models for the purpose of forecasting and describing the dynamics of a disease outbreak. Incorporating these data can greatly stabilize a compartmental model fit on early observations, where slight perturbations in the data may lead to model fits that project wildly unrealistic peak infection. We introduce a new method for incorporating historic data on the value and time of peak incidence of hospitalization into the fit for a Susceptible-Infectious-Recovered (SIR) model by formulating the relationship between an SIR model's starting parameters and peak incidence as a system of two equations that can be solved computationally. This approach is assessed for practicality in terms of accuracy and speed of computation via simulation. To exhibit the modeling potential, we update the Dirichlet-Beta State Space modeling framework to use hospital incidence data, as this framework was previously formulated to incorporate only data on total infections.
Submitted 23 April, 2024;
originally announced April 2024.
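To make the link between SIR starting parameters and peak quantities in the abstract above concrete, here is a minimal sketch. It uses the textbook closed form for peak prevalence in the classic SIR model and a simple fourth-order Runge-Kutta simulation to locate peak prevalence and peak incidence numerically. It does not reproduce the paper's two-equation system for hospitalization incidence; the function names, parameter values, and the definition of incidence as beta * S * I are assumptions for illustration.

```python
import math


def sir_peak_prevalence(beta, gamma, s0, i0):
    """Closed-form peak of I(t) in the classic SIR model (population fractions)."""
    r0 = beta / gamma
    if r0 * s0 <= 1.0:
        return i0  # infections never grow, so the peak is the starting value
    return i0 + s0 - (1.0 + math.log(r0 * s0)) / r0


def simulate_sir_peaks(beta, gamma, s0, i0, dt=0.01, t_max=200.0):
    """Integrate the SIR equations with RK4 and record the peaks on the time grid."""
    def deriv(s, i):
        new_infections = beta * s * i
        return -new_infections, new_infections - gamma * i

    s, i, t = s0, i0, 0.0
    peak_prev, t_prev = i0, 0.0
    peak_inc, t_inc = beta * s0 * i0, 0.0
    for _ in range(int(t_max / dt)):
        k1s, k1i = deriv(s, i)
        k2s, k2i = deriv(s + 0.5 * dt * k1s, i + 0.5 * dt * k1i)
        k3s, k3i = deriv(s + 0.5 * dt * k2s, i + 0.5 * dt * k2i)
        k4s, k4i = deriv(s + dt * k3s, i + dt * k3i)
        s += dt * (k1s + 2.0 * k2s + 2.0 * k3s + k4s) / 6.0
        i += dt * (k1i + 2.0 * k2i + 2.0 * k3i + k4i) / 6.0
        t += dt
        if i > peak_prev:
            peak_prev, t_prev = i, t
        incidence = beta * s * i  # new infections per unit time
        if incidence > peak_inc:
            peak_inc, t_inc = incidence, t
    return {
        "peak_prevalence": peak_prev,
        "t_peak_prevalence": t_prev,
        "peak_incidence": peak_inc,
        "t_peak_incidence": t_inc,
    }
```

For example, `simulate_sir_peaks(0.5, 0.25, 0.99, 0.01)` can be compared against `sir_peak_prevalence(0.5, 0.25, 0.99, 0.01)`. In this model the incidence peak occurs before the prevalence peak, because incidence is already falling by the time the susceptible fraction reaches gamma / beta.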
-
Moving Towards Automated Interstellar Boundary Explorer Data Selection with LOTUS
Authors:
Madeline A. Stricklin,
Lauren J. Beesley,
Brian P. Weaver,
Kelly R. Moran,
Dave Osthus,
Paul H. Janzen,
Grant David Meadors,
Daniel B. Reisenfeld
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite collects data on energetic neutral atoms (ENAs) that provide insight into the heliosphere, the region surrounding our solar system and separating it from interstellar space. IBEX collects information on these particles and on extraneous "background" particles. While IBEX records how and when the different particles are observed, it does not distinguish between heliospheric ENA particles and incidental background particles. To address this issue, all IBEX data has historically been manually labeled as "good" ENA data or "bad" background data. This manual culling process is incredibly time-intensive and contingent on subjective, manually-induced decision thresholds. In this paper, we develop a three-stage automated culling process, called LOTUS, that uses random forests to expedite and standardize the labeling process. In Stage 1, LOTUS uses random forests to obtain probabilities of observing true ENA particles on a per-observation basis. In Stage 2, LOTUS aggregates these probabilities to obtain predictions within small windows of time. In Stage 3, LOTUS refines these predictions. We compare the labels generated by LOTUS to those manually generated by the subject matter expert. We use various metrics to demonstrate that LOTUS is a useful automated process for supplementing and standardizing the manual culling process.
Submitted 21 March, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Empirical Validation of a New Data Product from the Interstellar Boundary Explorer Satellite
Authors:
Kelly R. Moran,
Dave Osthus,
Brian P. Weaver,
Lauren J. Beesley,
Madeline A. Stricklin,
Paul H. Janzen,
Daniel B. Reisenfeld
Abstract:
Since 2008, the Interstellar Boundary Explorer (IBEX) satellite has been gathering data on heliospheric energetic neutral atoms (ENAs) while being exposed to various sources of background noise, such as cosmic rays and solar energetic particles. The IBEX mission initially released only a qualified triple-coincidence (qABC) data product, which was designed to provide observations of ENAs free of background contamination. Further measurements revealed that the qABC data was in fact susceptible to contamination, having relatively low ENA counts and high background rates. Recently, the mission team considered releasing a certain qualified double-coincidence (qBC) data product, which has roughly twice the detection rate of the qABC data product. This paper presents a simulation-based validation of the new qBC data product against the already-released qABC data product. The results show that the qBCs can plausibly be said to share the same signal rate as the qABCs up to an average absolute deviation of 3.6%. Visual diagnostics at an orbit, map, and full mission level provide additional confirmation of signal rate coherence across data products. These approaches are generalizable to other scenarios in which one wishes to test whether multiple observations could plausibly be generated by some underlying shared signal.
Submitted 28 November, 2023;
originally announced November 2023.
-
Statistical methods for partitioning ribbon and globally-distributed flux using data from the Interstellar Boundary Explorer
Authors:
Lauren J. Beesley,
Dave Osthus,
Kelly R. Moran,
Madeline A. Ausdemore,
Grant David Meadors,
Paul H. Janzen,
Eric J. Zirnstein,
Brian P. Weaver,
Daniel B. Reisenfeld
Abstract:
NASA's Interstellar Boundary Explorer (IBEX) satellite collects data on energetic neutral atoms (ENAs) that can provide insight into the heliosphere boundary between our solar system and interstellar space. Using these data, scientists can construct maps of the ENA intensities (often expressed in terms of flux) observed in all directions. The ENA flux observed in these maps is believed to come from at least two distinct sources: one source which manifests as a ribbon of concentrated ENA flux and one source (or possibly several) that manifests as smoothly-varying globally-distributed flux. Each ENA source type and its corresponding ENA intensity map is of separate scientific interest. In this paper, we develop statistical methods for separating the total ENA intensity maps into two source-specific maps (ribbon and globally-distributed flux) and estimating corresponding uncertainty. Key advantages of the proposed method include enhanced model flexibility and improved propagation of estimation uncertainty. We evaluate the proposed methods on simulated data designed to mimic realistic data settings. We also propose new methods for estimating the center of the near-elliptical ribbon in the sky, which can be used in the future to study the location and variation of the local interstellar magnetic field.
Submitted 6 February, 2023;
originally announced February 2023.
-
Towards Improved Heliosphere Sky Map Estimation with Theseus
Authors:
Dave Osthus,
Brian P. Weaver,
Lauren J. Beesley,
Kelly R. Moran,
Madeline A. Ausdemore,
Eric J. Zirnstein,
Paul H. Janzen,
Daniel B. Reisenfeld
Abstract:
The Interstellar Boundary Explorer (IBEX) satellite has been in orbit since 2008 and detects energy-resolved energetic neutral atoms (ENAs) originating from the heliosphere. Different regions of the heliosphere generate ENAs at different rates. It is of scientific interest to take the data collected by IBEX and estimate spatial maps of heliospheric ENA rates (referred to as sky maps) at higher resolutions than before. These sky maps will subsequently be used to discern between competing theories of heliosphere properties, a comparison that is not currently possible. The data IBEX collects present challenges to sky map estimation. The two primary challenges are noisy and irregularly spaced data collection and the IBEX instrumentation's point spread function (PSF). In essence, the data collected by IBEX are both noisy and biased for the underlying sky map of inferential interest. In this paper, we present a two-stage sky map estimation procedure called Theseus. In Stage 1, Theseus estimates a blurred sky map from the noisy and irregularly spaced data using an ensemble approach that leverages projection pursuit regression and generalized additive models. In Stage 2, Theseus deblurs the sky map by deconvolving the PSF from the blurred map using regularization. Unblurred sky map uncertainties are computed via bootstrapping. We compare Theseus to a method closely related to the one operationally used today by the IBEX Science Operation Center (ISOC) on both simulated and real data. Theseus outperforms this ISOC-related method in nearly every considered metric on simulated data, indicating that Theseus is an improvement over the current state of the art.
Submitted 20 October, 2022;
originally announced October 2022.
-
Addressing delayed case reporting in infectious disease forecast modeling
Authors:
Lauren J Beesley,
Dave Osthus,
Sara Y Del Valle
Abstract:
Infectious disease forecasting is of great interest to the public health community and policymakers, since forecasts can provide insight into disease dynamics in the near future and inform interventions. Due to delays in case reporting, however, forecasting models may often underestimate the current and future disease burden.
In this paper, we propose a general framework for addressing reporting delay in disease forecasting efforts with the goal of improving forecasts. We propose strategies for leveraging either historical data on case reporting or external internet-based data to estimate the amount of reporting error. We then describe several approaches for adapting general forecasting pipelines to account for under- or over-reporting of cases. We apply these methods to address reporting delay in data on dengue fever cases in Puerto Rico from 1990 to 2009 and to reports of influenza-like illness (ILI) in the United States between 2010 and 2019. Through a simulation study, we compare method performance and evaluate robustness to assumption violations. Our results show that forecasting accuracy and prediction coverage almost always increase when correction methods are implemented to address reporting delay. Some of these methods required knowledge about the reporting error or high quality external data, which may not always be available. Provided alternatives include excluding recently-reported data and performing sensitivity analysis. This work provides intuition and guidance for handling delay in disease case reporting and may serve as a useful resource to inform practical infectious disease forecasting efforts.
Submitted 27 October, 2021;
originally announced October 2021.
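One simple way to picture the delay correction described in the abstract above is inverse-proportion scaling: estimate, from historical data, what fraction of eventual cases had been reported at each lag, then divide the most recent counts by those fractions before forecasting. The sketch below illustrates that generic idea only. It is not one of the paper's methods, and the function names and data layout are hypothetical.

```python
def estimate_reporting_ratio(initial_reports, final_counts):
    """Ratio of cases reported at a given lag to the eventually finalized counts.

    Both arguments are lists over the same historical weeks (assumed layout).
    A ratio below 1 indicates under-reporting at that lag.
    """
    total_final = sum(final_counts)
    if total_final <= 0:
        raise ValueError("final counts must sum to a positive number")
    return sum(initial_reports) / total_final


def inflate_recent_counts(reported, ratios_by_lag):
    """Divide the most recent counts by the estimated reporting ratio at each lag.

    reported: counts by week, oldest first.
    ratios_by_lag: ratio for lag 0 (latest week), lag 1, and so on.
    """
    if any(r <= 0.0 for r in ratios_by_lag):
        raise ValueError("reporting ratios must be positive")
    corrected = [float(c) for c in reported]
    for lag, ratio in enumerate(ratios_by_lag):
        idx = len(corrected) - 1 - lag
        if idx < 0:
            break  # fewer weeks of data than lags supplied
        corrected[idx] /= ratio
    return corrected
```

For instance, if the latest week is estimated to be 50% reported and the week before 90% reported, `inflate_recent_counts([100, 120, 90, 40], [0.5, 0.9])` scales only the last two weeks and leaves older, presumably complete, weeks unchanged. A point correction like this ignores uncertainty in the estimated ratios.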
-
Reconstructing magnetic deflections from sets of proton images using differential evolution
Authors:
Joseph M. Levesque,
Lauren J. Beesley
Abstract:
Proton imaging is a powerful technique for imaging electromagnetic fields within an experimental volume, in which spatial variations in proton fluence are a result of deflections to proton trajectories due to interaction with the fields. When deflections are large, proton trajectories can overlap, and this nonlinearity creates regions of greatly increased proton fluence on the image, known as caustics. The formation of caustics has been a persistent barrier to reconstructing the underlying fields from proton images. We have developed a new method for reconstructing the path-integrated magnetic fields which begins to address the problem posed by caustics. Our method uses multiple proton images of the same object, each image at a different energy, to fill in the information gaps and provide some uniqueness when reconstructing caustic features. We use a differential evolution algorithm to iteratively estimate the underlying deflection function which accurately reproduces the observed proton fluence at multiple proton energies simultaneously. We test this reconstruction method using synthetic proton images generated for three different, cylindrically symmetric field geometries at various field amplitudes and levels of proton statistics, and present reconstruction results from a set of experimental images. The method we propose requires no assumption of deflection linearity and can reliably solve for fields underlying linear, nonlinear, and caustic proton image features for the selected geometries, and is shown to be fairly robust to noise in the input proton intensity.
Submitted 9 September, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Multiple imputation with missing data indicators
Authors:
Lauren J Beesley,
Irina Bondarenko,
Michael R Elliott,
Allison W Kurian,
Steven J Katz,
Jeremy M G Taylor
Abstract:
Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation (SRMI), also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the SRMI imputation procedure to handle not-at-random missingness (MNAR) in the setting where missingness may depend on other variables that are also missing. We provide algebraic justification for several generalizations of standard SRMI using Taylor series and other approximations of the target imputation distribution under MNAR. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the MNAR missingness model and observed data. In a simulation study, we demonstrate that the proposed SRMI modifications result in reduced bias in the final analysis compared to standard SRMI, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
Submitted 2 March, 2021;
originally announced March 2021.
-
Accounting for not-at-random missingness through imputation stacking
Authors:
Lauren J Beesley,
Jeremy M G Taylor
Abstract:
Not-at-random missingness presents a challenge in addressing missing data in many health research applications. In this paper, we propose a new approach to account for not-at-random missingness after multiple imputation through weighted analysis of stacked multiple imputations. The weights are easily calculated as a function of the imputed data and assumptions about the not-at-random missingness. We demonstrate through simulation that the proposed method has excellent performance when the missingness model is correctly specified. In practice, the missingness mechanism will not be known. We show how we can use our approach in a sensitivity analysis framework to evaluate the robustness of model inference to different assumptions about the missingness mechanism, and we provide R package StackImpute to facilitate implementation as part of routine sensitivity analyses. We apply the proposed method to account for not-at-random missingness in human papillomavirus test results in a study of survival for patients diagnosed with oropharyngeal cancer.
Submitted 19 January, 2021;
originally announced January 2021.
-
Patient Recruitment Using Electronic Health Records Under Selection Bias: a Two-phase Sampling Framework
Authors:
Guanghao Zhang,
Lauren J. Beesley,
Bhramar Mukherjee,
Xu Shi
Abstract:
Electronic health records (EHRs) are increasingly recognized as a cost-effective resource for patient recruitment in clinical research. However, how to optimally select a cohort from millions of individuals to answer a scientific question of interest remains unclear. Consider a study to estimate the mean or mean difference of an expensive outcome. Inexpensive auxiliary covariates predictive of the outcome may often be available in patients' health records, presenting an opportunity to recruit patients selectively which may improve efficiency in downstream analyses. In this paper, we propose a two-phase sampling design that leverages available information on auxiliary covariates in EHR data. A key challenge in using EHR data for multi-phase sampling is the potential selection bias, because EHR data are not necessarily representative of the target population. Extending existing literature on two-phase sampling design, we derive an optimal two-phase sampling method that improves efficiency over random sampling while accounting for the potential selection bias in EHR data. We demonstrate the efficiency gain from our sampling design via simulation studies and an application to evaluating the prevalence of hypertension among US adults leveraging data from the Michigan Genomics Initiative, a longitudinal biorepository in Michigan Medicine.
Submitted 13 December, 2023; v1 submitted 12 November, 2020;
originally announced November 2020.
-
Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods
Authors:
Jiacong Du,
Jonathan Boss,
Peisong Han,
Lauren J Beesley,
Stephen A Goutman,
Stuart Batterman,
Eva L Feldman,
Bhramar Mukherjee
Abstract:
Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection is desired. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. In this paper we consider a general class of penalized objective functions which, by construction, force selection of the same variables across multiply-imputed datasets. By pooling objective functions across imputations, optimization is then performed jointly over all imputed datasets rather than separately for each dataset. We consider two objective function formulations that exist in the literature, which we will refer to as "stacked" and "grouped" objective functions. Building on existing work, we (a) derive and implement efficient cyclic coordinate descent and majorization-minimization optimization algorithms for both continuous and binary outcome data, (b) incorporate adaptive shrinkage penalties, (c) compare these methods through simulation, and (d) develop an R package miselect for easy implementation. Simulations demonstrate that the "stacked" objective function approaches tend to be more computationally efficient and have better estimation and selection properties. We apply these methods to data from the University of Michigan ALS Patients Repository (UMAPR) which aims to identify the association between persistent organic pollutants and ALS risk.
Submitted 16 March, 2020;
originally announced March 2020.