-
Probabilistic Emulation of the Community Radiative Transfer Model Using Machine Learning
Authors:
Lucas Howard,
Aneesh C. Subramanian,
Gregory Thompson,
Benjamin Johnson,
Thomas Auligne
Abstract:
The continuous improvement in weather forecast skill over the past several decades is largely due to the increasing quantity of available satellite observations and their assimilation into operational forecast systems. Assimilating these observations requires observation operators in the form of radiative transfer models. Significant efforts have been dedicated to enhancing the computational efficiency of these models. Computational cost remains a bottleneck, and a large fraction of available data goes unused for assimilation. To address this, we used machine learning to build an efficient neural network based probabilistic emulator of the Community Radiative Transfer Model (CRTM), applied to the GOES Advanced Baseline Imager. The trained NN emulator predicts brightness temperatures output by CRTM and the corresponding error with respect to CRTM. RMSE of the predicted brightness temperature is 0.3 K averaged across all channels. For clear sky conditions, the RMSE is less than 0.1 K for 9 out of 10 infrared channels. The error predictions are generally reliable across a wide range of conditions. Explainable AI methods demonstrate that the trained emulator reproduces the relevant physics, increasing confidence that the model will perform well when presented with new data.
Submitted 22 April, 2025;
originally announced April 2025.
-
Constructing optimal treatment length strategies to maximize quality-adjusted lifetimes
Authors:
Hao Sun,
Ashkan Ertefaie,
Luke Duttweiler,
Brent A. Johnson
Abstract:
Real-world clinical decision making is a complex process that involves balancing the risks and benefits of treatments. Quality-adjusted lifetime is a composite outcome that combines patient quantity and quality of life, making it an attractive outcome in clinical research. We propose methods for constructing optimal treatment length strategies to maximize this outcome. Existing methods for estimating optimal treatment strategies for survival outcomes cannot be applied to a quality-adjusted lifetime due to induced informative censoring. We propose a weighted estimating equation that adjusts for both confounding and informative censoring. We also propose a nonparametric estimator of the mean counterfactual quality-adjusted lifetime survival curve under a given treatment length strategy, where the weights are estimated using an undersmoothed sieve-based estimator. We show that the estimator is asymptotically linear and provide a data-dependent undersmoothing criterion. We apply our method to obtain the optimal time for percutaneous endoscopic gastrostomy insertion in patients with amyotrophic lateral sclerosis.
Submitted 6 December, 2024;
originally announced December 2024.
-
Quantile Slice Sampling
Authors:
Matthew J. Heiner,
Samuel B. Johnson,
Joshua R. Christensen,
David B. Dahl
Abstract:
We propose and demonstrate a novel, effective approach to slice sampling. Using the probability integral transform, we first generalize Neal's shrinkage algorithm, standardizing the procedure to an automatic and universal starting point: the unit interval. This enables the introduction of approximate (pseudo-) targets through the factorization used in importance sampling, a technique that popularized elliptical slice sampling, while still sampling from the correct target distribution. Accurate pseudo-targets can boost sampler efficiency by requiring fewer rejections and by reducing skewness in the transformed target. This strategy is effective when a natural, possibly crude approximation to the target exists. Alternatively, obtaining a marginal pseudo-target from initial samples provides an intuitive and automatic tuning procedure. We consider two metrics for evaluating the quality of approximation; each can be used as a criterion to find an optimal pseudo-target or as an interpretable diagnostic. We examine performance of the proposed sampler relative to other popular, easily implemented MCMC samplers on standard targets in isolation, and as steps within a Gibbs sampler in a Bayesian modeling context. We extend the transformation method to multivariate slice samplers and demonstrate with a constrained state-space model for which a readily available forward-backward algorithm provides the target approximation. Supplemental materials and accompanying R package qslice are available online.
Submitted 13 June, 2025; v1 submitted 17 July, 2024;
originally announced July 2024.
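The transformation described in the Quantile Slice Sampling abstract above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation or the `qslice` API: the target is assumed to be a standard normal (unnormalized), the pseudo-target is assumed to be a standard Cauchy, and all function names are hypothetical. The sampler maps the current state to the unit interval with the pseudo-target CDF, draws a slice level under the ratio of target to pseudo-target, and then applies shrinkage starting from (0, 1).

```python
import math
import random


def log_target(x):
    # Unnormalized log density of the target (assumed: standard normal).
    return -0.5 * x * x


# Pseudo-target (assumed for illustration): standard Cauchy.
def pseudo_cdf(x):
    return 0.5 + math.atan(x) / math.pi


def pseudo_inv_cdf(u):
    return math.tan(math.pi * (u - 0.5))


def pseudo_log_pdf(x):
    return -math.log(math.pi * (1.0 + x * x))


def log_h(psi):
    # Log of the transformed target on (0, 1):
    # target density divided by pseudo-target density at x = F^{-1}(psi).
    x = pseudo_inv_cdf(psi)
    return log_target(x) - pseudo_log_pdf(x)


def quantile_slice_step(x, rng):
    # Map the current state to the unit interval.
    psi0 = pseudo_cdf(x)
    # Slice level: log of a Uniform(0, h(psi0)) draw, computed in log space.
    log_v = log_h(psi0) - rng.expovariate(1.0)
    # Shrinkage always starts from the full unit interval.
    lo, hi = 0.0, 1.0
    while True:
        psi = rng.uniform(lo, hi)
        if log_h(psi) > log_v:
            # Accepted: map back to the original scale.
            return pseudo_inv_cdf(psi)
        # Rejected: shrink the interval toward the current point.
        if psi < psi0:
            lo = psi
        else:
            hi = psi


def sample(n, x0=0.0, seed=0):
    # Run n transitions of the sampler from x0 and return the draws.
    rng = random.Random(seed)
    draws, x = [], x0
    for _ in range(n):
        x = quantile_slice_step(x, rng)
        draws.append(x)
    return draws
```

Calling `sample(20000)` returns a list of correlated draws whose mean and variance should be close to 0 and 1 for this assumed target. A pseudo-target closer to the target flattens the transformed density on (0, 1), which is the mechanism the abstract credits for fewer rejections.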
-
Nonparametric estimation of a covariate-adjusted counterfactual treatment regimen response curve
Authors:
Ashkan Ertefaie,
Luke Duttweiler,
Brent A. Johnson,
Mark J. van der Laan
Abstract:
Flexible estimation of the mean outcome under a treatment regimen (i.e., value function) is the key step toward personalized medicine. We define our target parameter as a conditional value function given a set of baseline covariates, which we refer to as a stratum-based value function. We focus on a semiparametric class of decision rules and propose a sieve-based nonparametric covariate-adjusted regimen-response curve estimator within that class. Our work contributes in several ways. First, we propose an inverse probability weighted nonparametrically efficient estimator of the smoothed regimen-response curve function. We show that asymptotic linearity is achieved when the nuisance functions are undersmoothed sufficiently. Asymptotic and finite sample criteria for undersmoothing are proposed. Second, using Gaussian process theory, we propose simultaneous confidence intervals for the smoothed regimen-response curve function. Third, we provide consistency and convergence rate for the optimizer of the regimen-response curve estimator; this enables us to estimate an optimal semiparametric rule. The latter is important as the optimizer corresponds with the optimal dynamic treatment regimen. Some finite-sample properties are explored with simulations.
Submitted 27 September, 2023;
originally announced September 2023.
-
Gotta match 'em all: Solution diversification in graph matching matched filters
Authors:
Zhirui Li,
Ben Johnson,
Daniel L. Sussman,
Carey E. Priebe,
Vince Lyzinski
Abstract:
We present a novel approach for finding multiple noisily embedded template graphs in a very large background graph. Our method builds upon the graph-matching-matched-filter technique proposed in Sussman et al., with the discovery of multiple diverse matchings being achieved by iteratively penalizing a suitable node-pair similarity matrix in the matched filter algorithm. In addition, we propose algorithmic speed-ups that greatly enhance the scalability of our matched-filter approach. We present theoretical justification of our methodology in the setting of correlated Erdős-Rényi graphs, showing its ability to sequentially discover multiple templates under mild model conditions. We additionally demonstrate our method's utility via extensive experiments using both simulated models and real-world datasets, including human brain connectomes and a large transactional knowledge base.
Submitted 4 July, 2024; v1 submitted 25 August, 2023;
originally announced August 2023.
-
A non-parametric Bayesian approach for adjusting partial compliance in sequential decision making
Authors:
Indrabati Bhattacharya,
Brent A. Johnson,
William Artman,
Andrew Wilson,
Kevin G. Lynch,
James R. McKay,
Ashkan Ertefaie
Abstract:
Existing methods for estimating the mean outcome under a given dynamic treatment regime rely on intention-to-treat analyses, which estimate the effect of following a certain dynamic treatment regime regardless of the compliance behavior of patients. There are two major concerns with intention-to-treat analyses: (1) the estimated effects are often biased toward the null effect; (2) the results are not generalizable and reproducible due to the potential differential compliance behavior. These are particularly problematic in settings with a high level of non-compliance, such as substance use disorder treatments. Our work is motivated by the Adaptive Treatment for Alcohol and Cocaine Dependence study (ENGAGE), which is a multi-stage trial that aimed to construct optimal treatment strategies to engage patients in therapy. Due to the relatively low level of compliance in this trial, intention-to-treat analyses essentially estimate the effect of being randomized to a certain treatment sequence, which is not of interest. We fill this important gap by defining the target parameter as the mean outcome under a dynamic treatment regime given potential compliance strata. We propose a flexible non-parametric Bayesian approach, which consists of a Gaussian copula model for the potential compliances, and a Dirichlet process mixture model for the potential outcomes. Our simulations highlight the need for and usefulness of this approach in practice and illustrate the robustness of our estimator in non-linear and non-Gaussian settings.
Submitted 1 October, 2021;
originally announced October 2021.
-
AdaScale SGD: A User-Friendly Algorithm for Distributed Training
Authors:
Tyler B. Johnson,
Pulkit Agrawal,
Haijie Gu,
Carlos Guestrin
Abstract:
When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed scaling rules often degrade model quality. We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training. By continually adapting to the gradient's variance, AdaScale automatically achieves speed-ups for a wide range of batch sizes. We formally describe this quality with AdaScale's convergence bound, which maintains final objective values, even as batch sizes grow large and the number of iterations decreases. In empirical comparisons, AdaScale trains well beyond the batch size limits of popular "linear learning rate scaling" rules. This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks. AdaScale's qualitative behavior is similar to that of "warm-up" heuristics, but unlike warm-up, this behavior emerges naturally from a principled mechanism. The algorithm introduces negligible computational overhead and no new hyperparameters, making AdaScale an attractive choice for large-scale training in practice.
Submitted 9 July, 2020;
originally announced July 2020.
-
Adjusting for Partial Compliance in SMARTs: a Bayesian Semiparametric Approach
Authors:
William J. Artman,
Ashkan Ertefaie,
Kevin G. Lynch,
James R. McKay,
Brent A. Johnson
Abstract:
The cyclical and heterogeneous nature of many substance use disorders highlights the need to adapt the type or the dose of treatment to accommodate the specific and changing needs of individuals. The Adaptive Treatment for Alcohol and Cocaine Dependence study (ENGAGE) is a multi-stage randomized trial that aimed to provide longitudinal data for constructing treatment strategies to improve patients' engagement in therapy. However, the high rate of noncompliance and lack of analytic tools to account for noncompliance have impeded researchers from using the data to achieve the main goal of the trial. We overcome this issue by defining our target parameter as the mean outcome under different treatment strategies for given potential compliance strata and propose a Bayesian semiparametric model to estimate this quantity. While it adds substantial complexities to the analysis, one important feature of our work is that we consider partial rather than binary compliance classes which is more relevant in longitudinal studies. We assess the performance of our method through comprehensive simulation studies. We illustrate its application on the ENGAGE study and demonstrate that the optimal treatment strategy depends on compliance strata.
Submitted 20 May, 2020;
originally announced May 2020.
-
Propensity Process: a Balancing Functional
Authors:
Pallavi S. Mishra-Kalyani,
Brent A. Johnson,
Qi Long
Abstract:
In observational clinic registries, time to treatment is often of interest, but treatment can be given at any time during follow-up and there is no structure or intervention to ensure regular clinic visits for data collection. To address these challenges, we introduce the time-dependent propensity process as a generalization of the propensity score. We show that the propensity process balances the entire time-varying covariate history which cannot be achieved by existing propensity score methods and that treatment assignment is strongly ignorable conditional on the propensity process. We develop methods for estimating the propensity process using observed data and for matching based on the propensity process. We illustrate the propensity process method using the Emory Amyotrophic Lateral Sclerosis (ALS) Registry data.
Submitted 6 May, 2019;
originally announced May 2019.
-
Estimating the effect of PEG in ALS patients using observational data subject to censoring by death and missing outcomes
Authors:
Pallavi Mishra-Kalyani,
Brent A. Johnson,
Jonathan D. Glass,
Qi Long
Abstract:
Though they may offer valuable patient and disease information that is impossible to study in a randomized trial, clinical disease registries also require special care and attention in causal inference. Registry data may be incomplete, inconsistent, and subject to confounding. In this paper we aim to address several analytical issues in estimating treatment effects that plague clinical registries such as the Emory amyotrophic lateral sclerosis (ALS) Clinic Registry. When attempting to assess the effect of a surgical insertion of a percutaneous endoscopic gastrostomy (PEG) tube on body mass index (BMI) using the data from the ALS Clinic Registry, one must combat issues of confounding, censoring by death, and missing outcome data that have not been addressed in previous studies of PEG. We propose a causal inference framework for estimating the survivor average causal effect (SACE) of PEG, which incorporates a model for generalized propensity scores to correct for confounding by pre-treatment variables, a model for principal stratification to account for censoring by death, and a model for the missing data mechanism. Applying the proposed framework to the ALS Clinic Registry Data, our analysis shows that PEG has a positive SACE on BMI at month 18 post-baseline; our results likely offer more definitive answers regarding the effect of PEG than previous studies of PEG.
Submitted 6 May, 2019;
originally announced May 2019.
-
A Fast, Principled Working Set Algorithm for Exploiting Piecewise Linear Structure in Convex Problems
Authors:
Tyler B. Johnson,
Carlos Guestrin
Abstract:
By reducing optimization to a sequence of smaller subproblems, working set algorithms achieve fast convergence times for many machine learning problems. Despite such performance, working set implementations often resort to heuristics to determine subproblem size, makeup, and stopping criteria. We propose BlitzWS, a working set algorithm with useful theoretical guarantees. Our theory relates subproblem size and stopping criteria to the amount of progress during each iteration. This result motivates strategies for optimizing algorithmic parameters and discarding irrelevant components as BlitzWS progresses toward a solution. BlitzWS applies to many convex problems, including training L1-regularized models and support vector machines. We showcase this versatility with empirical comparisons, which demonstrate BlitzWS is indeed a fast algorithm.
Submitted 20 July, 2018;
originally announced July 2018.
-
Parametric inference for proportional (reverse) hazard rate models with nomination sampling
Authors:
Mohammad Nourmohammadi,
Mohammad Jafari Jozani,
Brad Johnson
Abstract:
Randomized nomination sampling (RNS) is a rank-based sampling technique which has been shown to be effective in several nonparametric studies involving environmental and ecological applications. In this paper, we investigate parametric inference using RNS design for estimating the unknown vector of parameters $\boldsymbol{\theta}$ in the proportional hazard rate and proportional reverse hazard rate models. We examine both maximum likelihood (ML) and method of moments (MM) methods and investigate the relative precision of our proposed RNS-based estimators compared with those based on simple random sampling (SRS). We introduce four types of RNS-based data as well as necessary EM algorithms for the ML estimation, and evaluate the performance of corresponding estimators in estimating $\boldsymbol{\theta}$. We show that there are always values of the design parameters on which RNS-based estimators are more efficient than those based on SRS. Inference based on imperfect ranking is also explored and it is shown that the improvement holds even when the ranking is imperfect. Theoretical results are augmented with numerical evaluations and a case study.
Submitted 16 December, 2015;
originally announced December 2015.
-
Bounds for maximum likelihood regular and non-regular DoA estimation in $K$-distributed noise
Authors:
Yuri Abramovich,
Olivier Besson,
Ben Johnson
Abstract:
We consider the problem of estimating the direction of arrival of a signal embedded in $K$-distributed noise, when secondary data which contains noise only are assumed to be available. Based upon a recent formula of the Fisher information matrix (FIM) for complex elliptically distributed data, we provide a simple expression of the FIM with the two data sets framework. In the specific case of $K$-distributed noise, we show that, under certain conditions, the FIM for the deterministic part of the model can be unbounded, while the FIM for the covariance part of the model is always bounded. In the general case of elliptical distributions, we provide a sufficient condition for unboundedness of the FIM. Accurate approximations of the FIM for $K$-distributed noise are also derived when it is bounded. Additionally, the maximum likelihood estimator of the signal DoA and an approximated version are derived, assuming known covariance matrix: the latter is then estimated from secondary data using a conventional regularization technique. When the FIM is unbounded, an analysis of the estimators reveals a rate of convergence much faster than the usual $T^{-1}$. Simulations illustrate the different behaviors of the estimators, depending on the FIM being bounded or not.
Submitted 14 May, 2015;
originally announced May 2015.
-
Risk prediction for prostate cancer recurrence through regularized estimation with simultaneous adjustment for nonlinear clinical effects
Authors:
Qi Long,
Matthias Chung,
Carlos S. Moreno,
Brent A. Johnson
Abstract:
In biomedical studies it is of substantial interest to develop risk prediction scores using high-dimensional data such as gene expression data for clinical endpoints that are subject to censoring. In the presence of well-established clinical risk factors, investigators often prefer a procedure that also adjusts for these clinical variables. While accelerated failure time (AFT) models are a useful tool for the analysis of censored outcome data, they assume that covariate effects on the logarithm of time-to-event are linear, which is often unrealistic in practice. We propose to build risk prediction scores through regularized rank estimation in partly linear AFT models, where high-dimensional data such as gene expression data are modeled linearly and important clinical variables are modeled nonlinearly using penalized regression splines. We show through simulation studies that our model has better operating characteristics compared to several existing models. In particular, we show that there is a nonnegligible effect on prediction as well as feature selection when nonlinear clinical effects are misspecified as linear. This work is motivated by a recent prostate cancer study, where investigators collected gene expression data along with established prognostic clinical variables and the primary endpoint is time to prostate cancer recurrence.
Submitted 23 November, 2011;
originally announced November 2011.
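One common way to write the partly linear AFT model described in the abstract above is given below. The notation is assumed here for illustration and is not taken from the paper: $T_i$ is the time to recurrence for subject $i$, $X_i$ is the high-dimensional gene expression vector with linear coefficients $\beta$, $Z_i$ collects the clinical variables, $\phi$ is an unspecified smooth function, and $\varepsilon_i$ is an error term.

```latex
\log T_i = X_i^{\top}\beta + \phi(Z_i) + \varepsilon_i
```

In this form, the standard AFT model corresponds to forcing $\phi$ to be linear, which is the misspecification the abstract warns about; the abstract's approach instead represents $\phi$ with penalized regression splines and estimates $\beta$ by regularized rank estimation.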
-
Survival ensembles by the sum of pairwise differences with application to lung cancer microarray studies
Authors:
Brent A. Johnson,
Qi Long
Abstract:
Lung cancer is among the most common cancers in the United States, in terms of incidence and mortality. In 2009, it is estimated that more than 150,000 deaths will result from lung cancer alone. Genetic information is an extremely valuable data source in characterizing the personal nature of cancer. Over the past several years, investigators have conducted numerous association studies where intensive genetic data is collected on relatively few patients compared to the numbers of gene predictors, with one scientific goal being to identify genetic features associated with cancer recurrence or survival. In this note, we propose high-dimensional survival analysis through a new application of boosting, a powerful tool in machine learning. Our approach is based on an accelerated lifetime model and minimizing the sum of pairwise differences in residuals. We apply our method to a recent microarray study of lung adenocarcinoma and find that our ensemble is composed of 19 genes, while a proportional hazards (PH) ensemble is composed of nine genes, a proper subset of the 19-gene panel. In one of our simulation scenarios, we demonstrate that PH boosting in a misspecified model tends to underfit and ignore moderately-sized covariate effects, on average. Diagnostic analyses suggest that the PH assumption is not satisfied in the microarray data and may explain, in part, the discrepancy in the sets of active coefficients. Our simulation studies and comparative data analyses demonstrate how statistical learning by PH models alone is insufficient.
Submitted 9 August, 2011;
originally announced August 2011.